Introduction

The world is divided into countries with many differences between them. Gross Domestic Product (GDP) per capita is considered as one of the primary indicators that reflect how countries are classified based on economic development. It reflects the market value of all final goods and services produced in a country over a given period of time, making it the primary indicator for a country’s economic development. Nevertheless, there are various socio-economic indicators that aid in comprehending the level of poverty, lack of healthcare, lack of education, and inequality between men and women in parts of the world. Statistical surveys can be conducted to identify the countries that face severe issues regarding socio-economic problems. Graphical representations present the results in a comprehensible way and help draw clearer conclusions.

The aim of this paper is to demonstrate the vast diversity of countries worldwide by analyzing their socio-economic data, clustering them based on similarities and differences, and applying appropriate statistical methods to accurately characterize the resulting groups.

Socio-Economic Analysis of Countries Worldwide

Data Review

The dataset is compiled from the World Bank’s Health, Nutrition, Population Statistics Database, the Gender Statistics Database and the World Development Indicators Database. The dataset contains indicators on the socio-economic life of countries around the world as of 2019.

data <- read.xlsx(file.choose(), sheetName = "Data",
                  endRow = 218, header = TRUE)
rownames(data) <- data[,1]

Variable	Description
`POP`	Population, total
`POP_GR`	Population growth (annual %)
`URB`	Urban population (% of total population)
`DENS`	Population density (people per sq. km of land area)
`BRTH`	Birth rate, crude (per 1,000 people)
`DTH`	Death rate, crude (per 1,000 people)
`GDP`	GDP per capita (current US$)
`GDP_GR`	GDP growth (annual %)
`LEXP_M`	Life expectancy at birth, male (years)
`LEXP_F`	Life expectancy at birth, female (years)
`FERT`	Fertility rate, total (births per woman)
`MOR`	Mortality rate, infant (per 1,000 live births)
`HLTH`	Current health expenditure (% of GDP)
`SUIC_F`	Suicide mortality rate, female (per 100,000 female population)
`SUIC_M`	Suicide mortality rate, male (per 100,000 female population)
`IMM`	Immunization, DPT (% of children ages 12-23 months)
`ELEC`	Access to electricity (% of population)
`INT`	Individuals using the Internet (% of population)
`WTR`	People using at least basic drinking water services (% of population)
`WAGE_F`	Wage and salaried workers, female (% of female employment)
`WAGE_M`	Wage and salaried workers, male (% of male employment)
`LBR`	Labor force, total
`EDU_F`	School enrollment, secondary, female (% gross)
`EDU_M`	School enrollment, secondary, male (% gross)

Structure and Summary of the Data

The dataset contains 217 observations (countries) and 26 variables (socio-economic indicators).

Structure

str(data)

## 'data.frame':    217 obs. of  26 variables:
##  $ CNTRY : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ CODE  : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ POP   : num  38041757 2854191 43053054 55312 77146 ...
##  $ POP_GR: num  2.313 -0.426 1.934 -0.269 0.179 ...
##  $ URB   : num  25.8 61.2 73.2 87.1 88 ...
##  $ DENS  : num  58.3 104.2 18.1 276.6 164.1 ...
##  $ BRTH  : num  31.8 11.6 23.6 NA 7 ...
##  $ DTH   : num  6.29 8.08 4.72 NA 3.9 ...
##  $ GDP   : num  494 5396 3990 11715 40897 ...
##  $ GDP_GR: num  3.912 2.113 1 -0.488 2.016 ...
##  $ LEXP_F: num  66.4 80.2 78.1 NA NA ...
##  $ LEXP_M: num  63.4 77 75.7 NA NA ...
##  $ FERT  : num  4.32 1.6 2.99 NA NA ...
##  $ MORT  : num  46.4 8.6 20 NA 2.5 49.9 5.6 8.1 10.2 NA ...
##  $ HLTH  : num  13.24 NA 6.24 NA 6.71 ...
##  $ SUIC_F: num  3.6 2.7 1.8 NA NA 2.3 0.8 3.3 1.3 NA ...
##  $ SUIC_M: num  4.6 5.9 3.1 NA NA 10 0 13.7 5.6 NA ...
##  $ IMM   : num  72 99 91 NA 99 57 95 83 92 NA ...
##  $ ELEC  : num  97.7 100 99.5 NA 100 ...
##  $ INT   : num  17.6 68.6 57.9 NA NA ...
##  $ WTR   : num  72.4 94.8 94.2 99.8 100 ...
##  $ WAGE_F: num  8.21 47.87 74.52 NA NA ...
##  $ WAGE_M: num  20.3 44.2 66.4 NA NA ...
##  $ LBR   : num  10382034 1421853 12500369 NA NA ...
##  $ EDU_F : num  NA 95.7 NA NA NA ...
##  $ EDU_M : num  NA 94.6 NA NA NA ...

Summary

summary(data)

##     CNTRY               CODE                POP                POP_GR       
##  Length:217         Length:217         Min.   :1.076e+04   Min.   :-1.6095  
##  Class :character   Class :character   1st Qu.:7.779e+05   1st Qu.: 0.4265  
##  Mode  :character   Mode  :character   Median :6.661e+06   Median : 1.1058  
##                                        Mean   :3.545e+07   Mean   : 1.1996  
##                                        3rd Qu.:2.544e+07   3rd Qu.: 1.9556  
##                                        Max.   :1.408e+09   Max.   : 4.4687  
##                                        NA's   :1           NA's   :1        
##       URB              DENS                BRTH            DTH        
##  Min.   : 13.25   Min.   :    0.137   Min.   : 5.90   Min.   : 1.244  
##  1st Qu.: 42.54   1st Qu.:   38.177   1st Qu.:10.61   1st Qu.: 5.708  
##  Median : 62.03   Median :   92.872   Median :17.05   Median : 7.118  
##  Mean   : 61.21   Mean   :  446.165   Mean   :19.30   Mean   : 7.531  
##  3rd Qu.: 80.67   3rd Qu.:  233.011   3rd Qu.:26.87   3rd Qu.: 9.112  
##  Max.   :100.00   Max.   :19466.444   Max.   :45.64   Max.   :15.500  
##  NA's   :3        NA's   :1           NA's   :11      NA's   :12      
##       GDP               GDP_GR            LEXP_F          LEXP_M     
##  Min.   :   228.2   Min.   :-11.143   Min.   :55.49   Min.   :51.08  
##  1st Qu.:  2369.7   1st Qu.:  1.183   1st Qu.:69.83   1st Qu.:65.73  
##  Median :  7027.6   Median :  2.605   Median :77.30   Median :71.30  
##  Mean   : 18605.3   Mean   :  2.811   Mean   :75.44   Mean   :70.58  
##  3rd Qu.: 23330.8   3rd Qu.:  4.778   3rd Qu.:81.05   3rd Qu.:76.21  
##  Max.   :189487.1   Max.   : 19.536   Max.   :88.10   Max.   :82.60  
##  NA's   :12         NA's   :14        NA's   :18      NA's   :18     
##       FERT            MORT            HLTH            SUIC_F      
##  Min.   :0.918   Min.   : 1.60   Min.   : 1.525   Min.   : 0.300  
##  1st Qu.:1.659   1st Qu.: 5.70   1st Qu.: 4.444   1st Qu.: 2.000  
##  Median :2.181   Median :14.30   Median : 6.272   Median : 3.300  
##  Mean   :2.612   Mean   :20.97   Mean   : 6.595   Mean   : 4.351  
##  3rd Qu.:3.518   3rd Qu.:31.50   3rd Qu.: 8.202   3rd Qu.: 6.050  
##  Max.   :6.824   Max.   :82.40   Max.   :23.962   Max.   :30.100  
##  NA's   :17      NA's   :24      NA's   :31       NA's   :34      
##      SUIC_M            IMM             ELEC              INT       
##  Min.   :  0.00   Min.   :35.00   Min.   :  6.707   Min.   : 5.60  
##  1st Qu.:  6.70   1st Qu.:85.00   1st Qu.: 84.223   1st Qu.:35.94  
##  Median : 11.30   Median :93.00   Median : 99.996   Median :68.09  
##  Mean   : 14.65   Mean   :88.34   Mean   : 86.352   Mean   :60.99  
##  3rd Qu.: 17.95   3rd Qu.:97.00   3rd Qu.:100.000   3rd Qu.:83.17  
##  Max.   :116.00   Max.   :99.00   Max.   :100.000   Max.   :99.70  
##  NA's   :34       NA's   :24      NA's   :1         NA's   :61     
##       WTR             WAGE_F          WAGE_M           LBR           
##  Min.   : 38.21   Min.   : 1.05   Min.   : 7.28   Min.   :    31687  
##  1st Qu.: 84.02   1st Qu.:29.75   1st Qu.:43.90   1st Qu.:  1055847  
##  Median : 96.90   Median :66.91   Median :65.76   Median :  4155002  
##  Mean   : 89.10   Mean   :59.42   Mean   :60.59   Mean   : 18416926  
##  3rd Qu.: 99.87   3rd Qu.:88.17   3rd Qu.:80.45   3rd Qu.: 12534828  
##  Max.   :100.00   Max.   :99.47   Max.   :99.61   Max.   :800020955  
##  NA's   :16       NA's   :30      NA's   :30      NA's   :30         
##      EDU_F            EDU_M       
##  Min.   : 14.28   Min.   : 26.81  
##  1st Qu.: 86.40   1st Qu.: 83.70  
##  Median : 99.04   Median : 97.61  
##  Mean   : 94.38   Mean   : 93.26  
##  3rd Qu.:106.63   3rd Qu.:108.16  
##  Max.   :165.74   Max.   :146.90  
##  NA's   :101      NA's   :101

The data needs to be edited before statistical methods can be used to analyze it.

data <- na.omit(round(data[,-c(1,2)],2))
nrow(data)

## [1] 94

Observations with missing values in their respective rows are removed from the dataset, which results in the exclusion of 113 countries, leaving a total of 94 countries in the dataset.

Correlation Coefficients Between Variables

The correlation coefficients between the variables are calculated and a correlation matrix is constructed from them. The resulting matrix values are displayed.

cordata <- cor(data)
corrplot(cor(data), method = "color", tl.col = "black", tl.cex = 0.9)

The blue color denotes a positive correlation between variables, whereas the red color represents a negative correlation. The correlation matrix depicts several interesting relationships between the variables:

Access to electricity is strongly positively correlated with access to at least basic drinking water services. Access to electricity and drinking water are basic needs in society. Since both are essential requirements of a society and are being developed at a similar rate, they should be accessible at a similar level in countries.
Population has a very strong positive correlation with labor force. It is natural that countries with a larger population tend to have a higher number of people who are employed or looking for work.
The share of urban population is strongly positively correlated with access to the internet. The higher the share of urban population, the more developed the country’s economy is.
Birth rate, fertility rate and infant mortality are strongly negatively correlated with life expectancy, access to electricity, internet, at least basic drinking water services and the number of men and women who have completed secondary education. In general, countries with higher birth rates and infant mortality rates tend to have lower access to basic public services, such as access to electricity and internet.
Mortality is weakly positively correlated with access to secondary education and the number of employees. The stress levels experienced by individuals at schools or workplaces can contribute to higher mortality rates in countries.

Indeed, there are many other interesting relationships between variables. In order to identify clearly understandable relationships, the 24 variables can be reduced to a smaller number of factors.

Factor Extraction and Interpretation

Checking if Data Is Suitable for Factor Analysis

Bartlett’s Test of Sphericity

It is checked whether the data is suitable for factor analysis. Bartlett’s test of sphericity is applied.

cortest.bartlett(cor(data), n = nrow(data))

## $chisq
## [1] 3305.226
## 
## $p.value
## [1] 0
## 
## $df
## [1] 276

After performing the test, we found that the p-value is smaller than the chosen significance level α = 0.05. Therefore, the null hypothesis is rejected. The data is suitable for factor analysis.

The Kaiser-Meyer-Olkin Measure

The Kaiser-Meyer-Olkin measure is calculated.

KMO(data)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data)
## Overall MSA =  0.83
## MSA for each item = 
##    POP POP_GR    URB   DENS   BRTH    DTH    GDP GDP_GR LEXP_F LEXP_M   FERT 
##   0.42   0.79   0.90   0.28   0.84   0.52   0.90   0.71   0.88   0.83   0.85 
##   MORT   HLTH SUIC_F SUIC_M    IMM   ELEC    INT    WTR WAGE_F WAGE_M    LBR 
##   0.92   0.78   0.69   0.68   0.79   0.87   0.91   0.90   0.90   0.90   0.42 
##  EDU_F  EDU_M 
##   0.87   0.87

The obtained measure value of 0.83 indicates strong partial correlation. Thus, factor analysis can be applied.

The number of factors is determined by applying the Kaiser criterion and drawing a scree plot, where the x-axis indicates the number of factors, and the y-axis shows the corresponding eigenvalues.

ev <- eigen(cor(data))
plot(
  x = seq(1:length(ev$values)), y = ev$values, type = "o",
  xlab = "Factors", ylab = "Eigenvalues",
  main = "Scree plot");
abline(1, 0, col = "red", lty = 2)

The number of factors to be extracted is the number of points on the graph with test values greater than 1. Five factors have test values greater than 1, but we choose to extract only four factors for better interpretability. In this case, therefore, 4 factors are distinguished.

The correlation matrix of the original variables is subjected to factor analysis using the maximum likelihood method.

fakt <- factanal(covmat = cor(data), factors = 4,
                 method = "mle", lower = 0.01)
fakt$loadings

## 
## Loadings:
##        Factor1 Factor2 Factor3 Factor4
## POP    -0.106                   0.989 
## POP_GR -0.276  -0.571  -0.583         
## URB     0.793                         
## DENS           -0.343   0.134         
## BRTH   -0.755  -0.228  -0.605         
## DTH             0.900   0.260         
## GDP     0.735   0.153                 
## GDP_GR -0.438  -0.134                 
## LEXP_F  0.878           0.346         
## LEXP_M  0.888           0.264         
## FERT   -0.716          -0.679         
## MORT   -0.844  -0.113  -0.342         
## HLTH    0.517   0.370                 
## SUIC_F  0.320   0.540           0.257 
## SUIC_M  0.255   0.612   0.110         
## IMM     0.485           0.107         
## ELEC    0.687           0.492         
## INT     0.909   0.159   0.161         
## WTR     0.757           0.495         
## WAGE_F  0.855   0.253   0.182         
## WAGE_M  0.836   0.160   0.138  -0.128 
## LBR                             0.991 
## EDU_F   0.822   0.136   0.193         
## EDU_M   0.840   0.150   0.188         
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings     10.169   2.356   2.237   2.104
## Proportion Var   0.424   0.098   0.093   0.088
## Cumulative Var   0.424   0.522   0.615   0.703

The table indicates the proportion of the total variance of the variables explained by each factor. The cumulative variance indicates the proportion of the total variance of the explanatory variables explained by each factor. Factor 1 explains 42% of the total variance, Factor 2 10%, Factor 3 9% and Factor 4 9%. The identified factors explain 70% of the total variance of the variables, which indicates a reasonably good model.

The loadings of the factors in the expression of the original variables through the factors are given.

Loadings

Heatmap

pheatmap(fakt$loadings, angle_col = 45, cluster_cols = F)

Table

	Factor 1	Factor 2	Factor 3	Factor 4
Population, total	-0.106			0.989
Population growth (annual %)	-0.276	-0.571	-0.583
Urban population (% of total population)	0.793
Population density (people per sq. km of land area)		-0.343	0,134
Birth rate, crude (per 1,000 people)	-0.755	-0.228	-0.605
Death rate, crude (per 1,000 people)		0.900	0.260
GDP per capita (current US$)	0.735	0.153
GDP growth (annual %)	-0.438	-0.134
Life expectancy at birth, male (years)	0.878		0.346
Life expectancy at birth, female (years)	0.888		0.264
Fertility rate, total (births per woman)	-0.716		-0.679
Mortality rate, infant (per 1,000 live births)	-0.844	-0.113	-0.342
Current health expenditure (% of GDP)	0.517	0.370
Suicide mortality rate, female (per 100,000 female population)	0.320	0.540		0.257
Suicide mortality rate, male (per 100,000 female population)	0.255	0.612	0.110
Immunization, DPT (% of children ages 12-23 months)	0.485		0.107
Access to electricity (% of population)	0.687		0.492
Individuals using the Internet (% of population)	0.909	0.159	0.160
People using at least basic drinking water services (% of population)	0.757		0.495
Wage and salaried workers, female (% of female employment)	0.855	0.253	0.182
Wage and salaried workers, male (% of male employment)	0.836	0.160	0.138	-0.128
Labor force, total				0.991
School enrollment, secondary, female (% gross)	0.822	0.136	0.193
School enrollment, secondary, male (% gross)	0.840	0.150	0.188

These values represent the extent of the variability explained by each variable in each factor. The loadings of the factors are utilized for interpreting them as:

Factor 1 is reflective of the economic development of a country. It has the highest loadings for urban population, GDP per capita, life expectancy, number of employees, number of inhabitants with secondary education, access to the internet, access to basic drinking water services, electricity.
Factor 2 represents the overall quality of life in a country. It has the highest loadings on mortality and suicide.
Factor 3 reflects access to fundamental services in society. It has the highest loadings on access to electricity and basic drinking water services.
Factor 4 is indicative of the number of individuals working or looking for work in the country. It has the highest loadings on population and labor force.

The original data is plotted on the coordinate plane of the first two factors.

scores <- factanal(data, factors = 4, method = "mle", lower = 0.01,
                   scores = "regression")$scores
plot(scores[,1], scores[,2], xlab = "Factor 1", 
     ylab = "Factor 2", 
     main = "The projection of world countries onto the plane of the first two factors", 
     type = "n")
text(scores[,1], scores[,2], labels = rownames(scores), cex = 0.7)

The graph illustrates that the lowest quality of life is observed in European countries such as Latvia, Lithuania, and Bulgaria, whereas the highest quality of life is seen in the Maldives, Bahrain, and the United Arab Emirates. The countries in the Middle East (Bahrain, UAE, Oman, Saudi Arabia, etc.) exhibit strong economic development and a very high quality of life. Chad is distinctive from the rest in that it has both a very poor economic development and a very poor overall quality of life.

Country Clustering

As the units of measurement of the variables in the dataset differ, standardization of the data is performed. Once the units of measurement are standardised, cluster analysis can be performed.

data_sc <- scale(data)

Hierarchical Clustering

It is necessary to decide according to which similarity measure the distance between clusters will be evaluated. The agglomerative coefficient is calculated, which measures the amount of clustering structure found.

m <- c("single", "complete", "average", "ward")
names(m) <- c("single", "complete", "average", "ward")
ac <- function(x) {
  agnes(data_sc, method = x)$ac
}
round(map_dbl(m, ac),2)

##   single complete  average     ward 
##     0.79     0.83     0.80     0.93

Ward’s method identifies the strongest clustering structure of the four methods assessed, because values closer to 1 suggest strong clustering structure. The Ward’s method is applied to the standardized data, and a dendrogram is generated.

plot(hc, cex =0.5, col = "black")
rect.hclust(hc, k = 6, border = 2:7)

After reviewing the dendrogram, it was concluded that 6 clusters will be used in further analysis. The associated countries for each cluster are listed below.

Lists of Countries in Each Cluster

clust <- as.data.frame(cbind(names(cutree(hc, k = 6)),(cutree(hc, k = 6))))
rownames(clust) <- NULL

Cluster 1

clust1 <- clust[clust[,2]==1,]
colnames(clust1) <- c("Country", "Cluster")
clust1

##                     Country Cluster
## 1                   Armenia       1
## 3                   Austria       1
## 12                 Bulgaria       1
## 22                     Cuba       1
## 24           Czech Republic       1
## 31                  Estonia       1
## 33                   France       1
## 34                  Georgia       1
## 35                  Germany       1
## 37                   Greece       1
## 40                  Hungary       1
## 45                    Italy       1
## 47              South Korea       1
## 50                   Latvia       1
## 51                Lithuania       1
## 58                Mauritius       1
## 61               Montenegro       1
## 71                   Poland       1
## 72                 Portugal       1
## 73                  Romania       1
## 74                   Russia       1
## 78                   Serbia       1
## 79                 Slovakia       1
## 80                 Slovenia       1
## 81             South Africa       1
## 83              Saint Lucia       1
## 87                 Thailand       1
## 92 United States of America       1
## 93                  Uruguay       1

Cluster 2

clust2 <- clust[clust[,2]==2,]
colnames(clust2) <- c("Country", "Cluster")
clust2

##           Country Cluster
## 2       Australia       2
## 7         Belgium       2
## 16         Canada       2
## 18          Chile       2
## 20     Costa Rica       2
## 25        Denmark       2
## 32        Finland       2
## 41        Iceland       2
## 43        Ireland       2
## 52     Luxembourg       2
## 64    Netherlands       2
## 65    New Zealand       2
## 66         Norway       2
## 82          Spain       2
## 84         Sweden       2
## 85    Switzerland       2
## 91 United Kingdom       2

Cluster 3

clust3 <- clust[clust[,2]==3,]
colnames(clust3) <- c("Country", "Cluster")
clust3

##               Country Cluster
## 4          Azerbaijan       3
## 8              Belize       3
## 9             Bolivia       3
## 10             Brazil       3
## 15         Cape Verde       3
## 19           Colombia       3
## 27 Dominican Republic       3
## 28            Ecuador       3
## 29              Egypt       3
## 30        El Salvador       3
## 38          Guatemala       3
## 39           Honduras       3
## 46         Kazakhstan       3
## 48         Kyrgyzstan       3
## 59             Mexico       3
## 60           Mongolia       3
## 62            Morocco       3
## 69               Peru       3
## 70        Philippines       3
## 94         Uzbekistan       3

Cluster 4

clust4 <- clust[clust[,2]==4,]
colnames(clust4) <- c("Country", "Cluster")
clust4

##                 Country Cluster
## 5               Bahrain       4
## 11    Brunei Darussalam       4
## 23               Cyprus       4
## 44               Israel       4
## 54             Malaysia       4
## 55             Maldives       4
## 56                Malta       4
## 67                 Oman       4
## 76         Saudi Arabia       4
## 89               Turkey       4
## 90 United Arab Emirates       4

Cluster 5

clust5 <- clust[clust[,2]==5,]
colnames(clust5) <- c("Country", "Cluster")
clust5

##          Country Cluster
## 6     Bangladesh       5
## 13  Burkina Faso       5
## 14       Burundi       5
## 17          Chad       5
## 21 Cote d'Ivoire       5
## 26      Djibouti       5
## 36         Ghana       5
## 49          Laos       5
## 53        Malawi       5
## 57    Mauritania       5
## 63         Nepal       5
## 68      Pakistan       5
## 75        Rwanda       5
## 77       Senegal       5
## 86      Tanzania       5
## 88    East Timor       5

Cluster 6

clust6 <- clust[clust[,2]==6,]
colnames(clust6) <- c("Country", "Cluster")
clust6

##    Country Cluster
## 42   India       6

Country codes are added to the dataset.

malDF <- data.frame(country = c(
  # Country codes for cluster 1
  "ARM", "AUT", "BGR", "CUB", "CZE", "EST", "FRA",
  "GEO", "DEU", "GRC", "HUN", "ITA", "KOR", "LVA",
  "LTU", "MUS", "MNE", "POL", "PRT", "ROU", "RUS",
  "SRB", "SVK", "SVN", "ZAF", "LCA", "THA", "USA",
  "URY",
  # Country codes for cluster 2
  "AUS", "BEL", "CAN", "CHL", "CRI", "DNK", "FIN",
  "ISL", "IRL", "LUX", "NLD", "NZL", "NOR", "ESP",
  "SWE", "CHE", "GBR",
  # Country codes for cluster 3
  "AZE", "BLZ", "BOL", "BRA", "CPV", "COL", "DOM",
  "ECU", "EGY", "SLV", "GTM", "HND", "KAZ", "KGZ",
  "MEX", "MNG", "MAR", "PER", "PHL", "UZB",
  # Country codes for cluster 4
  "BHR", "BRN", "CYP", "ISR", "MYS", "MDV", "MLT",
  "OMN", "SAU", "TUR", "ARE",
  # Country codes for cluster 5
  "BGD", "BFA", "BDI", "TCD", "CIV", "DJI", "GHA",
  "LAO", "MWI", "MRT", "NPL", "PAK", "RWA", "SEN",
  "TZA", "TLS",
  # Country codes for cluster 6
  "IND"),
  Clusters = c(rep(1,29),rep(2,17),rep(3,20),
                 rep(4,11),rep(5,16),6))
clusterNames <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6")
malMap <- joinCountryData2Map(malDF, joinCode = "ISO3",
                              nameJoinColumn = "country")

## 94 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 149 codes from the map weren't represented in your data

The countries assigned to the clusters are shown on the map with different colours.

mapP <- mapCountryData(malMap, nameColumnToPlot = "Clusters",
               catMethod = "categorical", missingCountryCol = "lightgrey",
               colourPalette = c("red2", "seagreen", "navy",
                                 "royalblue", "mediumvioletred", 
                                 "yellow2"), addLegend=TRUE)

Countries that belong to different clusters are marked on the map. You can see how the clusters of countries are distributed geographically.

Cluster 1 comprises countries from diverse regions, including Africa, Asia, South America, Central America, Northern America, and primarily Europe.
Cluster 2 encompasses countries from different parts of Europe, South America, Central America, Northern America, and Australia.
Countries in Cluster 3 are distributed across Africa, Asia, South America, Central America, and Northern America.
Cluster 4 includes countries mainly located on the Asian continent.
Countries in Cluster 5 are located in Africa and Asia.
Cluster 6 consists of India alone.

K-means Clustering

A k-means clustering is also performed.

set.seed(1)
km <- kmeans(data_sc, centers = 6)

In this non-hierarchical cluster analysis method, the number of clusters is chosen in advance. Since the hierarchical cluster analysis resulted in 6 clusters in the printed dendrogram, the same number of clusters are used for k-means clustering. The associated countries for each cluster are listed below.

Lists of Countries in Each Cluster

clustt <- cbind(data, cluster=km$cluster)

Cluster 1

clustt1_var <- clustt[clustt$cluster==6,]
clustt1_var$cluster[clustt1_var$cluster == 6] <- 1
clustt1 <- as.data.frame(cbind(rownames(clustt1_var), clustt1_var[,25]))
colnames(clustt1) <- c("Country", "Cluster")
clustt1

##                     Country Cluster
## 1                 Australia       1
## 2                   Austria       1
## 3                   Belgium       1
## 4                    Canada       1
## 5                   Denmark       1
## 6                   Finland       1
## 7                    France       1
## 8                   Germany       1
## 9                   Iceland       1
## 10                  Ireland       1
## 11              South Korea       1
## 12               Luxembourg       1
## 13              Netherlands       1
## 14              New Zealand       1
## 15                   Norway       1
## 16                    Spain       1
## 17                   Sweden       1
## 18              Switzerland       1
## 19           United Kingdom       1
## 20 United States of America       1

Cluster 2

clustt2_var <- clustt[clustt$cluster==2,]
clustt2 <- as.data.frame(cbind(rownames(clustt2_var), clustt2_var[,25]))
colnames(clustt2) <- c("Country", "Cluster")
clustt2

##               Country Cluster
## 1             Armenia       2
## 2          Azerbaijan       2
## 3          Bangladesh       2
## 4              Belize       2
## 5             Bolivia       2
## 6              Brazil       2
## 7          Cape Verde       2
## 8            Colombia       2
## 9  Dominican Republic       2
## 10            Ecuador       2
## 11              Egypt       2
## 12        El Salvador       2
## 13          Guatemala       2
## 14           Honduras       2
## 15         Kyrgyzstan       2
## 16               Laos       2
## 17             Mexico       2
## 18           Mongolia       2
## 19            Morocco       2
## 20              Nepal       2
## 21               Peru       2
## 22        Philippines       2
## 23        Saint Lucia       2
## 24           Thailand       2
## 25         Uzbekistan       2

Cluster 3

clustt3_var <- clustt[clustt$cluster==4,]
clustt3_var$cluster[clustt3_var$cluster == 4] <- 3
clustt3 <- as.data.frame(cbind(rownames(clustt3_var), clustt3_var[,25]))
colnames(clustt3) <- c("Country", "Cluster")
clustt3

##           Country Cluster
## 1        Bulgaria       3
## 2            Cuba       3
## 3  Czech Republic       3
## 4         Estonia       3
## 5         Georgia       3
## 6          Greece       3
## 7         Hungary       3
## 8           Italy       3
## 9      Kazakhstan       3
## 10         Latvia       3
## 11      Lithuania       3
## 12      Mauritius       3
## 13     Montenegro       3
## 14         Poland       3
## 15       Portugal       3
## 16        Romania       3
## 17         Russia       3
## 18         Serbia       3
## 19       Slovakia       3
## 20       Slovenia       3
## 21   South Africa       3
## 22        Uruguay       3

Cluster 4

clustt4_var <- clustt[clustt$cluster==1,]
clustt4_var$cluster[clustt4_var$cluster == 1] <- 4
clustt4 <- as.data.frame(cbind(rownames(clustt4_var), clustt4_var[,25]))
colnames(clustt4) <- c("Country", "Cluster")
clustt4

##          Country Cluster
## 1   Burkina Faso       4
## 2        Burundi       4
## 3           Chad       4
## 4  Cote d'Ivoire       4
## 5       Djibouti       4
## 6          Ghana       4
## 7         Malawi       4
## 8     Mauritania       4
## 9       Pakistan       4
## 10        Rwanda       4
## 11       Senegal       4
## 12      Tanzania       4
## 13    East Timor       4

Cluster 5

clustt5_var <- clustt[clustt$cluster==5,]
clustt5 <- as.data.frame(cbind(rownames(clustt5_var), clustt5_var[,25]))
colnames(clustt5) <- c("Country", "Cluster")
clustt5

##                 Country Cluster
## 1               Bahrain       5
## 2     Brunei Darussalam       5
## 3                 Chile       5
## 4            Costa Rica       5
## 5                Cyprus       5
## 6                Israel       5
## 7              Malaysia       5
## 8              Maldives       5
## 9                 Malta       5
## 10                 Oman       5
## 11         Saudi Arabia       5
## 12               Turkey       5
## 13 United Arab Emirates       5

Cluster 6

clustt6_var <- clustt[clustt$cluster==3,]
clustt6_var$cluster[clustt6_var$cluster == 3] <- 6
clustt6 <- as.data.frame(cbind(rownames(clustt6_var), clustt6_var[,25]))
colnames(clustt6) <- c("Country", "Cluster")
clustt6

##   Country Cluster
## 1   India       6

Country codes are added to the dataset.

malDF <- data.frame(country = c(
  # Country codes for cluster 1
  "AUS", "AUT", "BEL", "CAN", "DNK", "FIN", "FRA",
  "DEU", "ISL", "IRL", "KOR", "LUX", "NLD", "NZL",
  "NOR", "ESP", "SWE", "CHE", "GBR", "USA",
  # Country codes for cluster 2
  "ARM", "AZE", "BGD", "BLZ", "BOL", "BRA", "CPV",
  "COL", "DOM", "ECU", "EGY", "SLV", "GTM", "HND",
  "KGZ", "LAO", "MEX", "MNG", "MAR", "NPL", "PER",
  "PHL", "LCA", "THA", "UZB",
  # Country codes for cluster 3
  "BGR", "CUB", "CZE", "EST", "GEO", "GRC", "HUN",
  "ITA", "KAZ", "LVA", "LTU", "MUS", "MNE", "POL",
  "PRT", "ROU", "RUS", "SRB", "SVK", "SVN", "ZAF",
  "URY",
  # Country codes for cluster 4
  "BHR", "BDI", "TCD", "DJI", "CIV", "GHA", "MWI",
  "MRT", "PAK", "RWA", "SEN", "TZA", "TLS",
  # Country codes for cluster 5
  "BHR", "BRN", "CHL", "CRI", "CYP", "ISR", "MYS",
  "MDV", "MLT", "OMN", "SAU", "TUR", "ARE",
  # Country codes for cluster 6
  "IND"),
  Clusters = c(rep("Cluster 1",nrow(clustt1)),rep("Cluster 2",nrow(clustt2)),rep("Cluster 3",nrow(clustt3)),
               rep("Cluster 4",nrow(clustt4)),rep("Cluster 5",nrow(clustt5)),"Cluster 6"))
clusterNames <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6")
malMap <- joinCountryData2Map(malDF, joinCode = "ISO3",
                              nameJoinColumn = "country")

## 94 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 150 codes from the map weren't represented in your data

The countries assigned to the clusters are shown on the map with different colours.

mapP <- mapCountryData(malMap, nameColumnToPlot = "Clusters",
                       catMethod = "categorical",missingCountryCol = "lightgrey",
                       colourPalette = c("red2", "seagreen", "navy",
                                         "royalblue", "mediumvioletred", 
                                         "yellow2"), addLegend=TRUE)

Countries that belong to different clusters are marked on the map. You can see how the clusters of countries are distributed geographically.

Cluster 1 comprises countries in North America, Australia, and primarily Europe.
Cluster 2 encompasses countries across several continents, including Asia, Africa, South America, Central America, and Northern America.
Cluster 3 includes countries distributed across Asia, the Baltic States, and mainly Eastern and Southern Europe. Additionally, one country from this cluster is located in Africa, one in Central America, and another in South America.
Cluster 4 includes countries primarily located on the African continent, with one country also in Asia.
Cluster 5 comprises countries in the Middle East, Central America, South America, and Asia.
Cluster 6 is exclusively composed of India.

Often, countries that are geographically close have more similarities in terms of historical background and existing culture. These similarities also affect the socio-economic indicators of the countries. It can be observed that in the map showing countries extracted using non-hierarchical cluster analysis, the countries belonging to each cluster are geographically slightly closer to each other than in the map showing countries extracted using hierarchical cluster analysis. For this reason, the countries extracted by the k-means method will be used for further interpretation and analysis.

The countries identified through the k-means method will be utilized for subsequent analysis and interpretation.

Cluster Interpretation of K-means Results

In the light of the factors highlighted above, the countries extracted by the k-means method are subject to cluster interpretation:

Cluster 1 (e.g., Australia, Sweden, South Korea,…) consists of highly developed countries with full access to basic services in society, but a lower overall quality of life. A large portion of the population is employed or looking for work.
Cluster 2 (e.g., Colombia, Nepal, Azerbaijan,…) comprises economically underdeveloped countries with good overall quality of life, but poor access to basic services in society.
Cluster 3 (e.g., Lithuania, Greece, Cuba,…) includes developed countries with full access to basic services in society, but a lower overall quality of life and lower employment rates.
Cluster 4 (e.g., Djibouti, Pakistan, Tanzania,…) contains very economically underdeveloped countries with average overall quality of life, but poor access to basic services in society and lower employment rates.
Cluster 5 (e.g., Cyprus, Oman, Chile,…) consists of developed countries with full access to basic services in society and good overall quality of life.
Cluster 6 (India) is an economically underdeveloped country with poor access to basic services in society, average overall quality of life, and exceptionally high employment rates.

Principal Component Identification and Interpretation

A principal component analysis is performed on standardized baseline data.

pca <- princomp(data_sc)
summary(pca, importance = TRUE)

## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     3.4661706 1.5876652 1.43132211 1.17114920 1.07814493
## Proportion of Variance 0.5059802 0.1061577 0.08627966 0.05776411 0.04895397
## Cumulative Proportion  0.5059802 0.6121379 0.69841756 0.75618168 0.80513565
##                            Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     0.91978272 0.86909633 0.79280023 0.73551691 0.62617326
## Proportion of Variance 0.03562904 0.03181043 0.02647044 0.02278342 0.01651288
## Cumulative Proportion  0.84076470 0.87257512 0.89904556 0.92182899 0.93834186
##                           Comp.11    Comp.12     Comp.13     Comp.14
## Standard deviation     0.60404575 0.56849465 0.416171361 0.381876678
## Proportion of Variance 0.01536644 0.01361089 0.007294206 0.006141577
## Cumulative Proportion  0.95370831 0.96731919 0.974613399 0.980754977
##                            Comp.15     Comp.16    Comp.17     Comp.18
## Standard deviation     0.353216832 0.294074338 0.27972160 0.245110833
## Proportion of Variance 0.005254319 0.003642067 0.00329523 0.002530222
## Cumulative Proportion  0.986009296 0.989651363 0.99294659 0.995476815
##                            Comp.19     Comp.20      Comp.21      Comp.22
## Standard deviation     0.192458862 0.179906694 0.1245330421 0.1141824713
## Proportion of Variance 0.001559946 0.001363102 0.0006531348 0.0005490761
## Cumulative Proportion  0.997036760 0.998399862 0.9990529971 0.9996020732
##                             Comp.23      Comp.24
## Standard deviation     0.0720568309 0.0652415321
## Proportion of Variance 0.0002186674 0.0001792594
## Cumulative Proportion  0.9998207406 1.0000000000

There are 24 variables in total, which means there are 24 components. The tables show the standard deviations (or matrix test values) of the first 10 components and the proportion of variance explained and the cumulative proportion of variance explained of each component. The standard deviations of the first five components are greater than 1, indicating that they explain more variance than individual variables. The remaining components have standard deviations smaller than 1 and account for less variance than the individual variables. Each successive component explains less of the variance. The first three explain 70% of the total variance.

Contribution of Variables to Dim-1

var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
plot(a)

Contribution of Variables to Dim-2

var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
plot(a)

Contribution of Variables to Dim-1-2

var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes=1:2, xtickslab.rt=90)
plot(a)

A graphical representation is generated for the initial two principal components, which account for the majority of the variance.

plot(pca$scores,type = "n", xlim = xlim, ylim = xlim, 
     main = "Scatterplot of First Two Principal Components")
text(pca$scores[,1], pca$scores[,2], labels = rownames(data_sc), cex = 0.5)

This scatterplot offers valuable insights into the relationships among countries, and provides a clear depiction of those that fall outside the expected patterns. Notably, the East African countries (Burundi, Tanzania, Malawi, Rwanda) and West African countries (Burkina Faso, Ivory Coast, Mauritania) are clustered together. Meanwhile, the Central African nation of Chad is an outlier, although it is grouped with other African countries. India is also a standout, most likely due to its exceedingly large population. Furthermore, clusters consisting of larger numbers of countries are formed, such as Colombia, Morocco, Ecuador, Cape Verde, Belize, and others. Additionally, European countries such as Germany, France, Austria, Portugal, and Estonia are also grouped together.

Hypothesis Testing for Clusters

Gender Suicide Rates Comparison

Within each cluster, a one-sided hypothesis is tested that the mean number of suicides among males is less than or equal to the mean number of suicides among females. The results of the t-test to test the hypotheses about the mean number of suicides between men and women are presented.

hyp_suic <- as.data.frame(rbind(t.test(clustt1_var$SUIC_M, clustt1_var$SUIC_F, 
                                       alternative="greater")$p.val,
                                t.test(clustt2_var$SUIC_M, clustt2_var$SUIC_F, 
                                       alternative="greater")$p.val,
                                t.test(clustt3_var$SUIC_M, clustt3_var$SUIC_F, 
                                       alternative="greater")$p.val,
                                t.test(clustt4_var$SUIC_M, clustt4_var$SUIC_F, 
                                       alternative="greater")$p.val,
                                t.test(clustt5_var$SUIC_M, clustt5_var$SUIC_F, 
                                       alternative="greater")$p.val))
rownames(hyp_suic) <- clusterNames[1:5]
colnames(hyp_suic) <- "p-value"
hyp_suic

##                p-value
## Cluster 1 1.013598e-08
## Cluster 2 9.300975e-07
## Cluster 3 1.204952e-08
## Cluster 4 1.492020e-06
## Cluster 5 1.267298e-05

It can be concluded that the null hypothesis is rejected in all presented clusters because the p-values of the test performed are less than the significance level α = 0.05. Therefore, the average number of male suicides is not lower than that of females in each of the presented clusters. In the sixth cluster, there is only one country - India. In this country, the number of male suicides (14.1 suicides per 100,000 male population) is also higher than that of females (11.1 suicides per 100,000 female population).

	p-value	Null hypothesis	Result interpretation
Cluster 1	< α	Rejected	The average number of male suicides is not lower than that of females
Cluster 2	< α	Rejected	The average number of male suicides is not lower than that of females
Cluster 3	< α	Rejected	The average number of male suicides is not lower than that of females
Cluster 4	< α	Rejected	The average number of male suicides is not lower than that of females
Cluster 5	< α	Rejected	The average number of male suicides is not lower than that of females
Cluster 6	-	-	The number of male suicides is higher than that of females

Gender Wage Distribution Comparison

Within each cluster, a two-sided hypothesis is tested that the average proportion of male employees who receive wages is equal to the average proportion of female employees who receive wages. The results of the t-test for testing hypotheses about the average proportion of male and female employees who receive wages are presented.

hyp_wage <- as.data.frame(rbind(t.test(clustt1_var$WAGE_M, clustt1_var$WAGE_F, 
                                       alternative="two.sided")$p.val,
                                t.test(clustt2_var$WAGE_M, clustt2_var$WAGE_F, 
                                       alternative="two.sided")$p.val,
                                t.test(clustt3_var$WAGE_M, clustt3_var$WAGE_F, 
                                       alternative="two.sided")$p.val,
                                t.test(clustt4_var$WAGE_M, clustt4_var$WAGE_F, 
                                       alternative="two.sided")$p.val,
                                t.test(clustt5_var$WAGE_M, clustt5_var$WAGE_F, 
                                       alternative="two.sided")$p.val))
rownames(hyp_wage) <- clusterNames[1:5]
colnames(hyp_wage) <- "p-value"
hyp_wage

##                p-value
## Cluster 1 0.0001058282
## Cluster 2 0.4836540070
## Cluster 3 0.0122322417
## Cluster 4 0.0212397381
## Cluster 5 0.8945942619

It can be concluded that the null hypothesis is rejected in the first, third, and fourth clusters because the p-values of the test performed are less than the significance level α = 0.05. Therefore, in these clusters, the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically significant. The first and third cluster countries are mostly located in Europe, and these clusters also include Russia, the United States, and Canada. These countries differ significantly from the countries in the fourth cluster, which are located throughout the African continent. In the second and fifth clusters, the p-values are greater than the significance level, so the null hypothesis is not rejected for these clusters. Therefore, in the second and fifth clusters, the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically insignificant. In India, the proportion of male employees differs from that of females by only one hundredth (male - 24.17% of total employment, female - 24.16% of total employment).

	p-value	Null hypothesis	Result interpretation
Cluster 1	< α	Rejected	the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically significant
Cluster 2	> α	Not rejected	the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically insignificant
Cluster 3	< α	Rejected	the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically significant
Cluster 4	< α	Rejected	the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically significant
Cluster 5	> α	Not rejected	the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically insignificant
Cluster 6	-	-	The difference in this indicator is not significant

Gender Lifespan Comparison

Within each cluster, a two-sided hypothesis is tested that the average expected lifespan of men is equal to the average expected lifespan of women. The results of the t-test for testing hypotheses about the average expected lifespan of men and women are presented.

hyp_lexp <- as.data.frame(rbind(t.test(clustt1_var$LEXP_M, clustt1_var$LEXP_F, 
                                       alternative="two.sided")$p.val,
                                t.test(clustt2_var$LEXP_M, clustt2_var$LEXP_F, 
                                       alternative="two.sided")$p.val,
                                t.test(clustt3_var$LEXP_M, clustt3_var$LEXP_F, 
                                       alternative="two.sided")$p.val,
                                t.test(clustt4_var$LEXP_M, clustt4_var$LEXP_F, 
                                       alternative="two.sided")$p.val,
                                t.test(clustt5_var$LEXP_M, clustt5_var$LEXP_F, 
                                       alternative="two.sided")$p.val))
rownames(hyp_lexp) <- clusterNames[1:5]
colnames(hyp_lexp) <- "p-value"
hyp_lexp

##                p-value
## Cluster 1 1.547953e-13
## Cluster 2 1.078688e-09
## Cluster 3 1.663873e-06
## Cluster 4 6.406870e-02
## Cluster 5 1.144371e-03

It can be concluded that the null hypothesis is rejected in the first, second, third, and fifth clusters since the test p-values are smaller than the significance level of α = 0.05. Therefore, in these clusters, the difference between the expected average lifespan of men and women is statistically significant. These clusters are located throughout the world, so gender inequality in terms of the average expected lifespan prevails in both men and women on every continent. In the fourth cluster, the p-value is greater than the significance level, so the null hypothesis cannot be rejected. Therefore, in the fourth cluster, the difference in the expected average lifespan between men and women is statistically insignificant. Although there are some African countries where these indicators are statistically significant, this inequality does not apply to most African countries. In India, the average life expectancy for men is 70.95 and for women, it is 68.46, so the difference in this indicator is not significant.

	p-value	Null hypothesis	Result interpretation
Cluster 1	< α	Rejected	The difference between the expected average lifespan of men and women is statistically significant
Cluster 2	< α	Rejected	The difference between the expected average lifespan of men and women is statistically significant
Cluster 3	< α	Rejected	The difference between the expected average lifespan of men and women is statistically significant
Cluster 4	> α	Not rejected	The difference in the expected average lifespan between men and women is statistically insignificant
Cluster 5	< α	Rejected	The difference between the expected average lifespan of men and women is statistically significant
Cluster 6	-	-	The difference in this indicator is not significant

Conclusion

World countries can be classified into distinct clusters or groups using a range of analytical methods, including factor analysis, hierarchical and non-hierarchical clustering analysis, and principal component analysis. Among these methods, the k-means approach is considered the most precise as it enables the geographical grouping of countries in each cluster to be visualized effectively through maps.

The geographical position of countries within each cluster appears to contribute to similarities and differences in economic and social factors. A factor analysis was conducted to identify four key factors that explain the variability in the data. The resulting factor interpretations were used to compare the clusters and their respective economic and social characteristics.

Through the application of statistical methods, certain countries were identified as distinctive when compared to others. The initial data was plotted on the coordinate plane of the first two factors, revealing Chad as an outlier. Chad is a Central African country characterized by underdeveloped economy, and high rates of mortality and suicide. Further analysis was carried out using principal component analysis, which generated a graph of the first two principal components that accounted for the majority of the variance in the data. The resulting graph not only highlighted Chad, but also India, which is situated in South Asia and is the second most populous country in the world, with a sizeable population.

Through the formulation and testing of hypotheses, it was observed that there exists a significant difference between men and women in terms of the average number of suicides across all clusters. Furthermore, there are discernible disparities between men and women when investigating the hypothesis that posits the equivalence of expected life expectancies between the two genders.

Full R code

To view the full R code for this project, click here.

Session Info:

sessionInfo()

## R version 4.3.0 (2023-04-21 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Europe/Vilnius
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.42       DT_0.27          lubridate_1.9.2  forcats_1.0.0   
##  [5] stringr_1.5.0    dplyr_1.1.2      readr_2.1.4      tidyr_1.3.0     
##  [9] tibble_3.2.1     tidyverse_2.0.0  rworldmap_1.3-6  sp_1.6-0        
## [13] psych_2.3.3      factoextra_1.0.7 ggplot2_3.4.2    corrplot_0.92   
## [17] pheatmap_1.0.12  cluster_2.1.4    purrr_1.0.1      xlsx_0.6.5      
## 
## loaded via a namespace (and not attached):
##  [1] dotCall64_1.0-2    gtable_0.3.3       spam_2.9-1         xfun_0.39         
##  [5] bslib_0.4.2        htmlwidgets_1.6.2  rstatix_0.7.2      ggrepel_0.9.3     
##  [9] rJava_1.0-6        lattice_0.21-8     tzdb_0.3.0         crosstalk_1.2.0   
## [13] maptools_1.1-6     vctrs_0.6.2        tools_4.3.0        generics_0.1.3    
## [17] parallel_4.3.0     fansi_1.0.4        highr_0.10         pkgconfig_2.0.3   
## [21] RColorBrewer_1.1-3 lifecycle_1.0.3    farver_2.1.1       compiler_4.3.0    
## [25] fields_14.1        munsell_0.5.0      mnormt_2.1.1       carData_3.0-5     
## [29] htmltools_0.5.5    maps_3.4.1         sass_0.4.5         yaml_2.3.7        
## [33] car_3.1-2          ggpubr_0.6.0       pillar_1.9.0       jquerylib_0.1.4   
## [37] ellipsis_0.3.2     cachem_1.0.7       viridis_0.6.2      abind_1.4-5       
## [41] nlme_3.1-162       tidyselect_1.2.0   digest_0.6.31      stringi_1.7.12    
## [45] labeling_0.4.2     fastmap_1.1.1      grid_4.3.0         colorspace_2.1-0  
## [49] cli_3.6.1          magrittr_2.0.3     xlsxjars_0.6.1     utf8_1.2.3        
## [53] broom_1.0.4        foreign_0.8-84     withr_2.5.0        backports_1.4.1   
## [57] scales_1.2.1       timechange_0.2.0   rmarkdown_2.21     gridExtra_2.3     
## [61] ggsignif_0.6.4     hms_1.1.3          evaluate_0.20      viridisLite_0.4.1 
## [65] rlang_1.1.0        Rcpp_1.0.10        glue_1.6.2         rstudioapi_0.14   
## [69] jsonlite_1.8.4     R6_2.5.1

An Examination of Social and Economic Data Across Nations Worldwide Using Statistical Analysis

Greta Pocevičiūtė

Introduction

Socio-Economic Analysis of Countries Worldwide

Data Review

Structure and Summary of the Data

Structure

Summary

Correlation Coefficients Between Variables

Factor Extraction and Interpretation

Checking if Data Is Suitable for Factor Analysis

Bartlett’s Test of Sphericity

The Kaiser-Meyer-Olkin Measure

Loadings

Heatmap

Table

Country Clustering

Hierarchical Clustering

Lists of Countries in Each Cluster

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

K-means Clustering

Lists of Countries in Each Cluster

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster Interpretation of K-means Results

Principal Component Identification and Interpretation

Contribution of Variables to Dim-1

Contribution of Variables to Dim-2

Contribution of Variables to Dim-1-2

Hypothesis Testing for Clusters

Gender Suicide Rates Comparison

Gender Wage Distribution Comparison

Gender Lifespan Comparison

Conclusion

Full R code

Session Info: