library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(stringr)
library(tibble)
library(ggplot2)
library(reshape2)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(pdp)
library(smacof)
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.5.2
## 
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
## 
##     rescale
## Loading required package: colorspace
## Loading required package: e1071
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:ggplot2':
## 
##     element
## 
## Attaching package: 'smacof'
## The following object is masked from 'package:psych':
## 
##     Procrustes
## The following object is masked from 'package:base':
## 
##     transform
library(cluster)

Introduction

This project applies dimension reduction techniques to a global socio‑economic dataset. The main goal is to explore similarity patterns between countries and to assess whether a low‑dimensional representation can meaningfully summarize complex, high‑dimensional information. Principal Component Analysis (PCA) is used as the primary method, with Multidimensional Scaling (MDS) serving as a complementary, distance‑based approach. The questions that I would like to answer with this project are:

Dataset description

The “Global Country Information 2023” dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons. This dataset is provided by EU funded platform Zenodo. Link: https://zenodo.org/records/8165229

world_data <- read.csv("world-data-2023.csv")

str(world_data)
## 'data.frame':    195 obs. of  35 variables:
##  $ Country                                  : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ Density..P.Km2.                          : chr  "60" "105" "18" "164" ...
##  $ Abbreviation                             : chr  "AF" "AL" "DZ" "AD" ...
##  $ Agricultural.Land....                    : chr  "58.10%" "43.10%" "17.40%" "40.00%" ...
##  $ Land.Area.Km2.                           : chr  "652,230" "28,748" "2,381,741" "468" ...
##  $ Armed.Forces.size                        : chr  "323,000" "9,000" "317,000" "" ...
##  $ Birth.Rate                               : num  32.5 11.8 24.3 7.2 40.7 ...
##  $ Calling.Code                             : int  93 355 213 376 244 1 54 374 61 43 ...
##  $ Capital.Major.City                       : chr  "Kabul" "Tirana" "Algiers" "Andorra la Vella" ...
##  $ Co2.Emissions                            : chr  "8,672" "4,536" "150,006" "469" ...
##  $ CPI                                      : chr  "149.9" "119.05" "151.36" "" ...
##  $ CPI.Change....                           : chr  "2.30%" "1.40%" "2.00%" "" ...
##  $ Currency.Code                            : chr  "AFN" "ALL" "DZD" "EUR" ...
##  $ Fertility.Rate                           : num  4.47 1.62 3.02 1.27 5.52 1.99 2.26 1.76 1.74 1.47 ...
##  $ Forested.Area....                        : chr  "2.10%" "28.10%" "0.80%" "34.00%" ...
##  $ Gasoline.Price                           : chr  "$0.70 " "$1.36 " "$0.28 " "$1.51 " ...
##  $ GDP                                      : chr  "$19,101,353,833 " "$15,278,077,447 " "$169,988,236,398 " "$3,154,057,987 " ...
##  $ Gross.primary.education.enrollment....   : chr  "104.00%" "107.00%" "109.90%" "106.40%" ...
##  $ Gross.tertiary.education.enrollment....  : chr  "9.70%" "55.00%" "51.40%" "" ...
##  $ Infant.mortality                         : num  47.9 7.8 20.1 2.7 51.6 5 8.8 11 3.1 2.9 ...
##  $ Largest.city                             : chr  "Kabul" "Tirana" "Algiers" "Andorra la Vella" ...
##  $ Life.expectancy                          : num  64.5 78.5 76.7 NA 60.8 76.9 76.5 74.9 82.7 81.6 ...
##  $ Maternal.mortality.ratio                 : int  638 15 112 NA 241 42 39 26 6 5 ...
##  $ Minimum.wage                             : chr  "$0.43 " "$1.12 " "$0.95 " "$6.63 " ...
##  $ Official.language                        : chr  "Pashto" "Albanian" "Arabic" "Catalan" ...
##  $ Out.of.pocket.health.expenditure         : chr  "78.40%" "56.90%" "28.10%" "36.40%" ...
##  $ Physicians.per.thousand                  : num  0.28 1.2 1.72 3.33 0.21 2.76 3.96 4.4 3.68 5.17 ...
##  $ Population                               : chr  "38,041,754" "2,854,191" "43,053,054" "77,142" ...
##  $ Population..Labor.force.participation....: chr  "48.90%" "55.70%" "41.20%" "" ...
##  $ Tax.revenue....                          : chr  "9.30%" "18.60%" "37.20%" "" ...
##  $ Total.tax.rate                           : chr  "71.40%" "36.60%" "66.10%" "" ...
##  $ Unemployment.rate                        : chr  "11.12%" "12.33%" "11.70%" "" ...
##  $ Urban_population                         : chr  "9,797,273" "1,747,593" "31,510,100" "67,873" ...
##  $ Latitude                                 : num  33.9 41.2 28 42.5 -11.2 ...
##  $ Longitude                                : num  67.71 20.17 1.66 1.52 17.87 ...

Data Preparation

The dataset contains a mixture of numeric variables and numeric values stored as character strings (e.g. currency symbols, percentages). These need to be cleaned and converted before any multivariate analysis.

clean_numeric <- function(x) {
    x <- str_replace_all(x, ",", "")
    x <- str_replace_all(x, "%", "")
    x <- str_replace_all(x, "\\$", "")
    x <- str_trim(x)
    x[x == ""] <- NA
    as.numeric(x)
}

world_data_clean <- world_data %>%
  mutate(across(where(is.character), clean_numeric))
## Warning: There were 6 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(where(is.character), clean_numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
world_data_clean <- world_data_clean %>%
  select(where(is.numeric)) %>%
  select(where(~ mean(is.na(.)) < 0.5)) %>%
  select(-Latitude, -Longitude, -Calling.Code)

The cleaning function standardizes all numeric information by removing non‑numeric characters and converting values to numeric type. NA introduced warning is due to the fact that some of the values were empty strings, while others were “numeric” strings such as “$9.8”, which had to be turned into numeric. Therefore, the warning does not introduce problems.

Variables with excessive missingness are excluded to avoid unreliable imputations. Geographic coordinates and calling codes are removed because they do not represent socio‑economic characteristics relevant to the analysis.

#Adding for analysis
world_data_clean$Country <- world_data$Country

colSums(is.na(world_data_clean))
##                           Density..P.Km2. 
##                                         0 
##                     Agricultural.Land.... 
##                                         7 
##                            Land.Area.Km2. 
##                                         1 
##                         Armed.Forces.size 
##                                        24 
##                                Birth.Rate 
##                                         6 
##                             Co2.Emissions 
##                                         7 
##                                       CPI 
##                                        17 
##                            CPI.Change.... 
##                                        16 
##                            Fertility.Rate 
##                                         7 
##                         Forested.Area.... 
##                                         7 
##                            Gasoline.Price 
##                                        20 
##                                       GDP 
##                                         2 
##    Gross.primary.education.enrollment.... 
##                                         7 
##   Gross.tertiary.education.enrollment.... 
##                                        12 
##                          Infant.mortality 
##                                         6 
##                           Life.expectancy 
##                                         8 
##                  Maternal.mortality.ratio 
##                                        14 
##                              Minimum.wage 
##                                        45 
##          Out.of.pocket.health.expenditure 
##                                         7 
##                   Physicians.per.thousand 
##                                         7 
##                                Population 
##                                         1 
## Population..Labor.force.participation.... 
##                                        19 
##                           Tax.revenue.... 
##                                        26 
##                            Total.tax.rate 
##                                        12 
##                         Unemployment.rate 
##                                        19 
##                          Urban_population 
##                                         5 
##                                   Country 
##                                         0
rowMeans(is.na(world_data_clean)) * 100
##   [1]  0.000000  0.000000  0.000000 37.037037  0.000000  7.407407  0.000000
##   [8]  0.000000  0.000000  3.703704  0.000000  0.000000  3.703704  0.000000
##  [15]  0.000000  3.703704  0.000000  0.000000  0.000000  0.000000  3.703704
##  [22]  3.703704  0.000000  0.000000  7.407407  0.000000  0.000000  3.703704
##  [29]  0.000000  0.000000  3.703704  0.000000  0.000000  0.000000  3.703704
##  [36]  0.000000  0.000000  0.000000 11.111111  0.000000  0.000000  0.000000
##  [43] 18.518519  3.703704  0.000000  0.000000  3.703704  7.407407 18.518519
##  [50]  0.000000  3.703704  3.703704  0.000000  3.703704 14.814815  0.000000
##  [57] 74.074074  3.703704  0.000000  3.703704  0.000000  0.000000  0.000000
##  [64]  0.000000  0.000000  0.000000  0.000000 14.814815  0.000000  3.703704
##  [71]  3.703704  3.703704  3.703704 85.185185  0.000000  0.000000  3.703704
##  [78]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  3.703704
##  [85]  0.000000  0.000000  0.000000  0.000000  0.000000 22.222222  0.000000
##  [92]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  3.703704
##  [99] 40.740741  0.000000  0.000000  0.000000  0.000000  0.000000  3.703704
## [106]  0.000000  0.000000 22.222222  3.703704  0.000000  0.000000 18.518519
## [113]  0.000000 55.555556  0.000000  3.703704  0.000000  0.000000  0.000000
## [120]  3.703704 81.481481  0.000000  0.000000  0.000000  0.000000  0.000000
## [127]  0.000000 22.222222 77.777778  3.703704  0.000000  0.000000 18.518519
## [134] 92.592593  3.703704  0.000000  0.000000  0.000000  0.000000  0.000000
## [141]  0.000000  3.703704  0.000000  0.000000  3.703704 18.518519  7.407407
## [148]  7.407407  3.703704 25.925926  7.407407  0.000000  0.000000  0.000000
## [155] 11.111111  0.000000  3.703704  0.000000  0.000000 11.111111 18.518519
## [162]  3.703704  0.000000 22.222222  0.000000  0.000000  0.000000  3.703704
## [169]  3.703704  3.703704  0.000000  0.000000  0.000000  0.000000  0.000000
## [176]  0.000000 11.111111  0.000000  0.000000  0.000000 14.814815 51.851852
## [183]  0.000000  0.000000  3.703704  0.000000  0.000000  0.000000  7.407407
## [190]  3.703704  3.703704  0.000000  7.407407  0.000000  3.703704
world_data_clean[rowMeans(is.na(world_data_clean)) > 0.15, 27]
##  [1] "Andorra"                        "Cuba"                          
##  [3] "Dominica"                       "Eswatini"                      
##  [5] "Vatican City"                   "Kiribati"                      
##  [7] "Liechtenstein"                  "Marshall Islands"              
##  [9] "Federated States of Micronesia" "Monaco"                        
## [11] "Nauru"                          "North Korea"                   
## [13] "North Macedonia"                "Palau"                         
## [15] "Palestinian National Authority" "Saint Kitts and Nevis"         
## [17] "San Marino"                     "Somalia"                       
## [19] "South Sudan"                    "Tuvalu"
world_data_clean <- world_data_clean[rowMeans(is.na(world_data_clean)) < 0.15, ]
colSums(is.na(world_data_clean))
##                           Density..P.Km2. 
##                                         0 
##                     Agricultural.Land.... 
##                                         0 
##                            Land.Area.Km2. 
##                                         0 
##                         Armed.Forces.size 
##                                         8 
##                                Birth.Rate 
##                                         0 
##                             Co2.Emissions 
##                                         0 
##                                       CPI 
##                                         4 
##                            CPI.Change.... 
##                                         3 
##                            Fertility.Rate 
##                                         0 
##                         Forested.Area.... 
##                                         0 
##                            Gasoline.Price 
##                                         8 
##                                       GDP 
##                                         0 
##    Gross.primary.education.enrollment.... 
##                                         1 
##   Gross.tertiary.education.enrollment.... 
##                                         2 
##                          Infant.mortality 
##                                         0 
##                           Life.expectancy 
##                                         0 
##                  Maternal.mortality.ratio 
##                                         0 
##                              Minimum.wage 
##                                        32 
##          Out.of.pocket.health.expenditure 
##                                         0 
##                   Physicians.per.thousand 
##                                         0 
##                                Population 
##                                         0 
## Population..Labor.force.participation.... 
##                                         3 
##                           Tax.revenue.... 
##                                        15 
##                            Total.tax.rate 
##                                         1 
##                         Unemployment.rate 
##                                         3 
##                          Urban_population 
##                                         0 
##                                   Country 
##                                         0

After seeing that most of the countries that have missing data are with NAs in more than 15% of the variables, I decided to remove them, since >4,5 variables missing is a problem. This way, first stage of cleaning is finished and we got rid of 20 countries that had not enough data, which is not a big loss. Moreover, analysis showed those countries are countries with no real data at all/microcountries/islands.

world_data_clean[is.na(world_data_clean$Minimum.wage),27]
##  [1] "Austria"              "Bahrain"              "Brunei"              
##  [4] "Burundi"              "Cambodia"             "Cyprus"              
##  [7] "Denmark"              "Djibouti"             "Egypt"               
## [10] "Eritrea"              "Ethiopia"             "Finland"             
## [13] "Grenada"              "Guinea"               "Iceland"             
## [16] "Italy"                "Maldives"             "Namibia"             
## [19] "Norway"               "Qatar"                "Rwanda"              
## [22] "Saint Lucia"          "S�����������"         "Singapore"           
## [25] "South Africa"         "Suriname"             "Sweden"              
## [28] "Switzerland"          "Tonga"                "United Arab Emirates"
## [31] "Yemen"                "Zimbabwe"
world_data_clean <- world_data_clean[!is.na(world_data_clean$Minimum.wage), ]

world_data_clean[is.na(world_data_clean$Tax.revenue....),27]
##  [1] "Chad"         "Comoros"      "Ecuador"      "Guyana"       "Haiti"       
##  [6] "Libya"        "Mauritania"   "Montenegro"   "Panama"       "Turkmenistan"
## [11] "Venezuela"
world_data_clean <- world_data_clean[!is.na(world_data_clean$Tax.revenue....), ]

Above, we had to pay attention to NAs in Armed forces, Minimum wage and Tax revenue specifically, because these variables had many NAs and can not be properly replaced. For Min.wage, unfortunately, the only way was to remove the countries with NAs, but luckily most of those countries were micronations. Countries with missing Tax.revenue were all micronations or closed countries, which are hard to get information about, therefore they were also removed.

world_data_clean[is.na(world_data_clean$Armed.Forces.size),27]
## [1] "Saint Vincent and the Grenadines" "Samoa"                           
## [3] "Solomon Islands"                  "Vanuatu"
world_data_clean$Armed.Forces.size[is.na(world_data_clean$Armed.Forces.size)] <- 0

colSums(is.na(world_data_clean))
##                           Density..P.Km2. 
##                                         0 
##                     Agricultural.Land.... 
##                                         0 
##                            Land.Area.Km2. 
##                                         0 
##                         Armed.Forces.size 
##                                         0 
##                                Birth.Rate 
##                                         0 
##                             Co2.Emissions 
##                                         0 
##                                       CPI 
##                                         2 
##                            CPI.Change.... 
##                                         1 
##                            Fertility.Rate 
##                                         0 
##                         Forested.Area.... 
##                                         0 
##                            Gasoline.Price 
##                                         5 
##                                       GDP 
##                                         0 
##    Gross.primary.education.enrollment.... 
##                                         1 
##   Gross.tertiary.education.enrollment.... 
##                                         2 
##                          Infant.mortality 
##                                         0 
##                           Life.expectancy 
##                                         0 
##                  Maternal.mortality.ratio 
##                                         0 
##                              Minimum.wage 
##                                         0 
##          Out.of.pocket.health.expenditure 
##                                         0 
##                   Physicians.per.thousand 
##                                         0 
##                                Population 
##                                         0 
## Population..Labor.force.participation.... 
##                                         2 
##                           Tax.revenue.... 
##                                         0 
##                            Total.tax.rate 
##                                         0 
##                         Unemployment.rate 
##                                         2 
##                          Urban_population 
##                                         0 
##                                   Country 
##                                         0

For Armed.Forces.size, it gets more interesting, because looking at the list of countries with missing Armed.Forces, we can conclude that there is no army in those small island-countries, and set those values to 0.

After seeing that we managed to reduce the number of NAs significantly, the only variables that are left with some NAs are those that we can simply replace with means, without removing important information or deleting the whole observations. Those variables are: Gasoline.price, Cpi.Change, Unemployment.rate, Gross.tertiary.education, and Gross.primary.education. We replace all these variables with means and have no missing values after that.

world_data_clean <- as.data.frame(
  lapply(world_data_clean, function(x) {
    replace(x, is.na(x), mean(x, na.rm = TRUE))
  })
)
## Warning in mean.default(x, na.rm = TRUE): argument is not numeric or logical:
## returning NA
colSums(is.na(world_data_clean))
##                           Density..P.Km2. 
##                                         0 
##                     Agricultural.Land.... 
##                                         0 
##                            Land.Area.Km2. 
##                                         0 
##                         Armed.Forces.size 
##                                         0 
##                                Birth.Rate 
##                                         0 
##                             Co2.Emissions 
##                                         0 
##                                       CPI 
##                                         0 
##                            CPI.Change.... 
##                                         0 
##                            Fertility.Rate 
##                                         0 
##                         Forested.Area.... 
##                                         0 
##                            Gasoline.Price 
##                                         0 
##                                       GDP 
##                                         0 
##    Gross.primary.education.enrollment.... 
##                                         0 
##   Gross.tertiary.education.enrollment.... 
##                                         0 
##                          Infant.mortality 
##                                         0 
##                           Life.expectancy 
##                                         0 
##                  Maternal.mortality.ratio 
##                                         0 
##                              Minimum.wage 
##                                         0 
##          Out.of.pocket.health.expenditure 
##                                         0 
##                   Physicians.per.thousand 
##                                         0 
##                                Population 
##                                         0 
## Population..Labor.force.participation.... 
##                                         0 
##                           Tax.revenue.... 
##                                         0 
##                            Total.tax.rate 
##                                         0 
##                         Unemployment.rate 
##                                         0 
##                          Urban_population 
##                                         0 
##                                   Country 
##                                         0

Variables Distribution

worldPlot1 <- melt(world_data_clean[, 1:12])
## No id variables; using all as measure variables
ggplot(worldPlot1) +
  geom_histogram(aes(x = value), bins = 30, fill = "grey70", color = "black") +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  theme_minimal() +
  theme(
    strip.text = element_text(size = 9),
    plot.title = element_text(hjust = 0.5, size = 15)
  ) +
  labs(title = "Socio-Economic Indicators Distribution (1–12)", x = NULL, y = "Count") +
  theme_minimal()

Interesting facts to note: Population Density has couple of countries which are extreme outliers, probably like Bangladesh. Agricultural Land Proportion has almost normal distribution, since this is obviously a natural random phenomenon. Area by Km^2 also has couple of extreme outliers like Russia, USA, China, and Canada. Birth and Fertility Rate seem to be positively correlated. Gasoline.Price is also almost normally distributed, which is interesting.

worldPlot2 <- melt(world_data_clean[, 13:24])
## No id variables; using all as measure variables
ggplot(worldPlot2) +
  geom_histogram(aes(x = value), bins = 30, fill = "grey70", color = "black") +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  theme_minimal() +
  theme(
    strip.text = element_text(size = 9),
    plot.title = element_text(hjust = 0.5, size = 15)
  ) +
  labs(title = "Socio-Economic Indicators Distribution (13–24)", x = NULL, y = "Count") +
  theme_minimal()

In this batch, variables seem more related to one another, because there are many of them that are close to normal like Population Labor force participation, Total tax rate, Primary Education Enrollment. Others look like they have extreme outliers again, Ex: Minimum.wage is extremely high in Scandinavian countries, Switzerland, Luxembourg, and etc.

worldPlot3 <- melt(world_data_clean[, 24:26])
## No id variables; using all as measure variables
ggplot(worldPlot3) +
  geom_histogram(aes(x = value), bins = 30, fill = "grey70", color = "black") +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  theme_minimal() +
  theme(
    strip.text = element_text(size = 9),
    plot.title = element_text(hjust = 0.5, size = 15)
  ) +
  labs(title = "Socio-Economic Indicators Distribution (25–26)", x = NULL, y = "Count") +
  theme_minimal()

Here we can see that total tax rate and unemployment rate are close in distribution, and urban population has severe outliers (India and China).

Exploratory Structure

country_names <- world_data_clean$Country
X <- world_data_clean %>% select(-Country)

world_data_clean_scaled <- scale(X)

cor_mat <- cor(world_data_clean_scaled)
corrplot::corrplot(cor_mat, tl.cex = 0.5)

From the correlation matrix we can see that Population is highly and positively correlated to Armed Forces, and Urban population which completely makes sense from real life. Maternal mortality ratio has high positive correlation to infant mortality ratio. Urban population to Co2 emission and armed forces. Population related statistics could be a good are to reduce the number of dimensions. Another good area would be related to education. Gross tertiary education enrollment has high negative correlation with Birth and Fertility rate, and therefore to infant and maternal mortality. On the opposite it has high positive correlation with physicians per thousand, indicating the development of the country.

KMO(cor_mat)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_mat)
## Overall MSA =  0.76
## MSA for each item = 
##                           Density..P.Km2. 
##                                      0.62 
##                     Agricultural.Land.... 
##                                      0.34 
##                            Land.Area.Km2. 
##                                      0.70 
##                         Armed.Forces.size 
##                                      0.79 
##                                Birth.Rate 
##                                      0.77 
##                             Co2.Emissions 
##                                      0.75 
##                                       CPI 
##                                      0.61 
##                            CPI.Change.... 
##                                      0.53 
##                            Fertility.Rate 
##                                      0.76 
##                         Forested.Area.... 
##                                      0.47 
##                            Gasoline.Price 
##                                      0.64 
##                                       GDP 
##                                      0.77 
##    Gross.primary.education.enrollment.... 
##                                      0.32 
##   Gross.tertiary.education.enrollment.... 
##                                      0.91 
##                          Infant.mortality 
##                                      0.83 
##                           Life.expectancy 
##                                      0.81 
##                  Maternal.mortality.ratio 
##                                      0.91 
##                              Minimum.wage 
##                                      0.77 
##          Out.of.pocket.health.expenditure 
##                                      0.70 
##                   Physicians.per.thousand 
##                                      0.88 
##                                Population 
##                                      0.69 
## Population..Labor.force.participation.... 
##                                      0.65 
##                           Tax.revenue.... 
##                                      0.87 
##                            Total.tax.rate 
##                                      0.57 
##                         Unemployment.rate 
##                                      0.46 
##                          Urban_population 
##                                      0.74
cortest.bartlett(cor_mat, n = 132)
## $chisq
## [1] 3247.735
## 
## $p.value
## [1] 0
## 
## $df
## [1] 325

In KMO test, our Overall MSA = 0.76, which is good and higher than 0.6 threshold to continue the analysis. In the bartlett’s test, the p-value = 0, which means that the correlation matrix is completely different from the identity matrix. According to these tests, the data is suitable for the PCA.

PCA

pca <- prcomp(world_data_clean_scaled, center = TRUE, scale. = TRUE)

fviz_eig(pca, choice = "eigenvalue", ncp = 22, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4",  addlabels = TRUE,   main = "Eigenvalues")
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5    PC6     PC7
## Standard deviation     2.6096 2.1824 1.52706 1.28640 1.1739 1.1054 1.07849
## Proportion of Variance 0.2619 0.1832 0.08969 0.06365 0.0530 0.0470 0.04474
## Cumulative Proportion  0.2619 0.4451 0.53479 0.59844 0.6514 0.6984 0.74317
##                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.0500 0.99860 0.80669 0.77316 0.73523 0.69377 0.63934
## Proportion of Variance 0.0424 0.03835 0.02503 0.02299 0.02079 0.01851 0.01572
## Cumulative Proportion  0.7856 0.82393 0.84896 0.87195 0.89274 0.91125 0.92698
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.58768 0.57946 0.55562 0.51418 0.45107 0.43586 0.30805
## Proportion of Variance 0.01328 0.01291 0.01187 0.01017 0.00783 0.00731 0.00365
## Cumulative Proportion  0.94026 0.95317 0.96505 0.97522 0.98304 0.99035 0.99400
##                          PC22    PC23    PC24   PC25    PC26
## Standard deviation     0.2746 0.18681 0.16312 0.1018 0.09358
## Proportion of Variance 0.0029 0.00134 0.00102 0.0004 0.00034
## Cumulative Proportion  0.9969 0.99824 0.99926 0.9997 1.00000

Based on Kaiser’s rule we should choose 8 dimensions for this dataset as the most optimal number, because eigenvalue is >1 starting from 8 components.

fviz_pca_var(pca, col.var = "brown4")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

This plot better visualizes the relation of variables and in which way they can be grouped into dimensions.

Land Area, GDP, Urban Population, Army size contribute a huge amount to the second dimension, and are all positively correlated. The first dimension has way more variables and therefore explains more variance, this dimension includes Fertility Rate, Birth Rate, Tertiary Education, and others. However, we can see that variables in the second dimension have more contribution per variable, which leads to the idea that those are the most important ones.

fviz_screeplot(pca, addlabels = TRUE, barfill = "salmon", barcolor = "tomato3", linecolor = "tomato4")
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

From the scree plot, we can see that first two dimensions explain ~45% of the variance, which is not too terrible for our case, but it is not good. Adding a third dimension would make it ~55%, which makes it significantly better, and so on.

pca_var <- get_pca_var(pca)

fviz_contrib(pca, "var", axes = 1:8, fill = "tomato3", color = "tomato4")

As I noted before this plot, Urban Population, Co2 Emissions, Birth Rate, Army size are the variables that can be grouped in certain ways with others to reduce dimensions. We can see they have the highest contribution to the dimensions with Urban Population contributing to almost 5%, which confirms that Urban Population is strongest variable, even though it is in the second dimension which explains less variance than the first.

pca_scores <- pca$x[, 1:2]

plot(pca_scores,
     xlab = "PC1",
     ylab = "PC2",
     main = "PCA of Countries Based on Socio-Economic Indicators",
     type = "n")

text(pca_scores, labels = country_names, cex = 0.6)

MDS with Euclidean distance

dist_world <- dist(world_data_clean_scaled, method = "euclidean")
mds <- cmdscale(dist_world, k = 2)
plot(mds,
     xlab = "Dimension 1",
     ylab = "Dimension 2",
     main = "MDS of Countries Based on Socio-Economic Indicators",
     type = "n")

text(mds, labels = country_names, cex = 0.6)

From above image, we can see that China is the most extreme outlier, while US, India, and Russia are also outliers but not as far as China. China has more differences due to the state’s ideology. Other 3 countries are outliers mostly due to their huge size and population. In India’s case, the specificity is in population related metrics. In Russia’s case it is mostly about the area of the country. In US case, all of the metrics are much higher than the average, but not the highest in the world. This image is in 2D, which means that some of the conclusions about outliers might be untrue, since our 2 dimensions cover only 45% of the variance.

mds_smacof <- mds(dist_world, ndim = 2, type = "ratio")
mds_smacof$stress
## [1] 0.2242191

Due to extreme outliers and countries’ similarities not well-captured in 2D, we get a low stress value. The whole world data can not be smooth and highly related, which leads to impossibility of 2D interpretation. Increasing dimensionality to 3 substantially improves the goodness of fit, confirming that the structure of the data is not adequately represented in 2 dimensions.

mds_3d <- mds(dist_world, ndim = 3)
mds_3d$stress
## [1] 0.1560127
mds_df <- as.data.frame(mds)
mds_df$Country <- country_names

mds_df[order(mds_df$V1), ][1:5, ]
##            V1          V2       Country
## 30  -5.792256 -17.2181765         China
## 127 -5.151687  -8.4683413 United States
## 48  -4.221929   1.6139743        Greece
## 8   -4.144360   0.1366599     Australia
## 86  -3.777922   1.5763425   Netherlands
mds_df[order(-mds_df$V1), ][1:5, ]
##           V1         V2                  Country
## 28  5.936277 -1.0034435 Central African Republic
## 90  5.672294 -2.7147704                  Nigeria
## 107 5.654711 -0.6911262             Sierra Leone
## 89  4.847501 -0.5813831                    Niger
## 76  4.730577 -0.5653628                     Mali

Looking at first dimension specifically, Russia and India are not the top 5 biggest outliers. They are only such in the context of Land Area and Population together. Interesting fact is that our first 2 dimensions cover those 2 metrics by a lot. The first dimension outliers which are shown on the table cover such statistics as Birth rate, Fertility Rate, Education, and more. We can see China and US are still here as extreme outliers, and China is a huge outlier looking at the second dimension(which included GDP, Population, and etc.) which is a good observation. Greece, Australia, and Netherlands are added in the list which tells us much about cultural uniquness of those countries (low population increase, high number of physicians, and probably high focus on tertiary education). On the other hand, the complete opposite of these countries can be seen on the second table, those are 5 African countries with extremely high population increase, humanitarian crisis due to poor medicine, and low focus on education due to the unavailability of it.

MDS with Gower distance

dist_gower_world <- daisy(world_data_clean[, -27]
                          , metric = "gower")
mds_gower <- cmdscale(dist_gower_world, k = 2)

plot(mds_gower,
     xlab = "Dimension 1",
     ylab = "Dimension 2",
     main = "MDS of Countries Based on Socio-Economic Indicators (Gower)",
     type = "n")

text(mds_gower, labels = country_names, cex = 0.6)

We can see in this case, two dimensions are better captured, as more outliers are visible, since China is not so extreme. For example, Afghanistan, Sudan, Iran are far from the bigger groups.

mds_smacof_gower <- mds(dist_gower_world, ndim = 2, type = "ratio")
mds_smacof_gower$stress
## [1] 0.2015165

MDS with Gower distance gives us better stress score, let’s take a look at outliers and see if we can notice differences.

mds_gower_df <- as.data.frame(mds_gower)
mds_gower_df$Country <- country_names

mds_gower_df[order(mds_gower_df$V1), ][1:5, ]
##             V1           V2       Country
## 127 -0.1690145 -0.054828382 United States
## 30  -0.1372541 -0.131520861         China
## 8   -0.1353132  0.010828047     Australia
## 48  -0.1347476  0.007557629        Greece
## 86  -0.1343809  0.024573341   Netherlands
mds_gower_df[order(-mds_gower_df$V1), ][1:5, ]
##            V1           V2                  Country
## 90  0.1997796 -0.124961518                  Nigeria
## 28  0.1848038  0.063357506 Central African Republic
## 89  0.1639130 -0.001678436                    Niger
## 114 0.1583318 -0.079280302                    Sudan
## 76  0.1569685  0.002994005                     Mali

The top outliers from both components’ perspective are exactly the same as with Euclidean distance.

Conclusion

To say couple of final words, the dimension reduction in context of global socio-economic factors per country can give us some valuable insights, but is not very strong. For example: China, USA, India, and Russia are clear global outliers from many perspectives, which we managed to retrieve through 2 dimensions. One dimension highlighted humanitarian crisis and population boom in African countries such as Central African Republic. This means that the answer to the first question is: “Yes, we can note global outliers via dimension reduction, but due to diversity of different factors, not all outliers are visible.” For instance, Arabic Gulf countries are very specific and rich, but first 2 dimensions is not able to capture that, since it covers other factors.

The similarity between PCA and classical MDS results is expected, as classical MDS applied to Euclidean distances is mathematically equivalent to PCA on centered data. MDS applied to Gower distances provides a slightly more clear picture of outliers, indicating Gower as a better metric in this case.

The most contributing factor is Urban Population, since it has other variables like Population and Co2 Emissions which are very closely correlated. Fertility and Birth rate are also the factors that highlight global inequality.

Viewing global data can be reduced in dimensions, but higher than 2 would let us extract much more knowledge from it. This leads to the potential extension of this project with the analysis of 3 and 4 dimensional data with tables and 3D view.