library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(stringr)
library(tibble)
library(ggplot2)
library(reshape2)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(pdp)
library(smacof)
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.5.2
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
##
## rescale
## Loading required package: colorspace
## Loading required package: e1071
##
## Attaching package: 'e1071'
## The following object is masked from 'package:ggplot2':
##
## element
##
## Attaching package: 'smacof'
## The following object is masked from 'package:psych':
##
## Procrustes
## The following object is masked from 'package:base':
##
## transform
library(cluster)
This project applies dimension reduction techniques to a global socio‑economic dataset. The main goal is to explore similarity patterns between countries and to assess whether a low‑dimensional representation can meaningfully summarize complex, high‑dimensional information. Principal Component Analysis (PCA) is used as the primary method, with Multidimensional Scaling (MDS) serving as a complementary, distance‑based approach. The questions that I would like to answer with this project are:
Which indicators contribute most to global inequality?
How robust are results across different distance measures?
The “Global Country Information 2023” dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons. This dataset is provided by EU funded platform Zenodo. Link: https://zenodo.org/records/8165229
world_data <- read.csv("world-data-2023.csv")
str(world_data)
## 'data.frame': 195 obs. of 35 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ Density..P.Km2. : chr "60" "105" "18" "164" ...
## $ Abbreviation : chr "AF" "AL" "DZ" "AD" ...
## $ Agricultural.Land.... : chr "58.10%" "43.10%" "17.40%" "40.00%" ...
## $ Land.Area.Km2. : chr "652,230" "28,748" "2,381,741" "468" ...
## $ Armed.Forces.size : chr "323,000" "9,000" "317,000" "" ...
## $ Birth.Rate : num 32.5 11.8 24.3 7.2 40.7 ...
## $ Calling.Code : int 93 355 213 376 244 1 54 374 61 43 ...
## $ Capital.Major.City : chr "Kabul" "Tirana" "Algiers" "Andorra la Vella" ...
## $ Co2.Emissions : chr "8,672" "4,536" "150,006" "469" ...
## $ CPI : chr "149.9" "119.05" "151.36" "" ...
## $ CPI.Change.... : chr "2.30%" "1.40%" "2.00%" "" ...
## $ Currency.Code : chr "AFN" "ALL" "DZD" "EUR" ...
## $ Fertility.Rate : num 4.47 1.62 3.02 1.27 5.52 1.99 2.26 1.76 1.74 1.47 ...
## $ Forested.Area.... : chr "2.10%" "28.10%" "0.80%" "34.00%" ...
## $ Gasoline.Price : chr "$0.70 " "$1.36 " "$0.28 " "$1.51 " ...
## $ GDP : chr "$19,101,353,833 " "$15,278,077,447 " "$169,988,236,398 " "$3,154,057,987 " ...
## $ Gross.primary.education.enrollment.... : chr "104.00%" "107.00%" "109.90%" "106.40%" ...
## $ Gross.tertiary.education.enrollment.... : chr "9.70%" "55.00%" "51.40%" "" ...
## $ Infant.mortality : num 47.9 7.8 20.1 2.7 51.6 5 8.8 11 3.1 2.9 ...
## $ Largest.city : chr "Kabul" "Tirana" "Algiers" "Andorra la Vella" ...
## $ Life.expectancy : num 64.5 78.5 76.7 NA 60.8 76.9 76.5 74.9 82.7 81.6 ...
## $ Maternal.mortality.ratio : int 638 15 112 NA 241 42 39 26 6 5 ...
## $ Minimum.wage : chr "$0.43 " "$1.12 " "$0.95 " "$6.63 " ...
## $ Official.language : chr "Pashto" "Albanian" "Arabic" "Catalan" ...
## $ Out.of.pocket.health.expenditure : chr "78.40%" "56.90%" "28.10%" "36.40%" ...
## $ Physicians.per.thousand : num 0.28 1.2 1.72 3.33 0.21 2.76 3.96 4.4 3.68 5.17 ...
## $ Population : chr "38,041,754" "2,854,191" "43,053,054" "77,142" ...
## $ Population..Labor.force.participation....: chr "48.90%" "55.70%" "41.20%" "" ...
## $ Tax.revenue.... : chr "9.30%" "18.60%" "37.20%" "" ...
## $ Total.tax.rate : chr "71.40%" "36.60%" "66.10%" "" ...
## $ Unemployment.rate : chr "11.12%" "12.33%" "11.70%" "" ...
## $ Urban_population : chr "9,797,273" "1,747,593" "31,510,100" "67,873" ...
## $ Latitude : num 33.9 41.2 28 42.5 -11.2 ...
## $ Longitude : num 67.71 20.17 1.66 1.52 17.87 ...
The dataset contains a mixture of numeric variables and numeric values stored as character strings (e.g. currency symbols, percentages). These need to be cleaned and converted before any multivariate analysis.
clean_numeric <- function(x) {
x <- str_replace_all(x, ",", "")
x <- str_replace_all(x, "%", "")
x <- str_replace_all(x, "\\$", "")
x <- str_trim(x)
x[x == ""] <- NA
as.numeric(x)
}
world_data_clean <- world_data %>%
mutate(across(where(is.character), clean_numeric))
## Warning: There were 6 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(where(is.character), clean_numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
world_data_clean <- world_data_clean %>%
select(where(is.numeric)) %>%
select(where(~ mean(is.na(.)) < 0.5)) %>%
select(-Latitude, -Longitude, -Calling.Code)
The cleaning function standardizes all numeric information by removing non‑numeric characters and converting values to numeric type. NA introduced warning is due to the fact that some of the values were empty strings, while others were “numeric” strings such as “$9.8”, which had to be turned into numeric. Therefore, the warning does not introduce problems.
Variables with excessive missingness are excluded to avoid unreliable imputations. Geographic coordinates and calling codes are removed because they do not represent socio‑economic characteristics relevant to the analysis.
#Adding for analysis
world_data_clean$Country <- world_data$Country
colSums(is.na(world_data_clean))
## Density..P.Km2.
## 0
## Agricultural.Land....
## 7
## Land.Area.Km2.
## 1
## Armed.Forces.size
## 24
## Birth.Rate
## 6
## Co2.Emissions
## 7
## CPI
## 17
## CPI.Change....
## 16
## Fertility.Rate
## 7
## Forested.Area....
## 7
## Gasoline.Price
## 20
## GDP
## 2
## Gross.primary.education.enrollment....
## 7
## Gross.tertiary.education.enrollment....
## 12
## Infant.mortality
## 6
## Life.expectancy
## 8
## Maternal.mortality.ratio
## 14
## Minimum.wage
## 45
## Out.of.pocket.health.expenditure
## 7
## Physicians.per.thousand
## 7
## Population
## 1
## Population..Labor.force.participation....
## 19
## Tax.revenue....
## 26
## Total.tax.rate
## 12
## Unemployment.rate
## 19
## Urban_population
## 5
## Country
## 0
rowMeans(is.na(world_data_clean)) * 100
## [1] 0.000000 0.000000 0.000000 37.037037 0.000000 7.407407 0.000000
## [8] 0.000000 0.000000 3.703704 0.000000 0.000000 3.703704 0.000000
## [15] 0.000000 3.703704 0.000000 0.000000 0.000000 0.000000 3.703704
## [22] 3.703704 0.000000 0.000000 7.407407 0.000000 0.000000 3.703704
## [29] 0.000000 0.000000 3.703704 0.000000 0.000000 0.000000 3.703704
## [36] 0.000000 0.000000 0.000000 11.111111 0.000000 0.000000 0.000000
## [43] 18.518519 3.703704 0.000000 0.000000 3.703704 7.407407 18.518519
## [50] 0.000000 3.703704 3.703704 0.000000 3.703704 14.814815 0.000000
## [57] 74.074074 3.703704 0.000000 3.703704 0.000000 0.000000 0.000000
## [64] 0.000000 0.000000 0.000000 0.000000 14.814815 0.000000 3.703704
## [71] 3.703704 3.703704 3.703704 85.185185 0.000000 0.000000 3.703704
## [78] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.703704
## [85] 0.000000 0.000000 0.000000 0.000000 0.000000 22.222222 0.000000
## [92] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.703704
## [99] 40.740741 0.000000 0.000000 0.000000 0.000000 0.000000 3.703704
## [106] 0.000000 0.000000 22.222222 3.703704 0.000000 0.000000 18.518519
## [113] 0.000000 55.555556 0.000000 3.703704 0.000000 0.000000 0.000000
## [120] 3.703704 81.481481 0.000000 0.000000 0.000000 0.000000 0.000000
## [127] 0.000000 22.222222 77.777778 3.703704 0.000000 0.000000 18.518519
## [134] 92.592593 3.703704 0.000000 0.000000 0.000000 0.000000 0.000000
## [141] 0.000000 3.703704 0.000000 0.000000 3.703704 18.518519 7.407407
## [148] 7.407407 3.703704 25.925926 7.407407 0.000000 0.000000 0.000000
## [155] 11.111111 0.000000 3.703704 0.000000 0.000000 11.111111 18.518519
## [162] 3.703704 0.000000 22.222222 0.000000 0.000000 0.000000 3.703704
## [169] 3.703704 3.703704 0.000000 0.000000 0.000000 0.000000 0.000000
## [176] 0.000000 11.111111 0.000000 0.000000 0.000000 14.814815 51.851852
## [183] 0.000000 0.000000 3.703704 0.000000 0.000000 0.000000 7.407407
## [190] 3.703704 3.703704 0.000000 7.407407 0.000000 3.703704
world_data_clean[rowMeans(is.na(world_data_clean)) > 0.15, 27]
## [1] "Andorra" "Cuba"
## [3] "Dominica" "Eswatini"
## [5] "Vatican City" "Kiribati"
## [7] "Liechtenstein" "Marshall Islands"
## [9] "Federated States of Micronesia" "Monaco"
## [11] "Nauru" "North Korea"
## [13] "North Macedonia" "Palau"
## [15] "Palestinian National Authority" "Saint Kitts and Nevis"
## [17] "San Marino" "Somalia"
## [19] "South Sudan" "Tuvalu"
world_data_clean <- world_data_clean[rowMeans(is.na(world_data_clean)) < 0.15, ]
colSums(is.na(world_data_clean))
## Density..P.Km2.
## 0
## Agricultural.Land....
## 0
## Land.Area.Km2.
## 0
## Armed.Forces.size
## 8
## Birth.Rate
## 0
## Co2.Emissions
## 0
## CPI
## 4
## CPI.Change....
## 3
## Fertility.Rate
## 0
## Forested.Area....
## 0
## Gasoline.Price
## 8
## GDP
## 0
## Gross.primary.education.enrollment....
## 1
## Gross.tertiary.education.enrollment....
## 2
## Infant.mortality
## 0
## Life.expectancy
## 0
## Maternal.mortality.ratio
## 0
## Minimum.wage
## 32
## Out.of.pocket.health.expenditure
## 0
## Physicians.per.thousand
## 0
## Population
## 0
## Population..Labor.force.participation....
## 3
## Tax.revenue....
## 15
## Total.tax.rate
## 1
## Unemployment.rate
## 3
## Urban_population
## 0
## Country
## 0
After seeing that most of the countries that have missing data are with NAs in more than 15% of the variables, I decided to remove them, since >4,5 variables missing is a problem. This way, first stage of cleaning is finished and we got rid of 20 countries that had not enough data, which is not a big loss. Moreover, analysis showed those countries are countries with no real data at all/microcountries/islands.
world_data_clean[is.na(world_data_clean$Minimum.wage),27]
## [1] "Austria" "Bahrain" "Brunei"
## [4] "Burundi" "Cambodia" "Cyprus"
## [7] "Denmark" "Djibouti" "Egypt"
## [10] "Eritrea" "Ethiopia" "Finland"
## [13] "Grenada" "Guinea" "Iceland"
## [16] "Italy" "Maldives" "Namibia"
## [19] "Norway" "Qatar" "Rwanda"
## [22] "Saint Lucia" "S�����������" "Singapore"
## [25] "South Africa" "Suriname" "Sweden"
## [28] "Switzerland" "Tonga" "United Arab Emirates"
## [31] "Yemen" "Zimbabwe"
world_data_clean <- world_data_clean[!is.na(world_data_clean$Minimum.wage), ]
world_data_clean[is.na(world_data_clean$Tax.revenue....),27]
## [1] "Chad" "Comoros" "Ecuador" "Guyana" "Haiti"
## [6] "Libya" "Mauritania" "Montenegro" "Panama" "Turkmenistan"
## [11] "Venezuela"
world_data_clean <- world_data_clean[!is.na(world_data_clean$Tax.revenue....), ]
Above, we had to pay attention to NAs in Armed forces, Minimum wage and Tax revenue specifically, because these variables had many NAs and can not be properly replaced. For Min.wage, unfortunately, the only way was to remove the countries with NAs, but luckily most of those countries were micronations. Countries with missing Tax.revenue were all micronations or closed countries, which are hard to get information about, therefore they were also removed.
world_data_clean[is.na(world_data_clean$Armed.Forces.size),27]
## [1] "Saint Vincent and the Grenadines" "Samoa"
## [3] "Solomon Islands" "Vanuatu"
world_data_clean$Armed.Forces.size[is.na(world_data_clean$Armed.Forces.size)] <- 0
colSums(is.na(world_data_clean))
## Density..P.Km2.
## 0
## Agricultural.Land....
## 0
## Land.Area.Km2.
## 0
## Armed.Forces.size
## 0
## Birth.Rate
## 0
## Co2.Emissions
## 0
## CPI
## 2
## CPI.Change....
## 1
## Fertility.Rate
## 0
## Forested.Area....
## 0
## Gasoline.Price
## 5
## GDP
## 0
## Gross.primary.education.enrollment....
## 1
## Gross.tertiary.education.enrollment....
## 2
## Infant.mortality
## 0
## Life.expectancy
## 0
## Maternal.mortality.ratio
## 0
## Minimum.wage
## 0
## Out.of.pocket.health.expenditure
## 0
## Physicians.per.thousand
## 0
## Population
## 0
## Population..Labor.force.participation....
## 2
## Tax.revenue....
## 0
## Total.tax.rate
## 0
## Unemployment.rate
## 2
## Urban_population
## 0
## Country
## 0
For Armed.Forces.size, it gets more interesting, because looking at the list of countries with missing Armed.Forces, we can conclude that there is no army in those small island-countries, and set those values to 0.
After seeing that we managed to reduce the number of NAs significantly, the only variables that are left with some NAs are those that we can simply replace with means, without removing important information or deleting the whole observations. Those variables are: Gasoline.price, Cpi.Change, Unemployment.rate, Gross.tertiary.education, and Gross.primary.education. We replace all these variables with means and have no missing values after that.
world_data_clean <- as.data.frame(
lapply(world_data_clean, function(x) {
replace(x, is.na(x), mean(x, na.rm = TRUE))
})
)
## Warning in mean.default(x, na.rm = TRUE): argument is not numeric or logical:
## returning NA
colSums(is.na(world_data_clean))
## Density..P.Km2.
## 0
## Agricultural.Land....
## 0
## Land.Area.Km2.
## 0
## Armed.Forces.size
## 0
## Birth.Rate
## 0
## Co2.Emissions
## 0
## CPI
## 0
## CPI.Change....
## 0
## Fertility.Rate
## 0
## Forested.Area....
## 0
## Gasoline.Price
## 0
## GDP
## 0
## Gross.primary.education.enrollment....
## 0
## Gross.tertiary.education.enrollment....
## 0
## Infant.mortality
## 0
## Life.expectancy
## 0
## Maternal.mortality.ratio
## 0
## Minimum.wage
## 0
## Out.of.pocket.health.expenditure
## 0
## Physicians.per.thousand
## 0
## Population
## 0
## Population..Labor.force.participation....
## 0
## Tax.revenue....
## 0
## Total.tax.rate
## 0
## Unemployment.rate
## 0
## Urban_population
## 0
## Country
## 0
worldPlot1 <- melt(world_data_clean[, 1:12])
## No id variables; using all as measure variables
ggplot(worldPlot1) +
geom_histogram(aes(x = value), bins = 30, fill = "grey70", color = "black") +
facet_wrap(~ variable, scales = "free", ncol = 3) +
theme_minimal() +
theme(
strip.text = element_text(size = 9),
plot.title = element_text(hjust = 0.5, size = 15)
) +
labs(title = "Socio-Economic Indicators Distribution (1–12)", x = NULL, y = "Count") +
theme_minimal()
Interesting facts to note: Population Density has couple of countries which are extreme outliers, probably like Bangladesh. Agricultural Land Proportion has almost normal distribution, since this is obviously a natural random phenomenon. Area by Km^2 also has couple of extreme outliers like Russia, USA, China, and Canada. Birth and Fertility Rate seem to be positively correlated. Gasoline.Price is also almost normally distributed, which is interesting.
worldPlot2 <- melt(world_data_clean[, 13:24])
## No id variables; using all as measure variables
ggplot(worldPlot2) +
geom_histogram(aes(x = value), bins = 30, fill = "grey70", color = "black") +
facet_wrap(~ variable, scales = "free", ncol = 3) +
theme_minimal() +
theme(
strip.text = element_text(size = 9),
plot.title = element_text(hjust = 0.5, size = 15)
) +
labs(title = "Socio-Economic Indicators Distribution (13–24)", x = NULL, y = "Count") +
theme_minimal()
In this batch, variables seem more related to one another, because there are many of them that are close to normal like Population Labor force participation, Total tax rate, Primary Education Enrollment. Others look like they have extreme outliers again, Ex: Minimum.wage is extremely high in Scandinavian countries, Switzerland, Luxembourg, and etc.
worldPlot3 <- melt(world_data_clean[, 24:26])
## No id variables; using all as measure variables
ggplot(worldPlot3) +
geom_histogram(aes(x = value), bins = 30, fill = "grey70", color = "black") +
facet_wrap(~ variable, scales = "free", ncol = 3) +
theme_minimal() +
theme(
strip.text = element_text(size = 9),
plot.title = element_text(hjust = 0.5, size = 15)
) +
labs(title = "Socio-Economic Indicators Distribution (25–26)", x = NULL, y = "Count") +
theme_minimal()
Here we can see that total tax rate and unemployment rate are close in distribution, and urban population has severe outliers (India and China).
country_names <- world_data_clean$Country
X <- world_data_clean %>% select(-Country)
world_data_clean_scaled <- scale(X)
cor_mat <- cor(world_data_clean_scaled)
corrplot::corrplot(cor_mat, tl.cex = 0.5)
From the correlation matrix we can see that Population is highly and positively correlated to Armed Forces, and Urban population which completely makes sense from real life. Maternal mortality ratio has high positive correlation to infant mortality ratio. Urban population to Co2 emission and armed forces. Population related statistics could be a good are to reduce the number of dimensions. Another good area would be related to education. Gross tertiary education enrollment has high negative correlation with Birth and Fertility rate, and therefore to infant and maternal mortality. On the opposite it has high positive correlation with physicians per thousand, indicating the development of the country.
KMO(cor_mat)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_mat)
## Overall MSA = 0.76
## MSA for each item =
## Density..P.Km2.
## 0.62
## Agricultural.Land....
## 0.34
## Land.Area.Km2.
## 0.70
## Armed.Forces.size
## 0.79
## Birth.Rate
## 0.77
## Co2.Emissions
## 0.75
## CPI
## 0.61
## CPI.Change....
## 0.53
## Fertility.Rate
## 0.76
## Forested.Area....
## 0.47
## Gasoline.Price
## 0.64
## GDP
## 0.77
## Gross.primary.education.enrollment....
## 0.32
## Gross.tertiary.education.enrollment....
## 0.91
## Infant.mortality
## 0.83
## Life.expectancy
## 0.81
## Maternal.mortality.ratio
## 0.91
## Minimum.wage
## 0.77
## Out.of.pocket.health.expenditure
## 0.70
## Physicians.per.thousand
## 0.88
## Population
## 0.69
## Population..Labor.force.participation....
## 0.65
## Tax.revenue....
## 0.87
## Total.tax.rate
## 0.57
## Unemployment.rate
## 0.46
## Urban_population
## 0.74
cortest.bartlett(cor_mat, n = 132)
## $chisq
## [1] 3247.735
##
## $p.value
## [1] 0
##
## $df
## [1] 325
In KMO test, our Overall MSA = 0.76, which is good and higher than 0.6 threshold to continue the analysis. In the bartlett’s test, the p-value = 0, which means that the correlation matrix is completely different from the identity matrix. According to these tests, the data is suitable for the PCA.
pca <- prcomp(world_data_clean_scaled, center = TRUE, scale. = TRUE)
fviz_eig(pca, choice = "eigenvalue", ncp = 22, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4", addlabels = TRUE, main = "Eigenvalues")
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6096 2.1824 1.52706 1.28640 1.1739 1.1054 1.07849
## Proportion of Variance 0.2619 0.1832 0.08969 0.06365 0.0530 0.0470 0.04474
## Cumulative Proportion 0.2619 0.4451 0.53479 0.59844 0.6514 0.6984 0.74317
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.0500 0.99860 0.80669 0.77316 0.73523 0.69377 0.63934
## Proportion of Variance 0.0424 0.03835 0.02503 0.02299 0.02079 0.01851 0.01572
## Cumulative Proportion 0.7856 0.82393 0.84896 0.87195 0.89274 0.91125 0.92698
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.58768 0.57946 0.55562 0.51418 0.45107 0.43586 0.30805
## Proportion of Variance 0.01328 0.01291 0.01187 0.01017 0.00783 0.00731 0.00365
## Cumulative Proportion 0.94026 0.95317 0.96505 0.97522 0.98304 0.99035 0.99400
## PC22 PC23 PC24 PC25 PC26
## Standard deviation 0.2746 0.18681 0.16312 0.1018 0.09358
## Proportion of Variance 0.0029 0.00134 0.00102 0.0004 0.00034
## Cumulative Proportion 0.9969 0.99824 0.99926 0.9997 1.00000
Based on Kaiser’s rule we should choose 8 dimensions for this dataset as the most optimal number, because eigenvalue is >1 starting from 8 components.
fviz_pca_var(pca, col.var = "brown4")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
## Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
This plot better visualizes the relation of variables and in which way they can be grouped into dimensions.
Land Area, GDP, Urban Population, Army size contribute a huge amount to the second dimension, and are all positively correlated. The first dimension has way more variables and therefore explains more variance, this dimension includes Fertility Rate, Birth Rate, Tertiary Education, and others. However, we can see that variables in the second dimension have more contribution per variable, which leads to the idea that those are the most important ones.
fviz_screeplot(pca, addlabels = TRUE, barfill = "salmon", barcolor = "tomato3", linecolor = "tomato4")
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
From the scree plot, we can see that first two dimensions explain ~45% of the variance, which is not too terrible for our case, but it is not good. Adding a third dimension would make it ~55%, which makes it significantly better, and so on.
pca_var <- get_pca_var(pca)
fviz_contrib(pca, "var", axes = 1:8, fill = "tomato3", color = "tomato4")
As I noted before this plot, Urban Population, Co2 Emissions, Birth Rate, Army size are the variables that can be grouped in certain ways with others to reduce dimensions. We can see they have the highest contribution to the dimensions with Urban Population contributing to almost 5%, which confirms that Urban Population is strongest variable, even though it is in the second dimension which explains less variance than the first.
pca_scores <- pca$x[, 1:2]
plot(pca_scores,
xlab = "PC1",
ylab = "PC2",
main = "PCA of Countries Based on Socio-Economic Indicators",
type = "n")
text(pca_scores, labels = country_names, cex = 0.6)
dist_world <- dist(world_data_clean_scaled, method = "euclidean")
mds <- cmdscale(dist_world, k = 2)
plot(mds,
xlab = "Dimension 1",
ylab = "Dimension 2",
main = "MDS of Countries Based on Socio-Economic Indicators",
type = "n")
text(mds, labels = country_names, cex = 0.6)
From above image, we can see that China is the most extreme outlier, while US, India, and Russia are also outliers but not as far as China. China has more differences due to the state’s ideology. Other 3 countries are outliers mostly due to their huge size and population. In India’s case, the specificity is in population related metrics. In Russia’s case it is mostly about the area of the country. In US case, all of the metrics are much higher than the average, but not the highest in the world. This image is in 2D, which means that some of the conclusions about outliers might be untrue, since our 2 dimensions cover only 45% of the variance.
mds_smacof <- mds(dist_world, ndim = 2, type = "ratio")
mds_smacof$stress
## [1] 0.2242191
Due to extreme outliers and countries’ similarities not well-captured in 2D, we get a low stress value. The whole world data can not be smooth and highly related, which leads to impossibility of 2D interpretation. Increasing dimensionality to 3 substantially improves the goodness of fit, confirming that the structure of the data is not adequately represented in 2 dimensions.
mds_3d <- mds(dist_world, ndim = 3)
mds_3d$stress
## [1] 0.1560127
mds_df <- as.data.frame(mds)
mds_df$Country <- country_names
mds_df[order(mds_df$V1), ][1:5, ]
## V1 V2 Country
## 30 -5.792256 -17.2181765 China
## 127 -5.151687 -8.4683413 United States
## 48 -4.221929 1.6139743 Greece
## 8 -4.144360 0.1366599 Australia
## 86 -3.777922 1.5763425 Netherlands
mds_df[order(-mds_df$V1), ][1:5, ]
## V1 V2 Country
## 28 5.936277 -1.0034435 Central African Republic
## 90 5.672294 -2.7147704 Nigeria
## 107 5.654711 -0.6911262 Sierra Leone
## 89 4.847501 -0.5813831 Niger
## 76 4.730577 -0.5653628 Mali
Looking at first dimension specifically, Russia and India are not the top 5 biggest outliers. They are only such in the context of Land Area and Population together. Interesting fact is that our first 2 dimensions cover those 2 metrics by a lot. The first dimension outliers which are shown on the table cover such statistics as Birth rate, Fertility Rate, Education, and more. We can see China and US are still here as extreme outliers, and China is a huge outlier looking at the second dimension(which included GDP, Population, and etc.) which is a good observation. Greece, Australia, and Netherlands are added in the list which tells us much about cultural uniquness of those countries (low population increase, high number of physicians, and probably high focus on tertiary education). On the other hand, the complete opposite of these countries can be seen on the second table, those are 5 African countries with extremely high population increase, humanitarian crisis due to poor medicine, and low focus on education due to the unavailability of it.
dist_gower_world <- daisy(world_data_clean[, -27]
, metric = "gower")
mds_gower <- cmdscale(dist_gower_world, k = 2)
plot(mds_gower,
xlab = "Dimension 1",
ylab = "Dimension 2",
main = "MDS of Countries Based on Socio-Economic Indicators (Gower)",
type = "n")
text(mds_gower, labels = country_names, cex = 0.6)
We can see in this case, two dimensions are better captured, as more outliers are visible, since China is not so extreme. For example, Afghanistan, Sudan, Iran are far from the bigger groups.
mds_smacof_gower <- mds(dist_gower_world, ndim = 2, type = "ratio")
mds_smacof_gower$stress
## [1] 0.2015165
MDS with Gower distance gives us better stress score, let’s take a look at outliers and see if we can notice differences.
mds_gower_df <- as.data.frame(mds_gower)
mds_gower_df$Country <- country_names
mds_gower_df[order(mds_gower_df$V1), ][1:5, ]
## V1 V2 Country
## 127 -0.1690145 -0.054828382 United States
## 30 -0.1372541 -0.131520861 China
## 8 -0.1353132 0.010828047 Australia
## 48 -0.1347476 0.007557629 Greece
## 86 -0.1343809 0.024573341 Netherlands
mds_gower_df[order(-mds_gower_df$V1), ][1:5, ]
## V1 V2 Country
## 90 0.1997796 -0.124961518 Nigeria
## 28 0.1848038 0.063357506 Central African Republic
## 89 0.1639130 -0.001678436 Niger
## 114 0.1583318 -0.079280302 Sudan
## 76 0.1569685 0.002994005 Mali
The top outliers from both components’ perspective are exactly the same as with Euclidean distance.
To say couple of final words, the dimension reduction in context of global socio-economic factors per country can give us some valuable insights, but is not very strong. For example: China, USA, India, and Russia are clear global outliers from many perspectives, which we managed to retrieve through 2 dimensions. One dimension highlighted humanitarian crisis and population boom in African countries such as Central African Republic. This means that the answer to the first question is: “Yes, we can note global outliers via dimension reduction, but due to diversity of different factors, not all outliers are visible.” For instance, Arabic Gulf countries are very specific and rich, but first 2 dimensions is not able to capture that, since it covers other factors.
The similarity between PCA and classical MDS results is expected, as classical MDS applied to Euclidean distances is mathematically equivalent to PCA on centered data. MDS applied to Gower distances provides a slightly more clear picture of outliers, indicating Gower as a better metric in this case.
The most contributing factor is Urban Population, since it has other variables like Population and Co2 Emissions which are very closely correlated. Fertility and Birth rate are also the factors that highlight global inequality.
Viewing global data can be reduced in dimensions, but higher than 2 would let us extract much more knowledge from it. This leads to the potential extension of this project with the analysis of 3 and 4 dimensional data with tables and 3D view.