This project investigates whether global inequalities can be divided into two distinct, independent dimensions: the economic and the social. While economic inequalities are quantifiable and relatively easy to measure, social inequalities, such as gender disparity or unequal life expectancy, are often more qualitative and harder to define.
A central debate in development economics is whether social inequalities are merely a derivative of economic ones. One might argue that the concentration of capital naturally dictates access to healthcare and education, making social outcomes a secondary effect of wealth distribution. My goal is to test this hypothesis by using principal component analysis (PCA) to reduce multiple inequality indicators to two principal components. By doing so, I aim to determine whether social and economic measurements load onto separate components. Because principal components are orthogonal by construction, the informative question is whether each group of indicators aligns with its own component. If it does, this would suggest that social inequality is not just a byproduct of economic wealth, but a separate phenomenon that requires its own policy interventions.
Finally, this study explores the nature of environmental inequality, represented by the carbon footprint of the top 10%. The goal is to determine whether this variable aligns with the economic or the social dimension. Specifically, I aim to test whether high emissions are merely a function of financial capability (the rich can afford to emit more) or whether they reflect broader social structures, such as consumption norms.
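The loading pattern that this hypothesis predicts can be illustrated on synthetic data. The sketch below is only an illustration under stated assumptions: the two latent factors, the noise level, and the variable names (`econ_1`, `soc_1`, and so on) are invented and are not part of the real dataset. It checks which of the first two components each simulated indicator loads on most heavily.

```r
# Synthetic illustration only: names and parameters are hypothetical
set.seed(123)
n <- 200
economic <- rnorm(n)  # latent economic inequality factor
social   <- rnorm(n)  # latent social inequality factor, independent of the first

sim <- data.frame(
  econ_1 = economic + rnorm(n, sd = 0.4),
  econ_2 = economic + rnorm(n, sd = 0.4),
  soc_1  = social + rnorm(n, sd = 0.4),
  soc_2  = social + rnorm(n, sd = 0.4),
  soc_3  = social + rnorm(n, sd = 0.4)
)

pca_sim  <- prcomp(sim, center = TRUE, scale. = TRUE)
loadings <- pca_sim$rotation[, 1:2]

# For each variable, the index (1 or 2) of the component with the largest absolute loading
dominant <- apply(abs(loadings), 1, which.max)
dominant
```

If the two groups are driven by independent factors, all indicators in a group share the same dominant component and the two groups differ. A derivative relationship would instead pull both groups onto the same component.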
For this project I took data from two sources:
World Inequality Database (WID): This database is managed by the World Inequality Lab, led by researchers such as Thomas Piketty and Gabriel Zucman. For this analysis, I use the shares of total wealth and income held by the richest 1% of the population and the percentage of carbon dioxide emissions caused by the richest 10%.
Human Development Report database, created by the United Nations Development Programme (UNDP): This is my source for measurements of social inequalities, such as gender, education, and life expectancy inequalities.
# Loading the required libraries
library(readxl)
library(WDI)
library(tidyr)
library(dplyr)
library(stringr)
library(countrycode)
library(corrplot)
library(psych)
library(factoextra)
library(gridExtra)
library(knitr)
# Loading data
# Loading World Inequality Database data
wid <- read_excel('WID_Data_DR.xlsx')
# Label each indicator by its WID code prefix
wid <- wid %>%
  mutate(type = case_when(
    startsWith(Indicator, "sptinc") ~ "Income",
    startsWith(Indicator, "lpfghg") ~ "CO2Emission",
    startsWith(Indicator, "shweal") ~ "Wealth"
  ))
# Combine indicator type and percentile into one column name, e.g. Income_p99p100
wid$type <- paste(wid$type, wid$Precentile, sep = "_")
wid <- pivot_wider(data = wid, id_cols = c(Country, Year), names_from = type, values_from = Value)
# Map country names to ISO3 codes so both sources can be merged on a common key
wid$iso3c <- countrycode(sourcevar = wid$Country, origin = "country.name", destination = "iso3c")
#Loading data from Human Development Report
undp<- read_excel('hdr_undp.xlsx')
undp<- pivot_wider(data = undp, id_cols = c(country, year) ,names_from = indicatorCode, values_from = value )
undp$iso3c<-countrycode(sourcevar = undp$country, origin = "country.name", destination = "iso3c")
undp<- rename(undp, Year = year, Country = country)
undp$gii<- as.numeric(undp$gii)
undp$ineq_edu<- as.numeric(undp$ineq_edu)
undp$ineq_le<- as.numeric(undp$ineq_le)
undp$Year<- as.numeric(undp$Year)
wid<- wid[!is.na(wid$iso3c),]
undp<- undp[!is.na(undp$iso3c),]
#Merging and handling missing values
DF <- wid %>%
left_join(select(undp, -Country), by = c("iso3c", "Year"))
DF$ineq_edu <- DF$ineq_edu/100
DF$ineq_le <- DF$ineq_le/100
DF$CO2Emission_p90p100 <- DF$CO2Emission_p90p100/100
df<- DF[DF$Year == 2019,] #extracting data for year 2019
cat("Number of observations without missing values:", sum(complete.cases(df)))
## Number of observations without missing values: 155
missing_values <- colSums(is.na(df))
kable(as.data.frame(missing_values), col.names = "Number of NAs", caption = "Missing values per variable")
| Variable | Number of NAs |
|---|---|
| Country | 0 |
| Year | 0 |
| Income_p99p100 | 6 |
| Wealth_p99p100 | 6 |
| CO2Emission_p90p100 | 54 |
| iso3c | 0 |
| gii | 46 |
| ineq_edu | 32 |
| ineq_le | 21 |
df<-na.omit(df)
kable(head(df, 10))
| Country | Year | Income_p99p100 | Wealth_p99p100 | CO2Emission_p90p100 | iso3c | gii | ineq_edu | ineq_le |
|---|---|---|---|---|---|---|---|---|
| Afghanistan | 2019 | 0.1469 | 0.2362 | 0.0341253 | AFG | 0.676 | 0.45365 | 0.28075 |
| Albania | 2019 | 0.0934 | 0.2247 | 0.1067012 | ALB | 0.131 | 0.12333 | 0.06452 |
| Algeria | 2019 | 0.2261 | 0.2801 | 0.0791766 | DZA | 0.385 | 0.33283 | 0.13008 |
| Angola | 2019 | 0.2584 | 0.3841 | 0.0616115 | AGO | 0.536 | 0.34171 | 0.29883 |
| Argentina | 2019 | 0.1482 | 0.2457 | 0.1816059 | ARG | 0.283 | 0.05787 | 0.07992 |
| Armenia | 2019 | 0.1457 | 0.2341 | 0.0989824 | ARM | 0.216 | 0.02935 | 0.07804 |
| Australia | 2019 | 0.0992 | 0.2300 | 0.5162757 | AUS | 0.079 | 0.04306 | 0.03537 |
| Austria | 2019 | 0.0876 | 0.2958 | 0.3616485 | AUT | 0.052 | 0.02855 | 0.03354 |
| Azerbaijan | 2019 | 0.1345 | 0.2310 | 0.1572052 | AZE | 0.314 | 0.04045 | 0.12254 |
| Bahamas | 2019 | 0.2266 | 0.3103 | 0.2951544 | BHS | 0.335 | 0.05143 | 0.14526 |
summary(df)
## Country Year Income_p99p100 Wealth_p99p100
## Length:155 Min. :2019 Min. :0.0681 Min. :0.1383
## Class :character 1st Qu.:2019 1st Qu.:0.1212 1st Qu.:0.2405
## Mode :character Median :2019 Median :0.1528 Median :0.2685
## Mean :2019 Mean :0.1610 Mean :0.2829
## 3rd Qu.:2019 3rd Qu.:0.2016 3rd Qu.:0.3105
## Max. :2019 Max. :0.3134 Max. :0.5472
## CO2Emission_p90p100 iso3c gii ineq_edu
## Min. :0.005952 Length:155 Min. :0.0140 Min. :0.01369
## 1st Qu.:0.072093 Class :character 1st Qu.:0.1595 1st Qu.:0.05912
## Median :0.165056 Mode :character Median :0.3370 Median :0.14627
## Mean :0.216709 Mean :0.3362 Mean :0.18562
## 3rd Qu.:0.273318 3rd Qu.:0.5005 3rd Qu.:0.28663
## Max. :1.436777 Max. :0.8160 Max. :0.50124
## ineq_le
## Min. :0.02331
## 1st Qu.:0.04767
## Median :0.10057
## Mean :0.13302
## 3rd Qu.:0.21141
## Max. :0.40913
The first step of the analysis is to examine the correlation matrix. My hypothesis is that variables will show strong intra-group correlation (within the economic and social categories) but weak inter-group correlation.
# Keep only the numeric indicators for the correlation analysis
df_num <- select(df, -c(Country, Year, iso3c))
correlation<-cor(df_num, method = 'spearman')
corrplot(correlation, type = 'lower')
The correlation matrix appears to confirm my assumptions: the social variables (gender, education, and life expectancy inequalities) and the economic variables (wealth and income) are highly correlated within their groups, but only weakly across them. The matrix also shows a strong negative correlation between the CO2 variable and the social variables, and a weak correlation with the economic ones.
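The visual reading of the correlation plot can be backed by a simple summary: the mean absolute Spearman correlation within groups compared with the mean across groups. The sketch below runs on synthetic stand-in data. The column names mirror the report, but the values and the `groups` assignment are assumptions made for illustration, and the same steps would apply to `df_num`.

```r
# Synthetic stand-in data: column names mirror the report, values are simulated
set.seed(1)
n <- 155
econ_factor   <- rnorm(n)
social_factor <- rnorm(n)
sim <- data.frame(
  Income_p99p100 = econ_factor + rnorm(n, sd = 0.5),
  Wealth_p99p100 = econ_factor + rnorm(n, sd = 0.5),
  gii            = social_factor + rnorm(n, sd = 0.5),
  ineq_edu       = social_factor + rnorm(n, sd = 0.5),
  ineq_le        = social_factor + rnorm(n, sd = 0.5)
)

# Hypothetical assignment of each variable to a category
groups <- c(Income_p99p100 = "economic", Wealth_p99p100 = "economic",
            gii = "social", ineq_edu = "social", ineq_le = "social")

rho <- cor(sim, method = "spearman")
g <- groups[colnames(rho)]
same_group <- outer(g, g, "==")  # TRUE where both variables share a category
lower <- lower.tri(rho)          # count each pair once, excluding the diagonal

within_mean  <- mean(abs(rho[lower & same_group]))
between_mean <- mean(abs(rho[lower & !same_group]))
round(c(within = within_mean, between = between_mean), 2)
```

A large gap between the two means supports the block structure. Similar values would indicate that the groups are not separable on correlation alone.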
KMO(correlation)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = correlation)
## Overall MSA = 0.79
## MSA for each item =
## Income_p99p100 Wealth_p99p100 CO2Emission_p90p100 gii
## 0.67 0.57 0.91 0.75
## ineq_edu ineq_le
## 0.93 0.79
cortest.bartlett(correlation, n = nrow(df_num))
## $chisq
## [1] 813.2199
##
## $p.value
## [1] 1.276562e-163
##
## $df
## [1] 15
The preliminary tests indicate that the dataset is suitable for Principal Component Analysis. The overall KMO score of 0.79 falls at the top of Kaiser’s ‘middling’ range (0.70 to 0.79), just below ‘meritorious’ (0.80 and above), meaning the variables share enough common variance to be grouped into dimensions. The null hypothesis of Bartlett’s test, that the correlation matrix is an identity matrix, is rejected (χ² = 813.22, df = 15, p < 0.001), so I can assume that the variables are related and proceed with dimension reduction.
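For readers who want to see what Bartlett’s test computes, the statistic can be reproduced in base R from the determinant of the correlation matrix: χ² = −(n − 1 − (2p + 5)/6) · ln|R|, with p(p − 1)/2 degrees of freedom. The function name `bartlett_sphericity` and the simulated data below are hypothetical and used for illustration only.

```r
# Bartlett's test of sphericity computed by hand (base R only)
# R: correlation matrix, n: number of observations
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Synthetic example: six variables driven by one shared factor
set.seed(42)
n <- 155
shared <- rnorm(n)
X <- sapply(1:6, function(j) shared + rnorm(n))

res <- bartlett_sphericity(cor(X, method = "spearman"), n)
res$df       # six variables give 6 * 5 / 2 = 15 degrees of freedom
res$p.value  # small when the variables are correlated
```

With six variables the degrees of freedom equal 15, the same value reported in the output above. For an identity correlation matrix the determinant is 1 and the statistic is 0.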
pca <- prcomp(df_num, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.7840 1.2879 0.76347 0.53912 0.44963 0.28814
## Proportion of Variance 0.5304 0.2764 0.09715 0.04844 0.03369 0.01384
## Cumulative Proportion 0.5304 0.8069 0.90403 0.95247 0.98616 1.00000
eig_val <- get_eigenvalue(pca)
eig_val$eigenvalue
## [1] 3.18256635 1.65871422 0.58287887 0.29064895 0.20216410 0.08302751
fviz_eig(pca, choice = 'eigenvalue', addlabels = TRUE, main = "Eigenvalues", barfill = 'hotpink2')
To test my hypothesis regarding the dual nature of inequality, I retain only the first two principal components (PC1 and PC2). This choice is supported by the statistical evidence: the scree plot shows that only the first two components have eigenvalues greater than one (Kaiser’s criterion), at 3.18 and 1.66, while PC3, at 0.58, falls well below this threshold. Together, these two components explain about 81% of the total variance in the dataset.
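The retention decision can be checked directly from the eigenvalues printed earlier. The short sketch below copies those six values into a vector (the variable names are my own) and derives the proportion of variance, the cumulative proportion, and the components that pass Kaiser’s criterion.

```r
# Eigenvalues as reported in the output above
eigenvalues <- c(3.18256635, 1.65871422, 0.58287887,
                 0.29064895, 0.20216410, 0.08302751)

# With standardized variables the eigenvalues sum to the number of variables (6)
sum(eigenvalues)

prop_var <- eigenvalues / sum(eigenvalues)  # proportion of variance per component
cum_var  <- cumsum(prop_var)                # cumulative proportion

kaiser_keep <- which(eigenvalues > 1)       # components with eigenvalue above 1
kaiser_keep                                 # 1 2
round(cum_var[2], 4)                        # 0.8069
```

Kaiser’s criterion is a rule of thumb, so it is worth reading it together with the scree plot rather than on its own.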
fviz_pca_var(pca,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE,
title = "Correlation Circle",
xlab = "PC1",
ylab = "PC2")
PC1 <- fviz_contrib(pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(pca, choice = "var", axes = 2)
grid.arrange(PC1, PC2, ncol =1)
The correlation circle shows a clear division between economic and social inequalities. The variables representing the social dimension, Gender Inequality (GII), life expectancy, and education, align closely with PC1. According to the contribution plots, these variables are the primary drivers of this dimension, each contributing between 25% and 30%. Their near-perpendicular orientation relative to the economic variables is consistent with social inequality being a distinct phenomenon rather than a simple derivative of financial wealth.
The economic dimension is almost entirely defined by wealth and income inequality, which dominate PC2, each contributing over 40%. This assignment follows from the eigenvalues: a variable’s contribution to a component cannot exceed 100% divided by that component’s eigenvalue, so contributions above 40% are possible on PC2 (eigenvalue 1.66) but not on PC1 (eigenvalue 3.18, maximum about 31%). The separation suggests that a country’s economic distribution does not automatically dictate its social outcomes.
Regarding environmental inequality (\(CO_{2}\) p90p100), its contribution falls below the reference line in the contribution plots for both dimensions. That line marks the expected average contribution if all six variables contributed equally (100/6 ≈ 16.7%); it is not a test of statistical significance. While its overall contribution is lower than that of the other indicators, it leans more towards the social dimension (nearly 15%) than the economic one (under 10%). This suggests that extreme carbon footprints are more closely linked to social structure and development patterns than to capital accumulation alone.
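The contribution percentages discussed above can be derived from a `prcomp` object without plotting. For a PCA on standardized variables, a variable’s contribution to a component is 100 times its squared loading, and the squared loading times the eigenvalue is the squared correlation between the variable and the component. The sketch below uses synthetic data with made-up variable names (`a1`, `b1`, and so on), so the numbers are illustrative only.

```r
# Synthetic data: three variables from one factor, two from another (hypothetical)
set.seed(7)
n <- 155
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(
  a1 = f1 + rnorm(n, sd = 0.4),
  a2 = f1 + rnorm(n, sd = 0.4),
  a3 = f1 + rnorm(n, sd = 0.4),
  b1 = f2 + rnorm(n, sd = 0.4),
  b2 = f2 + rnorm(n, sd = 0.4)
)

pca_sim <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- pca_sim$sdev^2                 # eigenvalues of the correlation matrix

# Percent contribution of each variable to each component
contrib <- 100 * pca_sim$rotation^2
colSums(contrib)                      # every component's contributions sum to 100

# Expected contribution if all variables contributed equally
reference_line <- 100 / ncol(X)

# Squared variable-component correlations (cos2); none can exceed 1
cos2 <- sweep(pca_sim$rotation^2, 2, eig, "*")

# Hence the largest contribution any single variable can make to each component
max_contrib <- 100 / eig
round(contrib[, 1:2], 1)
```

Because the squared correlation is capped at 1, a component with a large eigenvalue spreads its contributions across several variables, while a component with an eigenvalue closer to 1 can be dominated by one or two.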