Introduction

Wealth and income inequality currently ranks among the most widely studied subjects in economic research and literature. The surge of interest in how inequality shapes economic reality can be largely attributed to the publication of Thomas Piketty’s Capital in the Twenty-First Century. Following this global trend, Polish academia has recently produced its first major comprehensive study on the effects of inequality and potential policy solutions: Nierówności po polsku (“Inequalities, Polish Style”; Brzeziński, Sawulski, Bukowski, 2024).

This project tests theses, put forward by some authors, on how rising inequality can be mitigated. Using clustering algorithms, I examine whether the strength of redistribution and the level of corruption separate countries into groups with distinct levels of inequality.

The primary goal is to determine whether lower perceived corruption and higher public expenditure (on healthcare and education) are associated with a smaller share of wealth held by the top 10% of the population.

DATA

The data used in this project comes from three distinct sources:

World Development Indicators (World Bank): From this database, I have extracted data on public expenditures for healthcare and education, expressed as a percentage of GDP. These variables represent the level of state investment in human capital and social services, and will be used as a measure of redistribution.
World Inequality Database (WID): This database is managed by the World Inequality Lab, led by researchers such as Thomas Piketty and Gabriel Zucman. For this analysis, I am using the share of total wealth owned by the richest 10% of the population. This measure is preferred over the traditional Gini index, as WID researchers argue it provides a more accurate reflection of wealth concentration and structural inequality.
Varieties of Democracy (V-Dem): This dataset provides high-quality indicators of socio-economic and political phenomena. I will be using the Public Sector Corruption Index, which is constructed from expert assessments of each analysed country. The index operates on an interval scale (from 0 to 1), where higher values represent higher corruption within the public sector.

# Loading the required libraries
library(readxl)
library(WDI)
library(tidyr)
library(dplyr)
library(stringr)
library(countrycode)
library(corrplot)
library(factoextra)
library(ggplot2)
library(gridExtra)
library(cluster)
library(fpc)
library(knitr)

# Loading data
# Loading World Development Indicators
wdi <- WDI(country = "all", 
           indicator = c("SE.XPD.TOTL.GD.ZS", "SH.XPD.CHEX.GD.ZS"), 
           start = 2000)

wdi <- wdi %>%
  rename(
    Year = year,
    Country = country, 
    Education = SE.XPD.TOTL.GD.ZS,
    Health = SH.XPD.CHEX.GD.ZS
  ) 

# Loading V-Dem data 
vdem <- readRDS("V-Dem-CY-Full+Others-v13.rds")
vdem <- subset(vdem, vdem$year >= 2000)
vdem <- vdem[c("country_name", "year", "v2x_pubcorr")]
vdem <- rename(vdem, Country = country_name, Year = year, Corruption = v2x_pubcorr)

# Adding ISO codes for easier merging
vdem$iso3c <- countrycode(sourcevar = vdem$Country, 
                          origin = "country.name", 
                          destination = "iso3c")

# Loading WID data and keeping the wealth share of the top 10% (p90p100)
wid <- read_excel("WID_Dane.xlsx")
wid <- wid %>%
  mutate(
    type = case_when(
      startsWith(Var, "sptinc") ~ "Income",
      startsWith(Var, "shweal") ~ "Wealth"
    )
  )
wid$type <- paste(wid$type, wid$Code, sep = "_")
wid <- wid[wid$type == "Wealth_p90p100", ]
wid <- wid %>%
  select(Country, Year, type, Values) %>%
  pivot_wider(names_from = type, values_from = Values)
wid$iso3c <- countrycode(sourcevar = wid$Country, origin = "country.name", destination = "iso3c")


# Merging the data on ISO country code and year, then adding the Redistribution
# variable: the sum of expenditures on health and education
DF <- wdi %>%
  left_join(select(wid, -Country), by = c("iso3c", "Year")) %>%
  left_join(select(vdem, -Country), by = c("iso3c", "Year"))
# Converting percentages of GDP to fractions
DF$Education <- DF$Education / 100
DF$Health <- DF$Health / 100
DF$Redistribution <- DF$Education + DF$Health
DF <- subset(DF, select = -c(iso2c))
# Extracting data for 2021 and checking for missing values
df <- DF[DF$Year == 2021, ]
rownames(df) <- NULL
kable(head(df, 10))

Country	iso3c	Year	Education	Health	Wealth_p90p100	Corruption	Redistribution
Afghanistan	AFG	2021	NA	0.2150844	0.5819	0.394	NA
Africa Eastern and Southern	AFE	2021	0.0436838	0.0594259	NA	NA	0.1031097
Africa Western and Central	AFW	2021	0.0309693	0.0413562	NA	NA	0.0723255
Albania	ALB	2021	0.0300556	0.0735750	0.5696	0.631	0.1036306
Algeria	DZA	2021	0.0551403	0.0502189	0.6167	0.611	0.1053592
American Samoa	ASM	2021	NA	NA	NA	NA	NA
Andorra	AND	2021	0.0238192	0.0864672	0.5730	NA	0.1102864
Angola	AGO	2021	0.0229720	0.0273987	0.6923	0.518	0.0503707
Antigua and Barbuda	ATG	2021	0.0250897	0.0481315	0.6048	NA	0.0732213
Arab World	ARB	2021	NA	0.0525929	NA	NA	NA

missing_values <- colSums(is.na(df))
complete_rows <-sum(complete.cases(df))
cat("Number of countries without missing data:", complete_rows)

## Number of countries without missing data: 147

kable(as.data.frame(missing_values), col.names = "Number of NAs", caption = "Missing values per variable")

Missing values per variable
	Number of NAs
Country	0
iso3c	0
Year	0
Education	57
Health	25
Wealth_p90p100	63
Corruption	92
Redistribution	66

Clustering aims to identify distinct patterns across nations, and redistribution and corruption levels are driven by country-specific factors, so replacing missing values with a mean or median would distort the results. Countries with incomplete records were therefore removed. The remaining 147 complete records for 2021 provide enough information for the analysis; once duplicated ISO codes are dropped in the next step, 142 countries enter the clustering, as the cluster sizes reported later show.

df1 <- na.omit(df)
df1 <- df1[, c("Country", "Year", "iso3c", "Education", "Health", "Corruption", "Wealth_p90p100", "Redistribution")]
# Standardizing the numeric variables
df_st <- scale(df1[c("Education", "Health", "Corruption", "Wealth_p90p100", "Redistribution")])
df_st <- as.data.frame(df_st)
# Spearman rank correlations between the standardized variables
cor_matrix <- cor(df_st[, c("Education", "Health", "Redistribution", "Corruption", "Wealth_p90p100")], method = "spearman")
df_st$Country <- df1$Country
df_st$iso3c <- df1$iso3c
df_st$Year <- df1$Year
# Keeping one row per ISO code so that the codes can serve as row names
df_st <- df_st[!duplicated(df_st$iso3c), ]
rownames(df_st) <- df_st$iso3c
corrplot(cor_matrix)

Judging by the correlation matrix, the relationships of both redistribution and corruption with inequality are weaker than expected (low correlations). There is, however, a strong negative correlation between redistribution and corruption, suggesting that higher social spending often coexists with lower levels of perceived corruption.
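Whether a rank correlation is merely low or also statistically indistinguishable from zero can be checked with `cor.test`. The sketch below is only an illustration on synthetic data: the data frame `toy`, its coefficients, and the helper list `pairs_to_test` are assumptions and do not reproduce the country data.

```r
# Synthetic stand-in for the country table (all values are made up).
set.seed(42)
n <- 142
toy <- data.frame(Corruption = runif(n))
toy$Redistribution <- 0.15 - 0.08 * toy$Corruption + rnorm(n, sd = 0.02)
toy$Wealth_p90p100 <- 0.60 + rnorm(n, sd = 0.05)

# Variable pairs whose Spearman correlation we want to test.
pairs_to_test <- list(
  c("Redistribution", "Corruption"),
  c("Redistribution", "Wealth_p90p100"),
  c("Corruption", "Wealth_p90p100")
)

# One row per pair: the rank correlation and the p-value of the test.
spearman_tests <- do.call(rbind, lapply(pairs_to_test, function(p) {
  ct <- cor.test(toy[[p[1]]], toy[[p[2]]], method = "spearman", exact = FALSE)
  data.frame(x = p[1], y = p[2], rho = unname(ct$estimate), p_value = ct$p.value)
}))
print(spearman_tests)
```

Applied to the real data, the same loop could run over the columns of `df_st`, since Spearman correlations depend only on ranks and are unaffected by standardization.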

K-MEANS

Before running the k-means algorithm, I need to check whether the data show a tendency to cluster. I will also look for the optimal number of clusters using the silhouette statistic.
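The clustering-tendency check relies on the Hopkins statistic. The sketch below is a minimal base-R version of one common formulation, not the code that `get_clust_tendency` runs internally; the function name `hopkins_sketch`, the sample size `m`, and the toy data are assumptions made for illustration. Under this formulation, values near 0.5 indicate spatially random data and values approaching 1 indicate clustered data.

```r
set.seed(123)

# Hypothetical helper: Hopkins statistic as sum(u) / (sum(u) + sum(w)).
hopkins_sketch <- function(x, m = 30) {
  x <- as.matrix(x)
  # m real observations drawn without replacement
  idx <- sample(nrow(x), m)
  # m artificial points drawn uniformly from the bounding box of the data
  fake <- apply(x, 2, function(col) runif(m, min(col), max(col)))
  # Euclidean distance from point p to its nearest row in data
  nearest <- function(p, data) min(sqrt(rowSums(sweep(data, 2, p)^2)))
  # w: nearest-neighbour distances of real points (the point itself excluded)
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))
  # u: nearest-neighbour distances from artificial points to the real data
  u <- sapply(seq_len(m), function(j) nearest(fake[j, ], x))
  sum(u) / (sum(u) + sum(w))
}

# Two tight blobs versus uniformly scattered points.
clustered <- rbind(cbind(rnorm(100, 0, 0.3), rnorm(100, 0, 0.3)),
                   cbind(rnorm(100, 5, 0.3), rnorm(100, 5, 0.3)))
uniform <- cbind(runif(200), runif(200))

h_clustered <- hopkins_sketch(clustered)
h_uniform <- hopkins_sketch(uniform)
print(c(clustered = h_clustered, uniform = h_uniform))
```

Because the statistic depends on random sampling, fixing the seed is needed for reproducible values.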

hop_corruption <-get_clust_tendency(df_st[,c("Wealth_p90p100", "Corruption")], n=nrow(df_st)-1, graph=FALSE)
hop_corruption

## $hopkins_stat
## [1] 0.7579532
## 
## $plot
## NULL

hop_redistribution<-get_clust_tendency(df_st[,c("Wealth_p90p100", "Redistribution")], n=nrow(df_st)-1, graph=FALSE)
hop_redistribution

## $hopkins_stat
## [1] 0.7838144
## 
## $plot
## NULL

p_sil1 <- fviz_nbclust(df_st[, c("Corruption", "Wealth_p90p100")], 
                       kmeans, method = "silhouette") + labs(title = "Corruption")

p_sil2 <- fviz_nbclust(df_st[, c("Redistribution", "Wealth_p90p100")], 
                       kmeans, method = "silhouette") + labs(title = "Redistribution")


grid.arrange(p_sil1, p_sil2, ncol = 1)

Both variable pairs are suitable for clustering: the Hopkins statistic is about 0.76 for corruption and 0.78 for redistribution. According to the average silhouette width, the optimal number of clusters for both pairs is three.
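Choosing the number of clusters by silhouette width amounts to running the algorithm for several values of k and keeping the k with the highest average width. The sketch below shows this idea with `kmeans` and `cluster::silhouette` on synthetic data; it is conceptually similar to the silhouette plot above but is not meant as a copy of what `fviz_nbclust` does internally. The helper `make_blob` and the data `toy` are assumptions.

```r
library(cluster)
set.seed(1)

# Hypothetical helper: a two-dimensional Gaussian blob around (cx, cy).
make_blob <- function(n, cx, cy, sd = 0.6) cbind(rnorm(n, cx, sd), rnorm(n, cy, sd))

# Three synthetic groups of 50 points each.
toy <- rbind(make_blob(50, 0, 0), make_blob(50, 3, 0), make_blob(50, 0, 3))
d <- dist(toy)

# Average silhouette width for k = 2, ..., 6.
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

# The candidate k with the largest average silhouette width.
best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

Using `nstart = 25` makes each k-means fit less sensitive to the random starting centres, which matters when fits for different k are compared.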

Having assessed the clusterability of the data and determined the appropriate number of clusters, it is now time to perform the K-Means clustering analysis.

km_corr <- eclust(df_st[, c("Corruption", "Wealth_p90p100")], k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=F)
df_st$cluster_corr <- as.factor(km_corr$cluster)

km_red <- eclust(df_st[, c("Redistribution", "Wealth_p90p100")], k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=F)
df_st$cluster_red <- as.factor(km_red$cluster)

km_corr_plot <- fviz_cluster(km_corr, data = df_st, ellipse.type = "convex", geom = c("point")) + ggtitle("Corruption Clustering")
sil_corr_km <- fviz_silhouette(km_corr)

##   cluster size ave.sil.width
## 1       1   65          0.58
## 2       2   20          0.33
## 3       3   57          0.47

km_red_plot <- fviz_cluster(km_red, data = df_st, ellipse.type = "convex", geom = c("point")) + ggtitle("Redistribution Clustering")
sil_red_km <- fviz_silhouette(km_red)

##   cluster size ave.sil.width
## 1       1   67          0.46
## 2       2   17          0.34
## 3       3   58          0.41

grid.arrange(km_corr_plot, sil_corr_km, km_red_plot, sil_red_km, ncol=2)

Average silhouette widths of roughly 0.50 (corruption) and 0.43 (redistribution) suggest a ‘weak’ cluster structure. The separation is not perfect, but this is expected in cross-country analysis: nations do not fit into rigid categories and often lie on a spectrum. Before analyzing the results, I would like to see whether clustering with the PAM algorithm instead increases the average silhouette width.
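The per-cluster tables printed above can be condensed into a single overall figure by weighting each cluster's average width by its size. The sketch below illustrates the calculation on synthetic data with `cluster::silhouette`; the data `toy` and the table name `summary_tab` are assumptions.

```r
library(cluster)
set.seed(2)

# Synthetic data: three loosely separated groups of 50 points.
toy <- rbind(cbind(rnorm(50, 0, 1), rnorm(50, 0, 1)),
             cbind(rnorm(50, 4, 1), rnorm(50, 0, 1)),
             cbind(rnorm(50, 0, 1), rnorm(50, 4, 1)))

km <- kmeans(toy, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(toy))

# Per-cluster size and average silhouette width.
summary_tab <- data.frame(
  cluster = sort(unique(km$cluster)),
  size = as.vector(table(km$cluster)),
  ave.sil.width = as.vector(tapply(sil[, "sil_width"], sil[, "cluster"], mean))
)

# The size-weighted mean of the cluster averages equals the overall average.
overall <- mean(sil[, "sil_width"])
weighted <- weighted.mean(summary_tab$ave.sil.width, summary_tab$size)
print(summary_tab)
print(c(overall = overall, weighted = weighted))
```

A small cluster with a low width, such as cluster 2 in both k-means tables, therefore lowers the overall figure only in proportion to its size.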

PAM

pam_corr <- eclust(df_st[, c("Corruption", "Wealth_p90p100")], k=3, FUNcluster="pam", hc_metric="manhattan", graph=F)
df_st$cluster_corr_pam <- as.factor(pam_corr$cluster)

pam_red <- eclust(df_st[, c("Redistribution", "Wealth_p90p100")], k=3, FUNcluster="pam", hc_metric="manhattan", graph=F)
df_st$cluster_red_pam <- as.factor(pam_red$cluster)

pam_corr_plot <- fviz_cluster(pam_corr, data = df_st, ellipse.type = "convex", geom = c("point")) + ggtitle("Corruption Clustering")
sil_corr_pam <- fviz_silhouette(pam_corr)

##   cluster size ave.sil.width
## 1       1   63          0.60
## 2       2   20          0.29
## 3       3   59          0.45

pam_red_plot <- fviz_cluster(pam_red, data = df_st, ellipse.type = "convex", geom = c("point")) + ggtitle("Redistribution Clustering")
sil_red_pam <- fviz_silhouette(pam_red)

##   cluster size ave.sil.width
## 1       1   52          0.50
## 2       2   49          0.10
## 3       3   41          0.43

grid.arrange(pam_corr_plot, sil_corr_pam, pam_red_plot, sil_red_pam, ncol=2)

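Unfortunately, switching to the PAM algorithm did not help: the average silhouette widths did not improve, and one redistribution cluster drops to 0.10. I will therefore rely on the k-means results when drawing the final conclusions.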
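A comparison of this kind is easiest to read when both partitions are scored against the same distance matrix. The sketch below fits k-means and PAM on synthetic data and evaluates both with Euclidean silhouettes; the data `toy` and the table `comparison` are assumptions, and the numbers are unrelated to the country results.

```r
library(cluster)
set.seed(7)

# Synthetic, standardized data with three overlapping groups.
toy <- scale(rbind(cbind(rnorm(60, 0, 1), rnorm(60, 0, 1)),
                   cbind(rnorm(40, 4, 1), rnorm(40, 1, 1)),
                   cbind(rnorm(40, 1, 1), rnorm(40, 5, 1))))

km <- kmeans(toy, centers = 3, nstart = 25)
pm <- pam(toy, k = 3, metric = "manhattan")

# Score both partitions with the same Euclidean distances.
d <- dist(toy)
comparison <- data.frame(
  method = c("k-means", "PAM (manhattan)"),
  avg_silhouette = c(mean(silhouette(km$cluster, d)[, "sil_width"]),
                     mean(silhouette(pm$clustering, d)[, "sil_width"]))
)
print(comparison)
```

The silhouette stored inside a `pam` object is computed with the metric used for fitting, so recomputing it on a shared distance matrix avoids comparing widths measured on different scales.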

Conclusions

df_st$region <- countrycode(sourcevar = df_st$iso3c, origin = "iso3c", destination = "region")


k1 <- fviz_cluster(km_corr, data = df_st, ellipse.type = "convex", geom = "text") + ggtitle("Corruption Clustering")

p1<- ggplot(df_st, aes(x = region, fill = region)) + 
  geom_bar() + 
  facet_wrap(~cluster_corr, ncol = 2) +  
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  labs(title = "Number of countries from each region by clusters",
       x = "Region",
       y = "Number of Countries",
       fill = "Region")

k2 <- fviz_cluster(km_red, data = df_st, ellipse.type = "convex", geom = "text") + ggtitle("Redistribution Clustering")

p2 <- ggplot(df_st, aes(x = region, fill = region)) + 
  geom_bar() + 
  facet_wrap(~cluster_red, ncol = 2) +  
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  labs(title = "Number of countries from each region by clusters",
       x = "Region",
       y = "Number of Countries",
       fill = "Region")
k1

p1

The clustering results reveal a relationship between institutional quality and wealth distribution. Cluster 3 represents the “ideal” model, dominated by European nations such as Norway, Belgium, and Denmark, which combine minimal corruption with low wealth inequality. In contrast, Cluster 1 (primarily Sub-Saharan Africa, Central Asia, and post-communist states such as Ukraine, Bulgaria, and Moldova) shows that low wealth inequality can coexist with higher corruption.

The most interesting results appear in Cluster 2, where the algorithm groups advanced economies such as the USA and Sweden with highly corrupt nations such as South Africa (ZAF), Mozambique, and Bahrain. This cluster shows that strong institutions and a strong economy do not always result in a fair distribution of wealth.

k2

p2

When it comes to redistribution, cluster 3 represents the most successful social models, where high government spending goes together with lower wealth inequality. It is led by European nations such as the Netherlands, Denmark, and Belgium. This cluster also includes developing countries such as Rwanda, El Salvador, and Honduras. Although their inequality levels do not differ much from those of countries in cluster 1, their level of expenditure on social services is the reason the algorithm grouped them with the European welfare states.

High spending on redistribution does not necessarily eliminate high inequality, as cluster 2 shows. This group includes Latin American and African nations (Mexico, Peru, Namibia) alongside Sweden and the USA. This suggests that even massive state intervention (as seen in Sweden) may not be enough to offset deep-seated wealth concentration. Finally, cluster 1 represents a “Low-Budget Stability” model (e.g., Mali, Vietnam, Cambodia, and Haiti), where inequality stays moderate (standardized values concentrated around zero, i.e. near the sample mean) despite minimal redistribution, likely due to a lack of accumulated private wealth across the population.
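Cluster descriptions such as these are easier to verify when cluster means are reported on the original scale rather than as standardized scores. The sketch below shows one way to build such a profile table with `aggregate`; the data frame `raw`, its country codes, and all values are synthetic assumptions.

```r
set.seed(3)

# Synthetic country table on the original (unstandardized) scale.
raw <- data.frame(
  iso3c = sprintf("C%02d", 1:30),
  Redistribution = c(rnorm(10, 0.05, 0.01), rnorm(10, 0.10, 0.01), rnorm(10, 0.15, 0.01)),
  Wealth_p90p100 = c(rnorm(10, 0.60, 0.02), rnorm(10, 0.75, 0.02), rnorm(10, 0.55, 0.02))
)

# Cluster on standardized values, as in the analysis.
z <- scale(raw[, c("Redistribution", "Wealth_p90p100")])
raw$cluster <- kmeans(z, centers = 3, nstart = 25)$cluster

# Profile each cluster with means of the original variables and its size.
profile <- aggregate(cbind(Redistribution, Wealth_p90p100) ~ cluster,
                     data = raw, FUN = mean)
profile$n <- as.vector(table(raw$cluster))
print(profile)
```

With the real data, the same call on `df1` joined to the cluster labels in `df_st` would show, for example, the average top 10% wealth share in each redistribution cluster.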

Maćko_Wnuk_Clustering

2026-01-20
