Investigation of relationship between democracy and GDP using Clustering methods

One can suggest that there exists some kind of relationship between the level of democracy and overall level of economic well-being in a country. Under autocracy, the real GDP is expected to be smaller than in case of democracy, and we can recall some examples from history. The problem is that there exists no unified democracy indicator, as well as economic well-being measure. Nevertheless, in this project, I tried to investigate whether there exists relationship between democracy level and economic well-being. 7 democracy measures proposed by V-Dem and 2 GDP measures (inflation adjusted in national prices and expenditure side expressed in USD) were used for cluster analysis.

The first step of the analysis is data preparation.The economic indicators were taken from World Penn Tables. The last available year is 2019, so data from this year was taken. “rgdpna” represents real GDP in national prices, whereas “cgdpe” is expenditure-side GDP.

library(readxl) 
economy<-read_excel("pwt1001_data.xlsx") 
economy1<-economy[economy$year==2019, ] 
economy2<-economy1[, c("country", "rgdpna", "cgdpe")] 
economy2<-as.data.frame(economy2)

head(economy2)

##                country      rgdpna       cgdpe
## 1                Aruba   3068.7583   3912.3347
## 2               Angola 222151.0625 227771.6094
## 3             Anguilla    223.4567    375.1364
## 4              Albania  37204.7734  35808.3438
## 5 United Arab Emirates 647986.2500 678241.1875
## 6            Argentina 975569.0000 991465.4375

After that, the operation was repeated with data on democracy indicators. The original data was taken from V-Dem database. “v2x_polyarchy” index measures to what extend the ideal of electoral democracy is achieved, “v2x_libdem” stands for liberal democracy measure, “v2x_partipdem” represents participatory democracy, “v2x_delibdem” shows to what extend the ideal of deliberative democracy is fulfilled, “v2x_egaldem” stands for egaliterian democracy index, “v2x_free_altinf” represents to what extend media is free and independent from government, and finally “v2xcl_rol” shows accessibility of justice for citizens.

democracy<-read.csv("V-Dem-CY-Full+Others-v14.csv")
demo1<-democracy[democracy$year==2019, ]
demo2<-demo1[, c('country_name','v2x_polyarchy','v2x_libdem',"v2x_partipdem", "v2x_delibdem", "v2x_egaldem", "v2x_freexp_altinf", "v2xcl_rol")]
colnames(demo2)<-c('country','v2x_polyarchy','v2x_libdem',"v2x_partipdem", "v2x_delibdem", "v2x_egaldem", "v2x_freexp_altinf", "v2xcl_rol")

head(demo2)

##           country v2x_polyarchy v2x_libdem v2x_partipdem v2x_delibdem
## 231        Mexico         0.675      0.434         0.412        0.421
## 355      Suriname         0.742      0.580         0.467        0.604
## 590        Sweden         0.901      0.871         0.649        0.833
## 816   Switzerland         0.907      0.866         0.805        0.877
## 938         Ghana         0.725      0.615         0.357        0.606
## 1062 South Africa         0.711      0.607         0.462        0.617
##      v2x_egaldem v2x_freexp_altinf v2xcl_rol
## 231        0.394             0.792     0.604
## 355        0.540             0.841     0.772
## 590        0.824             0.965     0.987
## 816        0.838             0.979     0.984
## 938        0.521             0.873     0.892
## 1062       0.474             0.845     0.841

Finally, we can combine data sets and delete empty observations

combined<-merge(demo2, economy2, by="country", all.x=TRUE, all.y=FALSE)
combined<-na.omit(combined)

head(combined)

##     country v2x_polyarchy v2x_libdem v2x_partipdem v2x_delibdem v2x_egaldem
## 2   Albania         0.517      0.432         0.314        0.265       0.375
## 3   Algeria         0.292      0.148         0.105        0.200       0.256
## 4    Angola         0.366      0.184         0.102        0.202       0.155
## 5 Argentina         0.789      0.631         0.526        0.602       0.621
## 6   Armenia         0.800      0.631         0.565        0.669       0.681
## 7 Australia         0.825      0.780         0.591        0.750       0.707
##   v2x_freexp_altinf v2xcl_rol     rgdpna      cgdpe
## 2             0.705     0.924   37204.77   35808.34
## 3             0.514     0.571  481677.12  487570.81
## 4             0.633     0.564  222151.06  227771.61
## 5             0.863     0.837  975569.00  991465.44
## 6             0.871     0.908   43260.20   41045.19
## 7             0.894     0.966 1315734.25 1277449.00

Before we can proceed to data analysis we must standardize the data

data_prep<-combined[, 2:10]
data_standart<-as.data.frame(lapply(data_prep, scale))

As the first step, I calculated Hopkins statistic

library("hopkins")
library("factoextra")
get_clust_tendency(data_standart, 8, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed=123)

## $hopkins_stat
## [1] 0.9319497
## 
## $plot

Hopkins statistic equals to 0.93199497, so we can suggest that the data is not random and clusterable.

Now we need to decide on optimal number of clusters for k-means proceedure. The first method applied is gap statistics.

library('gridExtra')
library('factoextra')
opt_kmeans_gap<-fviz_nbclust(data_standart, FUNcluster=kmeans, method="gap")+theme_classic()
grid.arrange(opt_kmeans_gap)

From the graph, we can see that the optimal number of clusters is two. Although it does not corresponds to the maximum gap statistic, it is the smallest value that satisfies the condition \(\text{Gap}(k)\geq\text{Gap}(k-1)-s_{k+1}\)

We can also compare the results with silhouette method.

library('ClusterR')
opt2<-Optimal_Clusters_KMeans(data_standart, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

As we can see from the graph, average silhouette score is maximized when number of clusters is two.

Then, we can utilize k-means procedure with two clusters

library('stats')
km_2cl<-kmeans(data_standart, 2)

To check the quality of clustering, I calculated silhouette metrics.

library('cluster')
sil1<-silhouette(km_2cl$cluster, dist(data_standart))
fviz_silhouette(sil1)

##   cluster size ave.sil.width
## 1       1   72          0.40
## 2       2   71          0.56

The average silhouette width is equal to 0.48, meaning that we can consider the clustering reasonable.

Add cluster assignments into the data set.

combined$cluster<-km_2cl$cluster

Now, we can calculate mean values of each indicator per cluster

means<-matrix(0, ncol=10, nrow=2)
means[,1]<-c(1,2)
colnames(means)<-c("cluster", colnames(combined[2:10]))
for (i in 2:10){
 means[, i]<-as.data.frame((aggregate(data=combined, combined[, i]~cluster , mean)))[, 2]
}
means<-as.data.frame(means)
means[, c(1:5)]

##   cluster v2x_polyarchy v2x_libdem v2x_partipdem v2x_delibdem
## 1       1     0.3398750  0.2156806     0.1952917    0.2252222
## 2       2     0.7666197  0.6594789     0.5254085    0.6486620

means[, c(6:10)]

##   v2x_egaldem v2x_freexp_altinf v2xcl_rol   rgdpna    cgdpe
## 1   0.2150278         0.5075139 0.5397222 652493.3 638262.2
## 2   0.6212113         0.8763099 0.8921549 641727.1 647574.5

It can clearly be seen that in case of the first cluster, the average of each democracy index is significantly lower that in case of the second one. However, the mean of GDP indicators do not have big difference. It might be so due to the fact that China, the country with the second highest GDP, was included into the first cluster.

To prepare a world map painted with clusters we need the following packages

library("rnaturalearth")
library("rnaturalearthdata")
library("sf")
library("dplyr")
library("ggplot2")

Now, we can extract the world data

world<-ne_countries(scale="medium", returnclass="sf")

Then, it is needed to combine this data with cluster assignments. Our data set has some missing countries, so we need to replace NA values for cluster assignment in world data set with some input. In our case it is “Missing data”

country_clust<-combined[, c(1,11)]
world_clustered<-world %>%
  left_join(country_clust, by=c("name"="country"))
world_clustered<-world_clustered %>%
  mutate(cluster=
ifelse(is.na(cluster), "Missing data", as.factor(cluster)))

Finally, we can proceed to the construction of the map

ggplot(world_clustered) +
  geom_sf(aes(fill = as.factor(cluster)), color = "black", size = 0.1) +
  scale_fill_manual(values = c("1" = "lightblue", "2" = "salmon"), na.value = "gray90") +
  theme_minimal() +
  labs(fill = "Cluster", title = "World Map with Clusters") +
  theme(panel.background = element_rect(fill = "white"))

From the map, we can derive conclusion that the countries from the first (relatively democratic) cluster are either located in Western Europe or were colonized by Western European countries in the past.

Let us proceed with PAM clustering.

PAM clustering

Derive optimal number of clusters

opt_md3<-fviz_nbclust(data_standart, FUNcluster=cluster::pam, method="silhouette")+theme_classic()
grid.arrange(opt_md3)

And confirm the result with gap statistic

opt_md4<-fviz_nbclust(data_standart, FUNcluster=cluster::pam, method="gap")+theme_classic()
grid.arrange(opt_md4)

According to silhouette method, optimal number of clusters is two. However, according to gap statistic, the optimal number is three.

PAM with 2 clusters

pam2cl<-eclust(data_standart, "pam", k=2, metric="euclidean")

Let us calculate silhouette.

fviz_silhouette(pam2cl)

##   cluster size ave.sil.width
## 1       1   80          0.38
## 2       2   63          0.59

To find the most appropriate number of clusters we need to repeat the procedure with 3 clusters.

pam3cl<-eclust(data_standart, "pam", k=3, metric="euclidean")

fviz_silhouette(pam3cl)

##   cluster size ave.sil.width
## 1       1   58          0.38
## 2       2   42          0.34
## 3       3   43          0.55

In case of two clusters, the average silhouette width is higher, which applies that utilizing two clusters might be more appropriate.

Let us add cluster assignemts into the table.

combined$clustersPAM<-pam2cl$clustering

Now, we can visualize the results on the world map.

country_clust2<-combined[, c(1,12)]
world_clustered2<-world %>%
  left_join(country_clust2, by=c("name"="country"))


world_clustered2<-world_clustered2 %>%
  mutate(clustersPAM=
ifelse(is.na(clustersPAM), "Missing data", as.factor(clustersPAM)))



ggplot(world_clustered2) +
  geom_sf(aes(fill = as.factor(clustersPAM)), color = "black", size = 0.1) +
  scale_fill_manual(values = c("1" = "lightblue", "2" = "salmon"), na.value = "gray90") +
  theme_minimal() +
  labs(fill = "Cluster", title = "World Map with Clusters") +
  theme(panel.background = element_rect(fill = "white"))

As we can see, the cluster assignments do not differ significantly in case of k-means and PAM clustering.

Hierarchical clustering

Define the appropriate number of clusters or “tree cuts”.

opt_tree<-fviz_nbclust(data_standart, FUNcluster=hcut, method="gap")+theme_classic()
grid.arrange(opt_tree)

And compare the results with silhouette method

opt_tree_silh<-fviz_nbclust(data_standart, FUNcluster=hcut, method="silhouette")+theme_classic()
grid.arrange(opt_tree_silh)

According to the gap method. the most appropriate number of clusters is two, whereas silhouette method shows that the optimal number is three.

Data preparation

combined_h<-combined
rownames(combined_h)<-combined$country
combined_h<-combined_h[, -1]
data_scaled<-scale(combined_h)

Now, lets decide upon the most appropriate method by calculating agglomerative coefficients.

hc2<-agnes(data_scaled, method="complete")
hc3<-agnes(data_scaled, method="ward")
hc4<-agnes(data_scaled, method="single")
hc5<-agnes(data_scaled, method="average")
ac_vec<-c(hc2$ac, hc3$ac, hc4$ac, hc5$ac)
ac_vec<-as.data.frame(ac_vec)
rownames(ac_vec)<-c("complete", "ward", "single", "average")
ac_vec

##             ac_vec
## complete 0.9580168
## ward     0.9843697
## single   0.9317682
## average  0.9543099

The Ward method gives the highest agglomerative coefficient, consequently we apply this method.

tree_ward<-pltree(hc3, cex=0.4, hang=-1, main="agnes-ward")

Let us cut the tree into 2 parts.

library('dendextend')
ward_cut<-cutree(as.hclust(hc3), k=2)
table(ward_cut)

## ward_cut
##  1  2 
## 72 71

ward_cut_3cl<-cutree(as.hclust(hc3), k=3)
table(ward_cut_3cl)

## ward_cut_3cl
##  1  2  3 
## 70 71  2

In case we apply 3 clusters, one of the clusters will have only 2 observations, which does not make much sense.

Cut tree dendrogram by 2 clusters.

plot(as.hclust(hc3), cex = 0.4)
rect.hclust(as.hclust(hc3), k = 2, border = 1:2)

To visualize the data, we need to add hierarchical clustering assignments.

combined$h_clusters<-ward_cut

country_clust_h<-combined[, c(1,13)]
world_clustered_h<-world %>%
  left_join(country_clust_h, by=c("name"="country"))


world_clustered_h<-world_clustered_h %>%
  mutate(h_clusters=
ifelse(is.na(h_clusters), "Missing data", as.factor(h_clusters)))



ggplot(world_clustered_h) +
  geom_sf(aes(fill = as.factor(h_clusters)), color = "black", size = 0.1) +
  scale_fill_manual(values = c("1" = "lightblue", "2" = "salmon"), na.value = "gray90") +
  theme_minimal() +
  labs(fill = "Cluster", title = "World Map with Clusters") +
  theme(panel.background = element_rect(fill = "white"))

As we can clearly see, in case we utilize 2 clusters for each of three algorithms, we get almost the same cluster assignment.

Summary

To analyse the relationship between the level of democracy and GDP, three clustering methods were applied. In all three cases, the most appropriate number of clusters was two. Based on the conducted analysis, it is difficult to say whether such relationship exists. There is a possibility that changing the sample or applying different methods might clarify it.

Sources:

1.https://www.rug.nl/ggdc/productivity/pwt/?lang=en, Penn World tables. 2.https://v-dem.net/data/the-v-dem-dataset/, V-Dem data set.

Investigation of relationship between democracy and GDP using Clustering methods

Kostiantyn Okhrimenko

2024-12-19