One can suggest that there exists some kind of relationship between the level of democracy and overall level of economic well-being in a country. Under autocracy, the real GDP is expected to be smaller than in case of democracy, and we can recall some examples from history. The problem is that there exists no unified democracy indicator, as well as economic well-being measure. Nevertheless, in this project, I tried to investigate whether there exists relationship between democracy level and economic well-being. 7 democracy measures proposed by V-Dem and 2 GDP measures (inflation adjusted in national prices and expenditure side expressed in USD) were used for cluster analysis.
The first step of the analysis is data preparation.The economic indicators were taken from World Penn Tables. The last available year is 2019, so data from this year was taken. “rgdpna” represents real GDP in national prices, whereas “cgdpe” is expenditure-side GDP.
library(readxl)
economy<-read_excel("pwt1001_data.xlsx")
economy1<-economy[economy$year==2019, ]
economy2<-economy1[, c("country", "rgdpna", "cgdpe")]
economy2<-as.data.frame(economy2)
head(economy2)
## country rgdpna cgdpe
## 1 Aruba 3068.7583 3912.3347
## 2 Angola 222151.0625 227771.6094
## 3 Anguilla 223.4567 375.1364
## 4 Albania 37204.7734 35808.3438
## 5 United Arab Emirates 647986.2500 678241.1875
## 6 Argentina 975569.0000 991465.4375
After that, the operation was repeated with data on democracy indicators. The original data was taken from V-Dem database. “v2x_polyarchy” index measures to what extend the ideal of electoral democracy is achieved, “v2x_libdem” stands for liberal democracy measure, “v2x_partipdem” represents participatory democracy, “v2x_delibdem” shows to what extend the ideal of deliberative democracy is fulfilled, “v2x_egaldem” stands for egaliterian democracy index, “v2x_free_altinf” represents to what extend media is free and independent from government, and finally “v2xcl_rol” shows accessibility of justice for citizens.
democracy<-read.csv("V-Dem-CY-Full+Others-v14.csv")
demo1<-democracy[democracy$year==2019, ]
demo2<-demo1[, c('country_name','v2x_polyarchy','v2x_libdem',"v2x_partipdem", "v2x_delibdem", "v2x_egaldem", "v2x_freexp_altinf", "v2xcl_rol")]
colnames(demo2)<-c('country','v2x_polyarchy','v2x_libdem',"v2x_partipdem", "v2x_delibdem", "v2x_egaldem", "v2x_freexp_altinf", "v2xcl_rol")
head(demo2)
## country v2x_polyarchy v2x_libdem v2x_partipdem v2x_delibdem
## 231 Mexico 0.675 0.434 0.412 0.421
## 355 Suriname 0.742 0.580 0.467 0.604
## 590 Sweden 0.901 0.871 0.649 0.833
## 816 Switzerland 0.907 0.866 0.805 0.877
## 938 Ghana 0.725 0.615 0.357 0.606
## 1062 South Africa 0.711 0.607 0.462 0.617
## v2x_egaldem v2x_freexp_altinf v2xcl_rol
## 231 0.394 0.792 0.604
## 355 0.540 0.841 0.772
## 590 0.824 0.965 0.987
## 816 0.838 0.979 0.984
## 938 0.521 0.873 0.892
## 1062 0.474 0.845 0.841
Finally, we can combine data sets and delete empty observations
combined<-merge(demo2, economy2, by="country", all.x=TRUE, all.y=FALSE)
combined<-na.omit(combined)
head(combined)
## country v2x_polyarchy v2x_libdem v2x_partipdem v2x_delibdem v2x_egaldem
## 2 Albania 0.517 0.432 0.314 0.265 0.375
## 3 Algeria 0.292 0.148 0.105 0.200 0.256
## 4 Angola 0.366 0.184 0.102 0.202 0.155
## 5 Argentina 0.789 0.631 0.526 0.602 0.621
## 6 Armenia 0.800 0.631 0.565 0.669 0.681
## 7 Australia 0.825 0.780 0.591 0.750 0.707
## v2x_freexp_altinf v2xcl_rol rgdpna cgdpe
## 2 0.705 0.924 37204.77 35808.34
## 3 0.514 0.571 481677.12 487570.81
## 4 0.633 0.564 222151.06 227771.61
## 5 0.863 0.837 975569.00 991465.44
## 6 0.871 0.908 43260.20 41045.19
## 7 0.894 0.966 1315734.25 1277449.00
Before we can proceed to data analysis we must standardize the data
data_prep<-combined[, 2:10]
data_standart<-as.data.frame(lapply(data_prep, scale))
As the first step, I calculated Hopkins statistic
library("hopkins")
library("factoextra")
get_clust_tendency(data_standart, 8, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed=123)
## $hopkins_stat
## [1] 0.9319497
##
## $plot
Hopkins statistic equals to 0.93199497, so we can suggest that the data is not random and clusterable.
Now we need to decide on optimal number of clusters for k-means proceedure. The first method applied is gap statistics.
library('gridExtra')
library('factoextra')
opt_kmeans_gap<-fviz_nbclust(data_standart, FUNcluster=kmeans, method="gap")+theme_classic()
grid.arrange(opt_kmeans_gap)
From the graph, we can see that the optimal number of clusters is two. Although it does not corresponds to the maximum gap statistic, it is the smallest value that satisfies the condition \(\text{Gap}(k)\geq\text{Gap}(k-1)-s_{k+1}\)
We can also compare the results with silhouette method.
library('ClusterR')
opt2<-Optimal_Clusters_KMeans(data_standart, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
As we can see from the graph, average silhouette score is maximized when number of clusters is two.
Then, we can utilize k-means procedure with two clusters
library('stats')
km_2cl<-kmeans(data_standart, 2)
To check the quality of clustering, I calculated silhouette metrics.
library('cluster')
sil1<-silhouette(km_2cl$cluster, dist(data_standart))
fviz_silhouette(sil1)
## cluster size ave.sil.width
## 1 1 72 0.40
## 2 2 71 0.56
The average silhouette width is equal to 0.48, meaning that we can consider the clustering reasonable.
Add cluster assignments into the data set.
combined$cluster<-km_2cl$cluster
Now, we can calculate mean values of each indicator per cluster
means<-matrix(0, ncol=10, nrow=2)
means[,1]<-c(1,2)
colnames(means)<-c("cluster", colnames(combined[2:10]))
for (i in 2:10){
means[, i]<-as.data.frame((aggregate(data=combined, combined[, i]~cluster , mean)))[, 2]
}
means<-as.data.frame(means)
means[, c(1:5)]
## cluster v2x_polyarchy v2x_libdem v2x_partipdem v2x_delibdem
## 1 1 0.3398750 0.2156806 0.1952917 0.2252222
## 2 2 0.7666197 0.6594789 0.5254085 0.6486620
means[, c(6:10)]
## v2x_egaldem v2x_freexp_altinf v2xcl_rol rgdpna cgdpe
## 1 0.2150278 0.5075139 0.5397222 652493.3 638262.2
## 2 0.6212113 0.8763099 0.8921549 641727.1 647574.5
It can clearly be seen that in case of the first cluster, the average of each democracy index is significantly lower that in case of the second one. However, the mean of GDP indicators do not have big difference. It might be so due to the fact that China, the country with the second highest GDP, was included into the first cluster.
To prepare a world map painted with clusters we need the following packages
library("rnaturalearth")
library("rnaturalearthdata")
library("sf")
library("dplyr")
library("ggplot2")
Now, we can extract the world data
world<-ne_countries(scale="medium", returnclass="sf")
Then, it is needed to combine this data with cluster assignments. Our data set has some missing countries, so we need to replace NA values for cluster assignment in world data set with some input. In our case it is “Missing data”
country_clust<-combined[, c(1,11)]
world_clustered<-world %>%
left_join(country_clust, by=c("name"="country"))
world_clustered<-world_clustered %>%
mutate(cluster=
ifelse(is.na(cluster), "Missing data", as.factor(cluster)))
Finally, we can proceed to the construction of the map
ggplot(world_clustered) +
geom_sf(aes(fill = as.factor(cluster)), color = "black", size = 0.1) +
scale_fill_manual(values = c("1" = "lightblue", "2" = "salmon"), na.value = "gray90") +
theme_minimal() +
labs(fill = "Cluster", title = "World Map with Clusters") +
theme(panel.background = element_rect(fill = "white"))
From the map, we can derive conclusion that the countries from the first (relatively democratic) cluster are either located in Western Europe or were colonized by Western European countries in the past.
Let us proceed with PAM clustering.
PAM clustering
Derive optimal number of clusters
opt_md3<-fviz_nbclust(data_standart, FUNcluster=cluster::pam, method="silhouette")+theme_classic()
grid.arrange(opt_md3)
And confirm the result with gap statistic
opt_md4<-fviz_nbclust(data_standart, FUNcluster=cluster::pam, method="gap")+theme_classic()
grid.arrange(opt_md4)
According to silhouette method, optimal number of clusters is two. However, according to gap statistic, the optimal number is three.
PAM with 2 clusters
pam2cl<-eclust(data_standart, "pam", k=2, metric="euclidean")
Let us calculate silhouette.
fviz_silhouette(pam2cl)
## cluster size ave.sil.width
## 1 1 80 0.38
## 2 2 63 0.59
To find the most appropriate number of clusters we need to repeat the procedure with 3 clusters.
pam3cl<-eclust(data_standart, "pam", k=3, metric="euclidean")
fviz_silhouette(pam3cl)
## cluster size ave.sil.width
## 1 1 58 0.38
## 2 2 42 0.34
## 3 3 43 0.55
In case of two clusters, the average silhouette width is higher, which applies that utilizing two clusters might be more appropriate.
Let us add cluster assignemts into the table.
combined$clustersPAM<-pam2cl$clustering
Now, we can visualize the results on the world map.
country_clust2<-combined[, c(1,12)]
world_clustered2<-world %>%
left_join(country_clust2, by=c("name"="country"))
world_clustered2<-world_clustered2 %>%
mutate(clustersPAM=
ifelse(is.na(clustersPAM), "Missing data", as.factor(clustersPAM)))
ggplot(world_clustered2) +
geom_sf(aes(fill = as.factor(clustersPAM)), color = "black", size = 0.1) +
scale_fill_manual(values = c("1" = "lightblue", "2" = "salmon"), na.value = "gray90") +
theme_minimal() +
labs(fill = "Cluster", title = "World Map with Clusters") +
theme(panel.background = element_rect(fill = "white"))
As we can see, the cluster assignments do not differ significantly in case of k-means and PAM clustering.
Hierarchical clustering
Define the appropriate number of clusters or “tree cuts”.
opt_tree<-fviz_nbclust(data_standart, FUNcluster=hcut, method="gap")+theme_classic()
grid.arrange(opt_tree)
And compare the results with silhouette method
opt_tree_silh<-fviz_nbclust(data_standart, FUNcluster=hcut, method="silhouette")+theme_classic()
grid.arrange(opt_tree_silh)
According to the gap method. the most appropriate number of clusters is two, whereas silhouette method shows that the optimal number is three.
Data preparation
combined_h<-combined
rownames(combined_h)<-combined$country
combined_h<-combined_h[, -1]
data_scaled<-scale(combined_h)
Now, lets decide upon the most appropriate method by calculating agglomerative coefficients.
hc2<-agnes(data_scaled, method="complete")
hc3<-agnes(data_scaled, method="ward")
hc4<-agnes(data_scaled, method="single")
hc5<-agnes(data_scaled, method="average")
ac_vec<-c(hc2$ac, hc3$ac, hc4$ac, hc5$ac)
ac_vec<-as.data.frame(ac_vec)
rownames(ac_vec)<-c("complete", "ward", "single", "average")
ac_vec
## ac_vec
## complete 0.9580168
## ward 0.9843697
## single 0.9317682
## average 0.9543099
The Ward method gives the highest agglomerative coefficient, consequently we apply this method.
tree_ward<-pltree(hc3, cex=0.4, hang=-1, main="agnes-ward")
Let us cut the tree into 2 parts.
library('dendextend')
ward_cut<-cutree(as.hclust(hc3), k=2)
table(ward_cut)
## ward_cut
## 1 2
## 72 71
ward_cut_3cl<-cutree(as.hclust(hc3), k=3)
table(ward_cut_3cl)
## ward_cut_3cl
## 1 2 3
## 70 71 2
In case we apply 3 clusters, one of the clusters will have only 2 observations, which does not make much sense.
Cut tree dendrogram by 2 clusters.
plot(as.hclust(hc3), cex = 0.4)
rect.hclust(as.hclust(hc3), k = 2, border = 1:2)
To visualize the data, we need to add hierarchical clustering assignments.
combined$h_clusters<-ward_cut
country_clust_h<-combined[, c(1,13)]
world_clustered_h<-world %>%
left_join(country_clust_h, by=c("name"="country"))
world_clustered_h<-world_clustered_h %>%
mutate(h_clusters=
ifelse(is.na(h_clusters), "Missing data", as.factor(h_clusters)))
ggplot(world_clustered_h) +
geom_sf(aes(fill = as.factor(h_clusters)), color = "black", size = 0.1) +
scale_fill_manual(values = c("1" = "lightblue", "2" = "salmon"), na.value = "gray90") +
theme_minimal() +
labs(fill = "Cluster", title = "World Map with Clusters") +
theme(panel.background = element_rect(fill = "white"))
As we can clearly see, in case we utilize 2 clusters for each of three algorithms, we get almost the same cluster assignment.
Summary
To analyse the relationship between the level of democracy and GDP, three clustering methods were applied. In all three cases, the most appropriate number of clusters was two. Based on the conducted analysis, it is difficult to say whether such relationship exists. There is a possibility that changing the sample or applying different methods might clarify it.
Sources:
1.https://www.rug.nl/ggdc/productivity/pwt/?lang=en, Penn World tables. 2.https://v-dem.net/data/the-v-dem-dataset/, V-Dem data set.