Introduction

International Political Economy (IPE) can be defined as the field that focuses on a particular range of questions and on a series of assumptions about the nature of the international system and how it should be understood. Its core is this concentration on a set of questions, built around four main aspects: politics, economics, international politics, and international economics. IPE is also closely linked to foreign policy, although the impact of IPE on foreign policy is difficult to assess (Spence 1995).

Dataset preprocessing

The nature of the dataset

In order to provide a master dataset (Graham and Tucker 2016) that can be used by researchers in international political economy, 89 data resources have been merged. Observations are identified by country-year, the unit of analysis. Countries are identified by their Gleditsch-Ward number and, in an alternative version, by their Correlates of War (COW) code. Most of the dataset components begin after the Second World War. Each of the 89 source datasets has been given a unique suffix that identifies it in the column names.

Preprocessing

First the data is loaded, then it is converted to a tibble, as the tidyverse will be the main toolkit used in this project.

# load the merged master table; this creates the ipe_v4 object
load("data/master_ipe_v4.rdata")
# mapping from raw variable names to readable labels, used later for plot facets
ipe_mapping <- xlsx::read.xlsx2("data/ipe_mapping.xlsx", sheetIndex = 1)
ipe_v4 <- as_tibble(ipe_v4)

Dimensions of the dataset need to be investigated as well.

ipe_v4 %>% dim()

[1] 25850 1043

The master table consists of 25,850 rows and 1,043 variables. Each variable is described in the codebook provided by the dataset maintainers. The variables are divided into the following groups:

  • economic
  • political
  • social and cultural
  • geographic
  • other (e.g. infrastructure, military)
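Since each source dataset carries its own suffix (as described above), a quick sketch can tally how many columns come from each source. This assumes the suffix is always the text after the final underscore, which is a rough heuristic rather than a rule taken from the codebook:

# tally the columns of the master table by their source suffix
# (heuristic: take the text after the last underscore in each column name)
ipe_v4 %>% 
  names() %>% 
  stringr::str_extract("[^_]+$") %>% 
  table() %>% 
  sort(decreasing = TRUE) %>% 
  head(10)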

Data of very good quality is provided by the Varieties of Democracy dataset (VDEM) within the political section; it will be included not only for this reason but also because of its popularity and its recommendation during classes.

The other dataset chosen for this analysis is “Polity IV Democracy”.

pol_ds <- 
  ipe_v4 %>% 
  # Poland only, years after 1918, excluding the Second World War years and 2020
  filter(country == "Poland", year > 1918, 
         !(year %in% c(1939, 1940, 1941, 1942, 1943, 1944, 1945, 2020))) %>% 
  # keep the VDEM ("v2...") and Polity IV ("..._P4") columns,
  # minus a hand-picked set of identifiers and indices excluded from the analysis
  select(year, starts_with("v2"), contains("P4"), 
         -c(change_P4, sf_P4, fragment_P4, regtrans_P4, countryname_raw_P4, 
            v2x_hosinter_VDEM, v2x_suffr_VDEM, v2elsrgel_VDEM, v2elreggov_VDEM,
            v2xlg_leginter_VDEM, v2x_elecreg_VDEM, v2xlg_elecreg_VDEM)) 

In every modeling process we should perform exploratory data analysis before moving on. Thanks to the broad landscape of autoEDA packages (Staniak and Biecek 2019) we can easily build some intuition about the analyzed data. The chosen subset is a convenient starting point for analysis: only two columns, v2x_gender_VDEM and v2x_genpp_VDEM, contain a missing value. The summary graphs show the distribution of each variable. The year column confirms that there is no data between 1939 and 1948. Most of the variables are so-called indices with a range between 0 and 1; for the P4 variables the minimum value can be lower than zero. The durable_P4 variable is highly skewed, while v2xcl_rol_VDEM shows the opposite pattern, although there is a range of values with little to no observations.
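As a quick cross-check of the statement about missing data, a short dplyr sketch (assuming pol_ds has been built as above) counts the NAs per column and keeps only the affected ones:

# count missing values per column and keep only the columns that have any
pol_ds %>% 
  summarise(across(everything(), ~ sum(is.na(.x)))) %>% 
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>% 
  filter(n_missing > 0)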

dfSummary(pol_ds, plain.ascii = FALSE, style = "grid", 
          graph.magnif = 0.75, valid.col = FALSE, tmp.img.dir = "/tmp")
## temporary images written to 'E:\tmp'

Data Frame Summary

pol_ds
Dimensions: 89 x 61
Duplicates: 0

| No | Variable | Type | Stats / Values | Freqs (% of Valid) | Missing |
|----|----------|------|----------------|--------------------|---------|
| 1 | year | numeric | Mean (sd): 1970.8 (29); min < med < max: 1919 < 1973 < 2017; IQR (CV): 44 (0) | 89 distinct values | 0 (0.0%) |
| 2 | v2x_polyarchy_VDEM | numeric | Mean (sd): 0.4 (0.3); min < med < max: 0.2 < 0.3 < 0.9; IQR (CV): 0.7 (0.7) | 54 distinct values | 0 (0.0%) |
| 3 | v2x_api_VDEM | numeric | Mean (sd): 0.6 (0.3); min < med < max: 0.3 < 0.5 < 1; IQR (CV): 0.6 (0.4) | 53 distinct values | 0 (0.0%) |
| 4 | v2x_mpi_VDEM | numeric | Mean (sd): 0.3 (0.4); min < med < max: 0 < 0 < 0.9; IQR (CV): 0.8 (1.3) | 38 distinct values | 0 (0.0%) |
| 5 | v2x_EDcomp_thick_VDEM | numeric | Mean (sd): 0.5 (0.3); min < med < max: 0.2 < 0.3 < 0.9; IQR (CV): 0.6 (0.5) | 55 distinct values | 0 (0.0%) |
| 6 | v2x_libdem_VDEM | numeric | Mean (sd): 0.4 (0.3); min < med < max: 0.1 < 0.2 < 0.9; IQR (CV): 0.6 (0.8) | 58 distinct values | 0 (0.0%) |
| 7 | v2x_liberal_VDEM | numeric | Mean (sd): 0.6 (0.2); min < med < max: 0.3 < 0.6 < 1; IQR (CV): 0.5 (0.4) | 57 distinct values | 0 (0.0%) |
| 8 | v2x_partipdem_VDEM | numeric | Mean (sd): 0.3 (0.3); min < med < max: 0 < 0.1 < 0.7; IQR (CV): 0.5 (1) | 57 distinct values | 0 (0.0%) |
| 9 | v2x_partip_VDEM | numeric | Mean (sd): 0.3 (0.2); min < med < max: 0.1 < 0.3 < 0.7; IQR (CV): 0.5 (0.8) | 39 distinct values | 0 (0.0%) |
| 10 | v2x_delibdem_VDEM | numeric | Mean (sd): 0.3 (0.3); min < med < max: 0 < 0.2 < 0.8; IQR (CV): 0.6 (1) | 58 distinct values | 0 (0.0%) |
| 11 | v2xdl_delib_VDEM | numeric | Mean (sd): 0.4 (0.3); min < med < max: 0 < 0.4 < 0.9; IQR (CV): 0.7 (0.7) | 42 distinct values | 0 (0.0%) |
| 12 | v2x_egaldem_VDEM | numeric | Mean (sd): 0.4 (0.2); min < med < max: 0.2 < 0.2 < 0.8; IQR (CV): 0.5 (0.6) | 52 distinct values | 0 (0.0%) |
| 13 | v2x_egal_VDEM | numeric | Mean (sd): 0.8 (0.1); min < med < max: 0.6 < 0.8 < 0.9; IQR (CV): 0.1 (0.1) | 28 distinct values | 0 (0.0%) |
| 14 | v2x_frassoc_thick_VDEM | numeric | Mean (sd): 0.5 (0.4); min < med < max: 0.1 < 0.4 < 0.9; IQR (CV): 0.8 (0.8) | 52 distinct values | 0 (0.0%) |
| 15 | v2x_freexp_altinf_VDEM | numeric | Mean (sd): 0.5 (0.4); min < med < max: 0.1 < 0.6 < 1; IQR (CV): 0.8 (0.7) | 37 distinct values | 0 (0.0%) |
| 16 | v2xme_altinf_VDEM | numeric | Mean (sd): 0.5 (0.4); min < med < max: 0.1 < 0.5 < 0.9; IQR (CV): 0.8 (0.8) | 20 distinct values | 0 (0.0%) |
| 17 | v2xel_frefair_VDEM | numeric | Mean (sd): 0.5 (0.3); min < med < max: 0.2 < 0.3 < 1; IQR (CV): 0.7 (0.6) | 51 distinct values | 0 (0.0%) |
| 18 | v2x_elecoff_VDEM | numeric | Mean (sd): 0.9 (0.3); min < med < max: 0 < 1 < 1; IQR (CV): 0 (0.3) | 0.00: 9 (10.1%), 0.83: 4 (4.5%), 1.00: 76 (85.4%) | 0 (0.0%) |
| 19 | v2xcl_rol_VDEM | numeric | Mean (sd): 0.7 (0.2); min < med < max: 0.3 < 0.8 < 1; IQR (CV): 0.5 (0.3) | 46 distinct values | 0 (0.0%) |
| 20 | v2x_jucon_VDEM | numeric | Mean (sd): 0.7 (0.2); min < med < max: 0.4 < 0.7 < 1; IQR (CV): 0.4 (0.3) | 25 distinct values | 0 (0.0%) |
| 21 | v2xlg_legcon_VDEM | numeric | Mean (sd): 0.5 (0.3); min < med < max: 0.1 < 0.3 < 0.9; IQR (CV): 0.7 (0.7) | 27 distinct values | 0 (0.0%) |
| 22 | v2x_cspart_VDEM | numeric | Mean (sd): 0.5 (0.3); min < med < max: 0.1 < 0.4 < 0.9; IQR (CV): 0.7 (0.7) | 29 distinct values | 0 (0.0%) |
| 23 | v2xdd_dd_VDEM | numeric | Mean (sd): 0.1 (0.1); min < med < max: 0 < 0 < 0.2; IQR (CV): 0.1 (1.5) | 14 distinct values | 0 (0.0%) |
| 24 | v2xel_locelec_VDEM | numeric | Mean (sd): 0.4 (0.4); min < med < max: 0 < 0.3 < 1; IQR (CV): 0.9 (1) | 14 distinct values | 0 (0.0%) |
| 25 | v2xel_regelec_VDEM | numeric | Mean (sd): 0.2 (0.3); min < med < max: 0 < 0.1 < 0.9; IQR (CV): 0.2 (1.4) | 19 distinct values | 0 (0.0%) |
| 26 | v2xeg_eqprotec_VDEM | numeric | Mean (sd): 0.7 (0.1); min < med < max: 0.6 < 0.7 < 0.9; IQR (CV): 0.1 (0.2) | 12 distinct values | 0 (0.0%) |
| 27 | v2xeg_eqdr_VDEM | numeric | Mean (sd): 0.8 (0.1); min < med < max: 0.5 < 0.9 < 0.9; IQR (CV): 0 (0.2) | 22 distinct values | 0 (0.0%) |
| 28 | v2xcs_ccsi_VDEM | numeric | Mean (sd): 0.5 (0.3); min < med < max: 0.1 < 0.4 < 0.9; IQR (CV): 0.7 (0.6) | 29 distinct values | 0 (0.0%) |
| 29 | v2xps_party_VDEM | numeric | Mean (sd): 0.8 (0.1); min < med < max: 0.6 < 0.7 < 0.9; IQR (CV): 0.2 (0.1) | 25 distinct values | 0 (0.0%) |
| 30 | v2x_gender_VDEM | numeric | Mean (sd): 0.7 (0.1); min < med < max: 0.5 < 0.7 < 1; IQR (CV): 0.3 (0.2) | 49 distinct values | 1 (1.1%) |
| 31 | v2x_gencl_VDEM | numeric | Mean (sd): 0.7 (0.2); min < med < max: 0.4 < 0.8 < 0.9; IQR (CV): 0.3 (0.2) | 28 distinct values | 0 (0.0%) |
| 32 | v2x_gencs_VDEM | numeric | Mean (sd): 0.6 (0.2); min < med < max: 0.5 < 0.5 < 0.9; IQR (CV): 0.4 (0.3) | 30 distinct values | 0 (0.0%) |
| 33 | v2x_genpp_VDEM | numeric | Mean (sd): 0.8 (0.2); min < med < max: 0.5 < 0.9 < 1; IQR (CV): 0.3 (0.2) | 28 distinct values | 1 (1.1%) |
| 34 | v2xex_elecreg_VDEM | integer | Min: 0; Mean: 0.3; Max: 1 | 0: 61 (68.5%), 1: 28 (31.5%) | 0 (0.0%) |
| 35 | v2xel_elecparl_VDEM | integer | Min: 0; Mean: 0.3; Max: 1 | 0: 65 (73.0%), 1: 24 (27.0%) | 0 (0.0%) |
| 36 | v2xel_elecpres_VDEM | integer | Min: 0; Mean: 0.1; Max: 1 | 0: 83 (93.3%), 1: 6 (6.7%) | 0 (0.0%) |
| 37 | v2x_corr_VDEM | numeric | Mean (sd): 0.2 (0); min < med < max: 0.1 < 0.2 < 0.2; IQR (CV): 0.1 (0.3) | 31 distinct values | 0 (0.0%) |
| 38 | v2x_pubcorr_VDEM | numeric | Mean (sd): 0.2 (0.1); min < med < max: 0.1 < 0.2 < 0.4; IQR (CV): 0.2 (0.3) | 23 distinct values | 0 (0.0%) |
| 39 | v2x_execorr_VDEM | numeric | Mean (sd): 0.1 (0); min < med < max: 0.1 < 0.2 < 0.2; IQR (CV): 0.1 (0.3) | 21 distinct values | 0 (0.0%) |
| 40 | v2x_divparctrl_VDEM | numeric | Mean (sd): -0.4 (1.1); min < med < max: -1.6 < -0.3 < 1.3; IQR (CV): 2.2 (-2.6) | 23 distinct values | 0 (0.0%) |
| 41 | v2x_feduni_VDEM | numeric | Mean (sd): 0.3 (0.4); min < med < max: 0 < 0.1 < 0.9; IQR (CV): 0.4 (1.1) | 19 distinct values | 0 (0.0%) |
| 42 | v2x_civlib_VDEM | numeric | Mean (sd): 0.7 (0.2); min < med < max: 0.3 < 0.7 < 1; IQR (CV): 0.5 (0.4) | 54 distinct values | 0 (0.0%) |
| 43 | v2x_clphy_VDEM | numeric | Mean (sd): 0.7 (0.2); min < med < max: 0.3 < 0.8 < 1; IQR (CV): 0.4 (0.3) | 27 distinct values | 0 (0.0%) |
| 44 | v2x_clpol_VDEM | numeric | Mean (sd): 0.5 (0.3); min < med < max: 0.1 < 0.6 < 1; IQR (CV): 0.8 (0.6) | 43 distinct values | 0 (0.0%) |
| 45 | v2x_clpriv_VDEM | numeric | Mean (sd): 0.7 (0.2); min < med < max: 0.3 < 0.8 < 0.9; IQR (CV): 0.3 (0.3) | 37 distinct values | 0 (0.0%) |
| 46 | v2cltrnslw_VDEM | numeric | Mean (sd): 1.1 (0.7); min < med < max: 0.1 < 1.2 < 2.1; IQR (CV): 1.1 (0.6) | 14 distinct values | 0 (0.0%) |
| 47 | v2juhcind_VDEM | numeric | Mean (sd): 0.9 (1.1); min < med < max: -0.6 < 0.8 < 2.6; IQR (CV): 2.6 (1.2) | 19 distinct values | 0 (0.0%) |
| 48 | v2juhccomp_VDEM | numeric | Mean (sd): 1.3 (0.6); min < med < max: 0.2 < 1 < 2.2; IQR (CV): 1 (0.4) | 14 distinct values | 0 (0.0%) |
| 49 | v2excrptps_VDEM | numeric | Mean (sd): 0.7 (0.3); min < med < max: 0.2 < 0.9 < 1.3; IQR (CV): 0.6 (0.5) | 18 distinct values | 0 (0.0%) |
| 50 | v2jucorrdc_VDEM | numeric | Mean (sd): 1.9 (0.5); min < med < max: 1.3 < 2.2 < 2.5; IQR (CV): 1.1 (0.3) | 9 distinct values | 0 (0.0%) |
| 51 | v2exbribe_VDEM | numeric | Mean (sd): 1.4 (0.4); min < med < max: 0.9 < 1.3 < 2.3; IQR (CV): 0.8 (0.3) | 18 distinct values | 0 (0.0%) |
| 52 | v2lgcrrpt_VDEM | numeric | Mean (sd): 0.8 (0.2); min < med < max: 0.2 < 0.8 < 1.4; IQR (CV): 0.2 (0.3) | 13 distinct values | 0 (0.0%) |
| 53 | democ_P4 | numeric | Mean (sd): 3.8 (4.4); min < med < max: 0 < 2 < 10; IQR (CV): 9 (1.1) | 0: 44 (49.4%), 2: 9 (10.1%), 5: 2 (2.2%), 8: 11 (12.4%), 9: 7 (7.9%), 10: 16 (18.0%) | 0 (0.0%) |
| 54 | autoc_P4 | numeric | Mean (sd): 3.9 (3.3); min < med < max: 0 < 5 < 8; IQR (CV): 7 (0.8) | 0: 36 (40.4%), 5: 9 (10.1%), 6: 7 (7.9%), 7: 35 (39.3%), 8: 2 (2.2%) | 0 (0.0%) |
| 55 | polity_P4 | numeric | Mean (sd): -0.1 (7.6); min < med < max: -8 < -3 < 10; IQR (CV): 16 (-75.3) | -8: 2 (2.2%), -7: 35 (39.3%), -6: 7 (7.9%), -3: 9 (10.1%), 5: 2 (2.2%), 8: 11 (12.4%), 9: 7 (7.9%), 10: 16 (18.0%) | 0 (0.0%) |
| 56 | polity2_P4 | numeric | Mean (sd): -0.1 (7.6); min < med < max: -8 < -3 < 10; IQR (CV): 16 (-75.3) | -8: 2 (2.2%), -7: 35 (39.3%), -6: 7 (7.9%), -3: 9 (10.1%), 5: 2 (2.2%), 8: 11 (12.4%), 9: 7 (7.9%), 10: 16 (18.0%) | 0 (0.0%) |
| 57 | durable_P4 | numeric | Mean (sd): 14.4 (11.6); min < med < max: 0 < 12 < 41; IQR (CV): 19 (0.8) | 42 distinct values | 0 (0.0%) |
| 58 | xconst_P4 | numeric | Mean (sd): 4.7 (1.9); min < med < max: 2 < 4 < 7; IQR (CV): 4 (0.4) | 2: 2 (2.2%), 3: 42 (47.2%), 4: 2 (2.2%), 5: 9 (10.1%), 6: 4 (4.5%), 7: 30 (33.7%) | 0 (0.0%) |
| 59 | parreg_P4 | numeric | Mean (sd): 3.7 (1); min < med < max: 2 < 4 < 5; IQR (CV): 0 (0.3) | 2: 20 (22.5%), 4: 53 (59.6%), 5: 16 (18.0%) | 0 (0.0%) |
| 60 | parcomp_P4 | numeric | Mean (sd): 2.5 (1.6); min < med < max: 1 < 2 < 5; IQR (CV): 3 (0.6) | 1: 37 (41.6%), 2: 16 (18.0%), 3: 7 (7.9%), 4: 13 (14.6%), 5: 16 (18.0%) | 0 (0.0%) |
| 61 | polcomp_P4 | numeric | Mean (sd): 4.4 (3.9); min < med < max: 1 < 2 < 10; IQR (CV): 8 (0.9) | 1: 37 (41.6%), 2: 16 (18.0%), 7: 7 (7.9%), 9: 13 (14.6%), 10: 16 (18.0%) | 0 (0.0%) |

Clustering

Clustering is an unsupervised learning technique that uses machine learning algorithms to assign similar observations to groups. It is mostly used for discovering hidden patterns in data. The main idea is that similar observations are grouped together, and the resulting groups should be as distinct from one another as possible. In an ideal world, the results obtained from clustering should not only have good statistical properties (compact, well-separated, connected and stable) but also yield output that is meaningful for the data at hand.

Clusterability of the dataset

The data needs to be scaled before further analysis. In addition, the missing values should be filled using a chosen technique; here they are filled by carrying the last observation forward (tidyr::fill).

# fill the two columns containing NAs (last observation carried forward), then standardise
pol_ds_scaled <- pol_ds %>% fill(v2x_gender_VDEM, v2x_genpp_VDEM) %>% scale() %>% as_tibble()

The first step of this analysis is to assess the clusterability of the dataset. The agnes function provides the agglomerative coefficient, which measures the amount of clustering structure found, plus an easy way to visualize the banner. A coefficient close to one suggests a strong clustering structure in the dataset.

hc2 <- agnes(pol_ds_scaled)
print(hc2$ac)

[1] 0.9184494

The value shows that the chosen dataset is a very good starting point for clustering analysis.

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
  agnes(pol_ds_scaled, method = x)$ac
}
map_dbl(m, ac)

  average    single  complete      ward 
0.9184494 0.8827847 0.9333211 0.9850385 

Other methods also yield very promising results.

Optimal number of clusters

In order to keep homogeneity high within clusters and heterogeneity high between them, an appropriate number of clusters should be chosen. A very simple rule of thumb is k ≈ sqrt(n/2), where n is the number of observations and k the suggested number of clusters.

sqrt((pol_ds_scaled %>% nrow()) / 2)

[1] 6.670832

We now move on to the elbow method.

opt <- Optimal_Clusters_KMeans(pol_ds_scaled, max_clusters=10, plot_clusters = TRUE)

The elbow method, which plots within-group heterogeneity against the number of clusters, suggests that 6 clusters is an appropriate choice; others may argue for 3, the point where the roughly linear decrease begins.

# same search, but using the silhouette criterion instead of the default one
opt <- Optimal_Clusters_KMeans(pol_ds_scaled, 
                               max_clusters = 10, 
                               plot_clusters = TRUE, 
                               criterion = 'silhouette')

The second method uses the silhouette index to find the best number of groups in the dataset; it shows the maximum value for 2 clusters and the lowest for 6. For each observation the silhouette compares the average distance to points in its own cluster, a(i), with the average distance to points in the nearest other cluster, b(i): s(i) = (b(i) - a(i)) / max(a(i), b(i)).

The methods presented above are not very formal, since they rely on a graphical reading of the results. One way to address this drawback is a statistical procedure called the gap statistic (Tibshirani, Walther, and Hastie 2001). The idea is to compare log(W_k), where W_k is the pooled within-cluster sum of squares for k clusters, with its expectation under an appropriate null reference distribution of the data: Gap_n(k) = E*_n[log(W_k)] - log(W_k). The estimated optimal number of clusters is the value of k for which log(W_k) falls the farthest below the reference curve, i.e. where the gap is largest (possibly adjusted for its standard error). One of the assumptions of this approach is that the data can be separated reasonably well.

clusGapRes <- clusGap(pol_ds_scaled, FUN = kmeans, nstart = 20, K.max = 20, B = 60)

The number of clusters proposed by this more formal method is 13, reported as “Number of clusters (method ‘firstSEmax’, SE.factor=1): 13”. Mainly because of this more principled methodology, 13 clusters are used in the further analysis.

chosen_size <- 13

Partition-based clustering assumes that each cluster can be represented by the centre of its data points. The main advantages of this kind of algorithm are low time complexity and generally good results. In terms of drawbacks we can list the following points:

  • they do not work well with non-convex (arbitrarily shaped) clusters,
  • they are sensitive to outliers,
  • they can get stuck in a local optimum,
  • the number of clusters must be specified before running the algorithm,
  • the results are very sensitive to that chosen number of clusters.

K-means

Definition of K-means

K-means is probably the most widely used algorithm for unsupervised learning. The number of clusters must be specified up front. The goal is straightforward: minimize the differences within clusters and maximize the differences between clusters. The algorithm works as follows: each point is assigned to a cluster, and the assignments are then updated by adjusting the cluster boundaries according to the examples that currently fall into each cluster. Points are assigned to clusters based on a chosen distance metric, after which the centres are recalculated. The process repeats until no further improvement can be made. A refinement of this algorithm is K-medoids, which uses actual data points as the representatives of the corresponding clusters and therefore also handles discrete data.
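To make the description concrete, here is a minimal from-scratch sketch of Lloyd's iterations. It is for illustration only; the actual clustering below uses stats::kmeans, and this sketch does not handle empty clusters:

# minimal k-means (Lloyd's algorithm): assign each point to its nearest centre,
# recompute the centres, and repeat until the assignments stop changing
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]      # random initial centres
  assignment <- rep(0L, nrow(X))
  for (i in seq_len(max_iter)) {
    # squared Euclidean distances of every point to every centre (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2))
    new_assignment <- max.col(-d2)                       # index of the nearest centre
    if (all(new_assignment == assignment)) break         # converged
    assignment <- new_assignment
    for (j in seq_len(k)) {                              # recompute the centres
      centers[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centers)
}

Running simple_kmeans(pol_ds_scaled, chosen_size) should give a partition comparable, up to label permutation and local optima, to the kmeans call below.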

Clustering

pol_clusters <- kmeans(pol_ds_scaled %>% as.matrix(), chosen_size)
pol_ds$kmeans <- pol_clusters$cluster
fviz_cluster(pol_clusters, data = pol_ds_scaled)

Evaluation indicators

One metric for evaluating a k-means clustering is the ratio of the between-cluster sum of squares to the total sum of squares:

(pol_clusters$betweenss /pol_clusters$totss)*100

[1] 94.50958

Around 94.5% of the total variance in the data is explained by the clustering (between_SS / total_SS); the within-cluster sum of squares accounts for the remaining roughly 5.5%.
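The same ratio can be recomputed by hand, which makes explicit what totss and betweenss are; this is a sketch that assumes pol_clusters from the chunk above:

# total sum of squares: squared deviations of every value from its column mean
totss <- sum(scale(pol_ds_scaled, scale = FALSE)^2)
# within-cluster sum of squares, accumulated over the 13 clusters
withinss <- sum(sapply(split(as.data.frame(pol_ds_scaled), pol_clusters$cluster),
                       function(d) sum(scale(d, scale = FALSE)^2)))
# between-cluster SS is the difference; the ratio matches betweenss / totss above
(totss - withinss) / totss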

sil <- silhouette(x = pol_ds$kmeans, dist(pol_ds_scaled))
fviz_silhouette(sil)

   cluster size ave.sil.width
1        1    7          0.73
2        2    4          0.22
3        3    4          0.25
4        4    3          0.19
5        5    4          0.63
6        6    9          0.51
7        7    8          0.41
8        8   12          0.28
9        9    1          0.00
10      10   10          0.57
11      11    4          0.60
12      12    7          0.69
13      13   16          0.37

PAM

Definition of PAM

PAM (Partitioning Around Medoids) focuses on objects that are centrally located in a cluster. The procedure chooses a set of medoids and then iteratively replaces one of the medoids with one of the non-medoids, checking whether the swap improves the total distance of the resulting clustering. The algorithm stops when no further improving swap exists.
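A rough sketch of that swap logic follows; it is purely illustrative, not the cluster::pam implementation used below, and the starting medoids are arbitrary row indices:

# clustering cost of a medoid set: sum of each point's distance to its nearest medoid
pam_cost <- function(D, medoids) sum(apply(D[, medoids, drop = FALSE], 1, min))

# accept a swap of medoid m for non-medoid o only if it lowers the total cost
try_swap <- function(D, medoids, m, o) {
  candidate <- replace(medoids, medoids == m, o)
  if (pam_cost(D, candidate) < pam_cost(D, medoids)) candidate else medoids
}

D <- as.matrix(dist(pol_ds_scaled))   # pairwise distances between observations
medoids <- c(1, 30, 60)               # arbitrary starting medoids (row indices)
medoids <- try_swap(D, medoids, m = 30, o = 45)
pam_cost(D, medoids)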

Clustering

c1<-pam(pol_ds_scaled, chosen_size)
pol_ds$pam <- c1$clustering
fviz_cluster(c1, geom="point", ellipse.type="norm") 
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

CLARA

Definition of CLARA

CLARA draws multiple samples of the dataset, applies PAM to each sample, and keeps the best resulting set of medoids. This sidesteps the main weakness of the PAM procedure, its poor scalability to larger datasets.

Clustering

clara_flex <- eclust(pol_ds_scaled %>% as.matrix(), "clara", k=chosen_size) 

pol_ds$clara <- clara_flex$clustering

After assigning the labels we can inspect the resulting clusters.

fviz_cluster(clara_flex, geom="point", ellipse.type="norm") 
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

Hierarchical clustering

The main advantage of hierarchical clustering over k-means is that there is no need to explicitly specify the number of clusters in advance. The algorithm proceeds iteratively until there is just a single cluster (or, for the divisive variant, until every observation is its own cluster). The result is usually visualized with a dendrogram, a tree structure. Two main types of hierarchical clustering are widely discussed: agglomerative (bottom-up) and divisive (top-down). These algorithms work well with arbitrary cluster shapes and data types, and they are often able to detect hierarchical relationships, which can be useful in disciplines such as biology or economics. Unfortunately, they do not scale efficiently to large datasets.

Agglomerative Nesting

hc3 <- agnes(pol_ds_scaled, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "dendrogram - agnes")

In order to perform hierarchical clustering, the Euclidean distance matrix is calculated first and the dendrogram is then cut at k = chosen_size (13 clusters). The chosen linkage is Ward's method, which is usually a reasonable default.

d_pol_ds <- dist(pol_ds_scaled, method = "euclidean")
hc5 <- hclust(d_pol_ds, method = "ward.D2" )
sub_grp <- cutree(hc5, k = chosen_size)

pol_ds <- 
  pol_ds %>%
  mutate(hier_ward_D2 = sub_grp) 
fviz_cluster(list(data = pol_ds_scaled, cluster = pol_ds$hier_ward_D2))

Divisive clustering

hc4 <- diana(pol_ds_scaled)
pltree(hc4, cex = 0.6, hang = -1, main = "dendrogram - diana")

sub_grp <- cutree(hc4, k = chosen_size)

pol_ds <- 
  pol_ds %>%
  mutate(hier_diana = sub_grp) 

As in the agglomerative case, the dendrogram is cut at k = chosen_size (13 clusters); DIANA works in the opposite direction, starting from one all-encompassing cluster and splitting it top-down.

Evaluation indicators

Having discussed the best-known methods, other types of cluster validation can be considered as well, such as the “internal” and “stability” measures from the clValid package (Brock et al. 2008). The internal measures quantify the compactness, connectedness, and separation of the cluster partitions. Connectedness is evaluated by checking how often nearest neighbours are assigned to the same cluster. Compactness relates to within-cluster homogeneity, often measured by the intra-cluster variance. Separation is typically measured by the distance between cluster centroids. The Dunn index and the silhouette width are non-linear combinations of compactness and separation. Stability is evaluated as the consistency of the clustering when it is compared with the clusterings obtained after removing each column, one at a time.
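As a sanity check on what the Dunn index measures, it can be computed by hand for one partition: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. This sketch uses the k-means labels assigned earlier:

# Dunn index by hand: min inter-cluster distance / max intra-cluster diameter
D  <- as.matrix(dist(pol_ds_scaled))
cl <- pol_ds$kmeans
same <- outer(cl, cl, "==")              # TRUE where two observations share a cluster
diag(D) <- NA                            # drop self-distances
min_separation <- min(D[!same], na.rm = TRUE)
max_diameter   <- max(D[same],  na.rm = TRUE)
min_separation / max_diameter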

intern <- clValid(pol_ds_scaled %>% as.matrix(), 
                  chosen_size, 
                  clMethods = 
                    c("hierarchical", "kmeans", "diana", "agnes", "pam",
                      "clara", "model"), 
                  validation = "internal"
                  )
intern@measures %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything()) %>% 
  separate(name, into = c("number", "model_name"), sep="[.]") %>% 
  select(model_name, value) %>% 
  mutate( measure_id = 0:20 %/% 7) %>% 
  inner_join(tibble(measure_id = 0:2, measure_name = c("Connectivity", "Dunn", "Silhouette"))) %>%
  select(-measure_id) %>% 
  pivot_wider(names_from = measure_name, values_from = value) %>% 
  kable_head()
## Joining, by = "measure_id"
[1] "
model_name Connectivity Dunn Silhouette
hierarchical 39.46508 0.4966301 0.4219525
kmeans 44.47103 0.1928479 0.4250111
diana 42.39206 0.4604724 0.4176306
agnes 39.46508 0.4966301 0.4219525
pam 50.07302 0.1536358 0.4156764
clara 54.58730 0.1536358 0.4130862
model 95.48611 0.0322389 0.1411220

"

Connectivity should be minimized; it is lowest for hierarchical clustering and agnes (which give identical results here), with diana close behind, and highest for PAM, CLARA and especially model-based clustering. The silhouette width is lowest for the model-based algorithm and very similar across the remaining methods, with kmeans marginally highest. The Dunn index also favours the hierarchical family. Based on these measures a hierarchical algorithm is a sound choice, and diana is used in the further analysis.

Choosing appropriate names for clusters

Based on the per-cluster results we can try to name the formed clusters, interpreting the values in terms of international political economy.

pol_ds_long <- 
  pol_ds %>% 
  select(-year, -kmeans, -pam,
  -clara, -hier_ward_D2) %>% 
  select(starts_with("v2x_"), hier_diana) %>% 
  pivot_longer(cols = c(-hier_diana)) %>% 
  inner_join(ipe_mapping, by = c("name" = "variable"))
 
pol_ds_long %>% 
  ggplot(aes(x = factor(hier_diana), 
             y = value, 
             fill= factor(hier_diana))) +
  geom_boxplot() +
  facet_wrap(~pretty_name, ncol = 3, shrink = FALSE, scales = "free") +
  labs(title = 'Boxplots per variable per cluster') +
  xlab("Cluster") + 
  ylab("Value") +
  theme(legend.position = "none")

Describing clusters with so many variables can be complicated. The first easily visible insight is that cluster 9 has the highest IQR across most of the dimensions. Clusters 9 through 13 are also very similar to each other, which might suggest that 13 clusters somewhat overfits the data.

pol_ds_long <-
    pol_ds %>% 
    select(-year, -kmeans, -pam,
           -clara, -hier_ward_D2) %>% 
    select(-contains("v2"), hier_diana) %>% 
    pivot_longer(cols = c(-hier_diana)) %>% 
    inner_join(ipe_mapping, by = c("name" = "variable"))
 
pol_ds_long %>% 
  ggplot(aes(x = factor(hier_diana), 
             y = value, 
             fill= factor(hier_diana))) +
  geom_boxplot() +
  facet_wrap(~pretty_name, ncol = 5, shrink = FALSE, scales = "free") +
  labs(title = 'Boxplots per variable per cluster') +
  xlab("Cluster") + 
  ylab("Value") +
  theme(legend.position = "none")

Below is the list of years falling into each cluster, which can be used to evaluate the clustering quality against political-economy knowledge.

pol_ds %>% 
  group_by(hier_diana) %>% 
  mutate(year = paste0(year, collapse=",")) %>% 
  slice(1) %>% 
  select(year) %>% 
  kable_head()
## Adding missing grouping variables: `hier_diana`
[1] "
hier_diana year
1 1919,1920,1921,1922,1923,1924,1925
2 1926,1927,1928,1929,1930,1931,1932,1933,1934
3 1935,1936,1937,1938
4 1949,1950,1951,1953,1954,1955,1956
5 1952
6 1957,1961,1965,1969,1972,1976,1980,1985
7 1958,1959,1960,1962,1963,1964,1966,1967,1968,1970,1971,1973,1974,1975,1977,1978,1979,1981,1982,1983,1984,1986,1987,1988
8 1989
9 1990,1995
10 1991,1992,1993,1994,1996,1997,1998,1999,2001
11 2000,2005,2010,2015
12 2002,2003,2004,2006,2007,2008,2009,2011,2012,2013,2014
13 2016,2017

"

A short description of each cluster, based on its years and the per-variable box plots:

  1. First cluster: observations that mostly correspond to the period just after the First World War. Regulation of participation is very low; since regulated participation lies at the heart of democracy, this indicates that binding rules for when and how political preferences could be expressed were largely absent (Sandberg 2012).
  2. Second cluster: a period of major political developments in Poland: Piłsudski resigned from office, and the May Coup overthrew the democratically elected President Stanisław Wojciechowski and Prime Minister Wincenty Witos. What distinguishes this cluster is the elected executive index, which should not on its own be read as the level of democracy; it is rather used to aggregate higher-order indices and tries to measure whether the chief executive was elected directly or indirectly. Given the events above, it is no surprise that this metric is rather low.
  3. Third cluster: the years just before the Second World War, when the new Polish Constitution was adopted and Piłsudski died. Less easy to spot here, but noteworthy, is the divided party control index (Mayhew 1991).
  4. Fourth cluster: seven years that were important for Poland's defence minister, Konstanty Rokossowski. The women's political participation index has a large interquartile range, which may stem from Stalinist employment policies that aimed at the equality of the sexes (Fidelis 2004).
  5. Fifth cluster: a single year whose politics were marked by the nearly 50,000 political prisoners held at the time; this is also visible in the boxplot of the physical violence index, which is very low and indicates that citizens were not really safe.
  6. Sixth cluster: one of the most heterogeneous clusters in terms of years, with observations from four different decades. All dimensions except the corruption-related ones and regime durability have a very low IQR.
  7. Seventh cluster: the second very heterogeneous cluster and the one with the highest number of observations. It is similar to the sixth cluster, with medians slightly below those of cluster six.
  8. Eighth cluster: the Round Table Talks, which led to political and economic reforms. Since it contains only one observation, boxplots do not bring much value here. It can be described as a vertical line dividing Polish history into two separate periods, which is essentially true.
  9. Ninth cluster: in 1990 Lech Wałęsa won the presidential election, and in January 1995 Wałęsa demanded Pawlak's resignation. This is the cluster with the highest IQR across most of the dimensions.
  10. Tenth cluster: the years after the Round Table Talks, excluding presidential election years. The deliberative democracy index has its highest values here.
  11. Eleventh cluster: presidential election years. This is the only group of observations where the quartiles of political competition differ.
  12. Twelfth cluster: the presidencies of Aleksander Kwaśniewski, Lech Kaczyński, and Bronisław Komorowski. The egalitarian component index clearly shows that resources were distributed relatively equally across society in these years.
  13. Thirteenth cluster: Andrzej Duda's presidency and the PiS single-party government. Even so, the women's political participation index shows very high values.

Some of the other clustering methods

Gaussian mixture models

The main idea behind this algorithm is that the points are generated by a mixture of several Gaussian distributions. The main advantage is that it rests on a proper mathematical theory, which yields realistic probabilities of cluster membership. Among the drawbacks, we cannot ignore that many parameters need to be specified for these algorithms.
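A minimal sketch with the mclust package (an assumption: it is not used elsewhere in this report) shows how such a mixture could be fitted, with the number of components and the covariance structure selected by BIC:

library(mclust)

# fit Gaussian mixtures with 1 to 13 components; the best model is chosen by BIC
gmm_fit <- Mclust(pol_ds_scaled, G = 1:chosen_size)
summary(gmm_fit)
head(gmm_fit$z)   # soft assignments: probability that each year belongs to each component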

Density based algorithms

The main idea behind this kind of algorithm is that clusters are formed from regions of high point density. DBSCAN is the most widely discussed algorithm of this type and takes two parameters: the radius of the neighbourhood (eps) and the minimum number of points that must fall within that neighbourhood (minPts).
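A short sketch with the dbscan package illustrates the two parameters; the eps and minPts values below are arbitrary guesses that would need tuning, for example with a k-nearest-neighbour distance plot:

library(dbscan)

# the "knee" of the kNN distance plot is a common heuristic for choosing eps
kNNdistplot(as.matrix(pol_ds_scaled), k = 5)

# eps = neighbourhood radius, minPts = minimum number of points in that neighbourhood
db <- dbscan(as.matrix(pol_ds_scaled), eps = 3, minPts = 5)
table(db$cluster)   # cluster 0 collects the observations labelled as noise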

Model Based Clustering

The most important characteristic of these algorithms is that a separate, best-fitting model is estimated for each cluster. Two main kinds of model-based clustering can be distinguished: those based on statistical learning (COBWEB, GMM) and those based on neural network learning (SOM and ART).
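For the neural-network branch, a self-organising map can be sketched with the kohonen package (again an assumption, as SOMs are not fitted elsewhere in this report); the grid size is an arbitrary choice for illustration:

library(kohonen)

# train a small self-organising map on the scaled data
som_grid <- somgrid(xdim = 4, ydim = 4, topo = "hexagonal")
som_fit  <- som(as.matrix(pol_ds_scaled), grid = som_grid, rlen = 500)
plot(som_fit, type = "mapping")   # which observations land on which map node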

Dimensionality Reduction

The idea behind dimensionality reduction is to represent a dataset with fewer dimensions in order to make it more interpretable, easier to visualize, and less computationally heavy. In a dataset with many variables the features may be correlated with each other, so a variable can be expressed as a combination of other features. Importantly, the new features should not be correlated among themselves.

The most common approach for reducing the number of features is the following:

  1. Decide on data standardization.
  2. Compute the covariance matrix over all pairs of variables.
  3. Compute the eigenvectors and eigenvalues of that matrix.
  4. Sort them and keep the top eigenvectors and eigenvalues.
  5. The new principal components are linear combinations of the original variables constructed to capture the highest possible variance.
  6. Build the projection matrix with the chosen eigenvectors as columns.
  7. Recast the data along the new principal axes.
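These steps can be reproduced by hand on the already standardised data and compared with the prcomp result used below; this is only a didactic sketch:

# steps 2-4: covariance matrix of the standardised data and its eigendecomposition
C   <- cov(as.matrix(pol_ds_scaled))
eig <- eigen(C)

# proportion of variance captured by the first few principal components
round(eig$values / sum(eig$values), 3)[1:5]

# steps 6-7: project the data onto the first two eigenvectors (principal axes)
scores <- as.matrix(pol_ds_scaled) %*% eig$vectors[, 1:2]
head(scores)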

Checking correlations

pol_ds_scaled_cor <- cor(pol_ds_scaled, method="pearson")
corrplot(pol_ds_scaled_cor, order ="alphabet", tl.cex=0.6)

Although there are many variables on the plot, we can clearly see that a lot of them are strongly correlated, either positively or negatively.

MDS reduction

As our data relates to the social sciences we can definitely start with multidimensional scaling (MDS). By definition, this is a nonlinear statistical technique that originated in psychometrics. MDS reduces the number of dimensions while preserving the distances between the points; preserving distances also preserves patterns and clusters. The measure used in this method, called a proximity, is a number that indicates how similar two objects are, or are perceived to be.

Classification of MDS

Scaling models can be broadly classified into metric versus non-metric MDS, and into strain-based (classical scaling) versus stress-based (distance scaling) MDS models.

Applications

As MDS is one of the most widely used techniques in multivariate data analysis, it is of course not possible to list all of its uses. Typical applications include image processing and preprocessing for algorithms that rely on Euclidean distances.

dist_pol_ds_scaled <- dist(pol_ds_scaled)   # as input we need distances between the units (country-years)
mds3 <- cmdscale(dist_pol_ds_scaled, k = 2) # k - the maximum dimension of the space
plot(mds3, type = 'n')                      # empty plot; labels are added below
pointLabel(mds3, labels = pol_ds %>% mutate(year = as.character(year)) %>% pull(year), cex = 0.6, adj = 0.5)

The smacof package allows for different measurement scales of the data, such as ratio, interval, ordinal and mspline.

dis2<-sim2diss(pol_ds_scaled_cor, method=1, to.dist=TRUE)
fit.data<-mds(dis2, ndim=2,  type="ratio") 
plot(fit.data, pch=21, cex=as.numeric(fit.data$spp), bg="red")

Measuring goodness of MDS

In order to measure how good the MDS solution is, the stress function can be used.

fit.data$stress

[1] 0.1027908

Benchmark values have been proposed for judging the goodness of fit; a stress of about 0.1, as in our case, indicates a fair result. The stress value should be minimized, so different MDS types can be compared.

dis2<-sim2diss(pol_ds_scaled_cor, method=1, to.dist=TRUE)
fit.data<-mds(dis2, ndim=2,  type="ordinal") # from smacof::
fit.data$stress

[1] 0.08331801

The smallest value is yielded by the ordinal method, a non-metric type of MDS in which only the order relations between the dissimilarities matter.

Principal Component Analysis

xxx.pca1<-prcomp(pol_ds_scaled, center=FALSE, scale.=FALSE) 

For explaining 99% of the variance we need to take only the first two Principal Components.

xxx.pca2<-princomp(pol_ds_scaled) # stats::princomp()
plot(xxx.pca2)# the same will be plot(xxx.pca1)

fviz_pca_var(xxx.pca1, col.var="steelblue")

Based on this plot we can build the following intuition: parreg_P4 points in the opposite direction to v2x_divparctrl_VDEM, i.e. the two are negatively correlated.

fviz_eig(xxx.pca1, choice='eigenvalue') # eigenvalues on y-axis

fviz_eig(xxx.pca1) # percentage of explained variance on y-axis

var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')

Rotated PCA allows an easier interpretation of the factors used in the analysis, because rotation changes the structure of the components. The components are rotated about the factor axes; the most commonly used option is varimax, which minimizes the number of variables needed to explain a given factor.

xxx.pca4<-principal(pol_ds_scaled, nfactors=2, rotate="varimax")
summary(xxx.pca4)

Factor analysis with Call: principal(r = pol_ds_scaled, nfactors = 2, rotate = “varimax”)

Test of the hypothesis that 2 factors are sufficient. The degrees of freedom for the model is 1709 and the objective function was 206.88 The number of observations was 89 with Chi Square = 13550.72 with prob < 0

The root mean square of the residuals (RMSA) is 0.04

plot(xxx.pca1)

plot(xxx.pca1, type = "l")

fviz_eig(xxx.pca1)

Uniqueness and complexity

Uniqueness is the share of a variable's variance that is not shared with the other variables. For PCA we want it to be low; a low value means the variable is well represented by the components and shares its variance with the rest of the model. The notion is closely related to multicollinearity.

Complexity corresponds to the number of factors on which a variable has a substantial loading. When a variable loads heavily on one factor only, its complexity is close to 1. In other words, it tells us how the contribution to a single variable is spread across all the factors; we want to keep it low, because high complexity makes the factors harder to interpret.

set<-data.frame(complex=xxx.pca4$complexity, unique=xxx.pca4$uniqueness)
set.worst<-set[set$complex>1.8 & set$unique>0.78,]
set.worst %>% kable_head
[1] "
complex unique
v2xel_elecpres_VDEM 1.927795 0.8468625
parreg_P4 1.954955 0.8248794

"

Combining complexity and uniqueness together we can spot variables that may lead to problems.

fviz_pca_ind(xxx.pca1, col.ind="cos2", geom="point", gradient.cols=c("white", "#2E9FDF", "#FC4E07" ))

Summary

Estimating the number of clusters in a dataset can be quite hard, and a poor choice yields unexpected results. Domain experts may claim that 13 clusters in a dataset of 89 observations is surprisingly many. On the other hand, the gap statistic has been used together with a cluster-validation package, both of which support this task without relying on external knowledge.

The dataset was very useful for understanding problems of International Political Economy. The clustering results are very promising given the proposed cluster descriptions, which suggest that Polish history, with all its variability, is a very appealing topic to get to know better.

Bibliography

Brock, Guy, Vasyl Pihur, Susmita Datta, and Somnath Datta. 2008. “ClValid: An R Package for Cluster Validation.” Journal of Statistical Software, Articles 25 (4): 1–22. https://doi.org/10.18637/jss.v025.i04.

Fidelis, Malgorzata. 2004. “Equality Through Protection: The Politics of Women’s Employment in Postwar Poland, 1945-1956.” Slavic Review 63 (2): 301–24. http://www.jstor.org/stable/3185730.

Graham, Benjamin A. T., and Jacob R. Tucker. 2016. “The International Political Economy Data Resource.” Harvard Dataverse. https://doi.org/10.7910/DVN/X093TV.

Mayhew, David R. 1991. “Divided Party Control: Does It Make a Difference?” PS: Political Science and Politics 24 (4): 637–40. http://www.jstor.org/stable/419392.

Sandberg, Mikael, and Per Lundberg. 2012. “Political Institutions and Their Historical Dynamics.” PLOS ONE 7 (10): 1–10. https://doi.org/10.1371/journal.pone.0045838.

Spence, J. E. 1995. “Two worlds of international relations: academics, practitioners and the trade in ideas.” International Affairs 71 (2): 359–60. https://doi.org/10.2307/2623442.

Staniak, Mateusz, and Przemysław Biecek. 2019. “The Landscape of R Packages for Automated Exploratory Data Analysis.” The R Journal 11 (2): 347. https://doi.org/10.32614/rj-2019-033.

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2001. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society Series B 63 (February): 411–23. https://doi.org/10.1111/1467-9868.00293.