International Political Economy (IPE) can be defined as the field that focuses on a particular range of questions and a set of assumptions about the nature of the international system and how it is understood. The critical part of this definition is its concentration on a set of questions spanning four main aspects: politics, economics, international politics, and international economics. IPE is also closely linked to foreign policy, although the impact of IPE scholarship on practice is difficult to assess (Spence 1995).
To provide a master dataset that can be used by researchers in international political economy, 89 data resources have been merged (Graham and Tucker 2016). Observations are identified by country-year, the unit of analysis. Countries are identified by their Gleditsch-Ward number and, in an alternative version, by their Correlates of War (COW) code. Most of the dataset components begin after the Second World War. Each of the 89 source datasets has been given a unique suffix that identifies it in the variable names.
First the data is loaded and converted to a tibble, as the tidyverse will be the main toolkit in this project.
# Load the merged IPE master table and the variable-name mapping sheet
library(tidyverse)
load("data/master_ipe_v4.rdata")
ipe_mapping <- xlsx::read.xlsx2("data/ipe_mapping.xlsx", sheetIndex = 1)
ipe_v4 <- as_tibble(ipe_v4)
Dimensions of the dataset need to be investigated as well.
ipe_v4 %>% dim()
[1] 25850 1043
The master table consists of 25,850 rows and 1,043 variables. Each variable is described in the codebook provided by the dataset maintainers, and the variables are divided into groups by source dataset (identified by the suffix, as sketched below).
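As a quick, hedged sketch (not part of the original workflow), the suffix convention can be used to count how many columns each source contributes:

# Count columns per source-dataset suffix (the text after the last underscore); illustrative only
ipe_v4 %>%
  names() %>%
  stringr::str_extract("[^_]+$") %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  head(10)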
Within the political section, the Varieties of Democracy dataset (VDEM) provides data of very good quality; it will be included not only for that reason but also because of its popularity and its recommendation during classes. The other component chosen is the Polity IV democracy dataset.
# Keep Poland, drop the WWII years (1939-1945) and 2020, and select the VDEM (v2*) and
# Polity IV (*_P4) columns, excluding identifier and flag variables
pol_ds <-
ipe_v4 %>%
filter(country == "Poland", year > 1918,
!(year %in% c(1939, 1940, 1941, 1942, 1943, 1944, 1945, 2020))) %>%
select(year, starts_with("v2"), contains("P4"),
-c(change_P4, sf_P4, fragment_P4, regtrans_P4, countryname_raw_P4,
v2x_hosinter_VDEM, v2x_suffr_VDEM, v2elsrgel_VDEM, v2elreggov_VDEM,
v2xlg_leginter_VDEM, v2x_elecreg_VDEM, v2xlg_elecreg_VDEM))
In every modeling process, exploratory data analysis should be performed before moving on. Thanks to the broad landscape of autoEDA packages (Staniak and Biecek 2019) we can easily build some intuition about the analyzed data. The chosen subset is a good starting point for analysis: only two columns, v2x_gender_VDEM and v2x_genpp_VDEM, contain a missing value. The summary below shows the distribution of each variable. The year column confirms that there is no data for the years 1939-1948. Most of the variables are indexes with a range between 0 and 1, while the minimum of the P4 variables can be below zero. The durable_P4 variable is highly skewed; v2xcl_rol_VDEM shows the opposite pattern, although it has a range of values with little to no observations.
# dfSummary() from the summarytools package produces per-variable summary statistics
dfSummary(pol_ds, plain.ascii = FALSE, style = "grid",
          graph.magnif = 0.75, valid.col = FALSE, tmp.img.dir = "/tmp")
## temporary images written to 'E:\tmp'
pol_ds
Dimensions: 89 x 61
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | year [numeric] | Mean (sd): 1970.8 (29); min < med < max: 1919 < 1973 < 2017; IQR (CV): 44 (0) | 89 distinct values | 0 (0.0%) |
| 2 | v2x_polyarchy_VDEM [numeric] | Mean (sd): 0.4 (0.3); min < med < max: 0.2 < 0.3 < 0.9; IQR (CV): 0.7 (0.7) | 54 distinct values | 0 (0.0%) |
| 3 | v2x_api_VDEM [numeric] | Mean (sd): 0.6 (0.3); min < med < max: 0.3 < 0.5 < 1; IQR (CV): 0.6 (0.4) | 53 distinct values | 0 (0.0%) |
| 4 | v2x_mpi_VDEM [numeric] | Mean (sd): 0.3 (0.4); min < med < max: 0 < 0 < 0.9; IQR (CV): 0.8 (1.3) | 38 distinct values | 0 (0.0%) |
| 5 | v2x_EDcomp_thick_VDEM [numeric] | Mean (sd): 0.5 (0.3); min < med < max: 0.2 < 0.3 < 0.9; IQR (CV): 0.6 (0.5) | 55 distinct values | 0 (0.0%) |
| 6 | v2x_libdem_VDEM [numeric] | Mean (sd): 0.4 (0.3); min < med < max: 0.1 < 0.2 < 0.9; IQR (CV): 0.6 (0.8) | 58 distinct values | 0 (0.0%) |
| 7 | v2x_liberal_VDEM [numeric] | Mean (sd): 0.6 (0.2); min < med < max: 0.3 < 0.6 < 1; IQR (CV): 0.5 (0.4) | 57 distinct values | 0 (0.0%) |
| 8 | v2x_partipdem_VDEM [numeric] | Mean (sd): 0.3 (0.3); min < med < max: 0 < 0.1 < 0.7; IQR (CV): 0.5 (1) | 57 distinct values | 0 (0.0%) |
| 9 | v2x_partip_VDEM [numeric] | Mean (sd): 0.3 (0.2); min < med < max: 0.1 < 0.3 < 0.7; IQR (CV): 0.5 (0.8) | 39 distinct values | 0 (0.0%) |
| 10 | v2x_delibdem_VDEM [numeric] | Mean (sd): 0.3 (0.3); min < med < max: 0 < 0.2 < 0.8; IQR (CV): 0.6 (1) | 58 distinct values | 0 (0.0%) |
| 11 | v2xdl_delib_VDEM [numeric] | Mean (sd): 0.4 (0.3); min < med < max: 0 < 0.4 < 0.9; IQR (CV): 0.7 (0.7) | 42 distinct values | 0 (0.0%) |
| 12 | v2x_egaldem_VDEM [numeric] | Mean (sd): 0.4 (0.2); min < med < max: 0.2 < 0.2 < 0.8; IQR (CV): 0.5 (0.6) | 52 distinct values | 0 (0.0%) |
| 13 | v2x_egal_VDEM [numeric] | Mean (sd): 0.8 (0.1); min < med < max: 0.6 < 0.8 < 0.9; IQR (CV): 0.1 (0.1) | 28 distinct values | 0 (0.0%) |
| 14 | v2x_frassoc_thick_VDEM [numeric] | Mean (sd): 0.5 (0.4); min < med < max: 0.1 < 0.4 < 0.9; IQR (CV): 0.8 (0.8) | 52 distinct values | 0 (0.0%) |
| 15 | v2x_freexp_altinf_VDEM [numeric] | Mean (sd): 0.5 (0.4); min < med < max: 0.1 < 0.6 < 1; IQR (CV): 0.8 (0.7) | 37 distinct values | 0 (0.0%) |
| 16 | v2xme_altinf_VDEM [numeric] | Mean (sd): 0.5 (0.4); min < med < max: 0.1 < 0.5 < 0.9; IQR (CV): 0.8 (0.8) | 20 distinct values | 0 (0.0%) |
| 17 | v2xel_frefair_VDEM [numeric] | Mean (sd): 0.5 (0.3); min < med < max: 0.2 < 0.3 < 1; IQR (CV): 0.7 (0.6) | 51 distinct values | 0 (0.0%) |
| 18 | v2x_elecoff_VDEM [numeric] | Mean (sd): 0.9 (0.3); min < med < max: 0 < 1 < 1; IQR (CV): 0 (0.3) | 0.00: 9 (10.1%), 0.83: 4 (4.5%), 1.00: 76 (85.4%) | 0 (0.0%) |
| 19 | v2xcl_rol_VDEM [numeric] | Mean (sd): 0.7 (0.2); min < med < max: 0.3 < 0.8 < 1; IQR (CV): 0.5 (0.3) | 46 distinct values | 0 (0.0%) |
| 20 | v2x_jucon_VDEM [numeric] | Mean (sd): 0.7 (0.2); min < med < max: 0.4 < 0.7 < 1; IQR (CV): 0.4 (0.3) | 25 distinct values | 0 (0.0%) |
| 21 | v2xlg_legcon_VDEM [numeric] | Mean (sd): 0.5 (0.3); min < med < max: 0.1 < 0.3 < 0.9; IQR (CV): 0.7 (0.7) | 27 distinct values | 0 (0.0%) |
| 22 | v2x_cspart_VDEM [numeric] | Mean (sd): 0.5 (0.3); min < med < max: 0.1 < 0.4 < 0.9; IQR (CV): 0.7 (0.7) | 29 distinct values | 0 (0.0%) |
| 23 | v2xdd_dd_VDEM [numeric] | Mean (sd): 0.1 (0.1); min < med < max: 0 < 0 < 0.2; IQR (CV): 0.1 (1.5) | 14 distinct values | 0 (0.0%) |
| 24 | v2xel_locelec_VDEM [numeric] | Mean (sd): 0.4 (0.4); min < med < max: 0 < 0.3 < 1; IQR (CV): 0.9 (1) | 14 distinct values | 0 (0.0%) |
| 25 | v2xel_regelec_VDEM [numeric] | Mean (sd): 0.2 (0.3); min < med < max: 0 < 0.1 < 0.9; IQR (CV): 0.2 (1.4) | 19 distinct values | 0 (0.0%) |
| 26 | v2xeg_eqprotec_VDEM [numeric] | Mean (sd): 0.7 (0.1); min < med < max: 0.6 < 0.7 < 0.9; IQR (CV): 0.1 (0.2) | 12 distinct values | 0 (0.0%) |
| 27 | v2xeg_eqdr_VDEM [numeric] | Mean (sd): 0.8 (0.1); min < med < max: 0.5 < 0.9 < 0.9; IQR (CV): 0 (0.2) | 22 distinct values | 0 (0.0%) |
| 28 | v2xcs_ccsi_VDEM [numeric] | Mean (sd): 0.5 (0.3); min < med < max: 0.1 < 0.4 < 0.9; IQR (CV): 0.7 (0.6) | 29 distinct values | 0 (0.0%) |
| 29 | v2xps_party_VDEM [numeric] | Mean (sd): 0.8 (0.1); min < med < max: 0.6 < 0.7 < 0.9; IQR (CV): 0.2 (0.1) | 25 distinct values | 0 (0.0%) |
| 30 | v2x_gender_VDEM [numeric] | Mean (sd): 0.7 (0.1); min < med < max: 0.5 < 0.7 < 1; IQR (CV): 0.3 (0.2) | 49 distinct values | 1 (1.1%) |
| 31 | v2x_gencl_VDEM [numeric] | Mean (sd): 0.7 (0.2); min < med < max: 0.4 < 0.8 < 0.9; IQR (CV): 0.3 (0.2) | 28 distinct values | 0 (0.0%) |
| 32 | v2x_gencs_VDEM [numeric] | Mean (sd): 0.6 (0.2); min < med < max: 0.5 < 0.5 < 0.9; IQR (CV): 0.4 (0.3) | 30 distinct values | 0 (0.0%) |
| 33 | v2x_genpp_VDEM [numeric] | Mean (sd): 0.8 (0.2); min < med < max: 0.5 < 0.9 < 1; IQR (CV): 0.3 (0.2) | 28 distinct values | 1 (1.1%) |
| 34 | v2xex_elecreg_VDEM [integer] | Min: 0; Mean: 0.3; Max: 1 | 0: 61 (68.5%), 1: 28 (31.5%) | 0 (0.0%) |
| 35 | v2xel_elecparl_VDEM [integer] | Min: 0; Mean: 0.3; Max: 1 | 0: 65 (73.0%), 1: 24 (27.0%) | 0 (0.0%) |
| 36 | v2xel_elecpres_VDEM [integer] | Min: 0; Mean: 0.1; Max: 1 | 0: 83 (93.3%), 1: 6 (6.7%) | 0 (0.0%) |
| 37 | v2x_corr_VDEM [numeric] | Mean (sd): 0.2 (0); min < med < max: 0.1 < 0.2 < 0.2; IQR (CV): 0.1 (0.3) | 31 distinct values | 0 (0.0%) |
| 38 | v2x_pubcorr_VDEM [numeric] | Mean (sd): 0.2 (0.1); min < med < max: 0.1 < 0.2 < 0.4; IQR (CV): 0.2 (0.3) | 23 distinct values | 0 (0.0%) |
| 39 | v2x_execorr_VDEM [numeric] | Mean (sd): 0.1 (0); min < med < max: 0.1 < 0.2 < 0.2; IQR (CV): 0.1 (0.3) | 21 distinct values | 0 (0.0%) |
| 40 | v2x_divparctrl_VDEM [numeric] | Mean (sd): -0.4 (1.1); min < med < max: -1.6 < -0.3 < 1.3; IQR (CV): 2.2 (-2.6) | 23 distinct values | 0 (0.0%) |
| 41 | v2x_feduni_VDEM [numeric] | Mean (sd): 0.3 (0.4); min < med < max: 0 < 0.1 < 0.9; IQR (CV): 0.4 (1.1) | 19 distinct values | 0 (0.0%) |
| 42 | v2x_civlib_VDEM [numeric] | Mean (sd): 0.7 (0.2); min < med < max: 0.3 < 0.7 < 1; IQR (CV): 0.5 (0.4) | 54 distinct values | 0 (0.0%) |
| 43 | v2x_clphy_VDEM [numeric] | Mean (sd): 0.7 (0.2); min < med < max: 0.3 < 0.8 < 1; IQR (CV): 0.4 (0.3) | 27 distinct values | 0 (0.0%) |
| 44 | v2x_clpol_VDEM [numeric] | Mean (sd): 0.5 (0.3); min < med < max: 0.1 < 0.6 < 1; IQR (CV): 0.8 (0.6) | 43 distinct values | 0 (0.0%) |
| 45 | v2x_clpriv_VDEM [numeric] | Mean (sd): 0.7 (0.2); min < med < max: 0.3 < 0.8 < 0.9; IQR (CV): 0.3 (0.3) | 37 distinct values | 0 (0.0%) |
| 46 | v2cltrnslw_VDEM [numeric] | Mean (sd): 1.1 (0.7); min < med < max: 0.1 < 1.2 < 2.1; IQR (CV): 1.1 (0.6) | 14 distinct values | 0 (0.0%) |
| 47 | v2juhcind_VDEM [numeric] | Mean (sd): 0.9 (1.1); min < med < max: -0.6 < 0.8 < 2.6; IQR (CV): 2.6 (1.2) | 19 distinct values | 0 (0.0%) |
| 48 | v2juhccomp_VDEM [numeric] | Mean (sd): 1.3 (0.6); min < med < max: 0.2 < 1 < 2.2; IQR (CV): 1 (0.4) | 14 distinct values | 0 (0.0%) |
| 49 | v2excrptps_VDEM [numeric] | Mean (sd): 0.7 (0.3); min < med < max: 0.2 < 0.9 < 1.3; IQR (CV): 0.6 (0.5) | 18 distinct values | 0 (0.0%) |
| 50 | v2jucorrdc_VDEM [numeric] | Mean (sd): 1.9 (0.5); min < med < max: 1.3 < 2.2 < 2.5; IQR (CV): 1.1 (0.3) | 9 distinct values | 0 (0.0%) |
| 51 | v2exbribe_VDEM [numeric] | Mean (sd): 1.4 (0.4); min < med < max: 0.9 < 1.3 < 2.3; IQR (CV): 0.8 (0.3) | 18 distinct values | 0 (0.0%) |
| 52 | v2lgcrrpt_VDEM [numeric] | Mean (sd): 0.8 (0.2); min < med < max: 0.2 < 0.8 < 1.4; IQR (CV): 0.2 (0.3) | 13 distinct values | 0 (0.0%) |
| 53 | democ_P4 [numeric] | Mean (sd): 3.8 (4.4); min < med < max: 0 < 2 < 10; IQR (CV): 9 (1.1) | 0: 44 (49.4%), 2: 9 (10.1%), 5: 2 (2.2%), 8: 11 (12.4%), 9: 7 (7.9%), 10: 16 (18.0%) | 0 (0.0%) |
| 54 | autoc_P4 [numeric] | Mean (sd): 3.9 (3.3); min < med < max: 0 < 5 < 8; IQR (CV): 7 (0.8) | 0: 36 (40.4%), 5: 9 (10.1%), 6: 7 (7.9%), 7: 35 (39.3%), 8: 2 (2.2%) | 0 (0.0%) |
| 55 | polity_P4 [numeric] | Mean (sd): -0.1 (7.6); min < med < max: -8 < -3 < 10; IQR (CV): 16 (-75.3) | -8: 2 (2.2%), -7: 35 (39.3%), -6: 7 (7.9%), -3: 9 (10.1%), 5: 2 (2.2%), 8: 11 (12.4%), 9: 7 (7.9%), 10: 16 (18.0%) | 0 (0.0%) |
| 56 | polity2_P4 [numeric] | Mean (sd): -0.1 (7.6); min < med < max: -8 < -3 < 10; IQR (CV): 16 (-75.3) | -8: 2 (2.2%), -7: 35 (39.3%), -6: 7 (7.9%), -3: 9 (10.1%), 5: 2 (2.2%), 8: 11 (12.4%), 9: 7 (7.9%), 10: 16 (18.0%) | 0 (0.0%) |
| 57 | durable_P4 [numeric] | Mean (sd): 14.4 (11.6); min < med < max: 0 < 12 < 41; IQR (CV): 19 (0.8) | 42 distinct values | 0 (0.0%) |
| 58 | xconst_P4 [numeric] | Mean (sd): 4.7 (1.9); min < med < max: 2 < 4 < 7; IQR (CV): 4 (0.4) | 2: 2 (2.2%), 3: 42 (47.2%), 4: 2 (2.2%), 5: 9 (10.1%), 6: 4 (4.5%), 7: 30 (33.7%) | 0 (0.0%) |
| 59 | parreg_P4 [numeric] | Mean (sd): 3.7 (1); min < med < max: 2 < 4 < 5; IQR (CV): 0 (0.3) | 2: 20 (22.5%), 4: 53 (59.6%), 5: 16 (18.0%) | 0 (0.0%) |
| 60 | parcomp_P4 [numeric] | Mean (sd): 2.5 (1.6); min < med < max: 1 < 2 < 5; IQR (CV): 3 (0.6) | 1: 37 (41.6%), 2: 16 (18.0%), 3: 7 (7.9%), 4: 13 (14.6%), 5: 16 (18.0%) | 0 (0.0%) |
| 61 | polcomp_P4 [numeric] | Mean (sd): 4.4 (3.9); min < med < max: 1 < 2 < 10; IQR (CV): 8 (0.9) | 1: 37 (41.6%), 2: 16 (18.0%), 7: 7 (7.9%), 9: 13 (14.6%), 10: 16 (18.0%) | 0 (0.0%) |
Clustering is an unsupervised learning technique that uses machine learning algorithms to assign similar data points to groups. It is mostly used for discovering hidden patterns in data. The main idea is to group similar observations together; the more distinct the groups are from one another, the better. Ideally, the results of clustering should not only have good statistical properties (compact, well-separated, connected and stable) but also be meaningful in terms of the data being processed.
The data needs to be scaled before further analysis, and the missing values should be filled with a chosen technique.
# Fill the two missing cells (last observation carried forward), then standardize every column
pol_ds_scaled <- pol_ds %>% fill(v2x_gender_VDEM, v2x_genpp_VDEM) %>% scale() %>% as_tibble()
The first step of this analysis is to assess the clusterability of the dataset. The agnes() function from the cluster package returns the agglomerative coefficient, which measures the amount of clustering structure found, together with an easy way to visualize it as a banner plot. A coefficient close to one suggests a strong clustering structure in the dataset.
hc2 <- agnes(pol_ds_scaled)
print(hc2$ac)
[1] 0.9184494
The value shows that the chosen dataset is a very good starting point for clustering analysis.
# Compare agglomerative coefficients across the available linkage methods
m <- c("average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(pol_ds_scaled, method = x)$ac
}
map_dbl(m, ac)
  average    single  complete      ward
0.9184494 0.8827847 0.9333211 0.9850385
Other methods also yield very promising results.
To keep heterogeneity high between the clusters and low within them, an appropriate number of clusters should be chosen. A very simple rule of thumb sets the number of clusters to the square root of n/2, where n is the number of observations.
sqrt((pol_ds_scaled %>% nrow()) / 2)
[1] 6.670832
Next, we turn to the elbow method.
opt <- Optimal_Clusters_KMeans(pol_ds_scaled, max_clusters=10, plot_clusters = TRUE)
The elbow method, based on the plotted within-group heterogeneity, suggests that 6 is an appropriate number of clusters; others may argue for 3, the point where the decrease becomes roughly linear.
opt <- Optimal_Clusters_KMeans(pol_ds_scaled,
max_clusters=10,
'euclidean',
plot_clusters = TRUE,
criterion = 'silhouette')
The other method uses the silhouette index to find the best number of groups in the dataset; it shows the maximum value for 2 clusters and the lowest for 6. This measure takes into account both how close points are to the other points in their own cluster and how far they are from the points of other clusters.
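For reference, the standard definition of the silhouette of a single observation i (a textbook formula, not specific to this analysis) is

s(i) = (b(i) − a(i)) / max(a(i), b(i)),

where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster; values of s(i) close to 1 indicate well-clustered points.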
The methods presented above are not very formal because they rely on graphical inspection. One way to address this drawback is a statistical procedure called the gap statistic (Tibshirani, Walther, and Hastie 2001). The idea is to compare log(W_k), where W_k is the pooled within-cluster dispersion for k clusters, with its expectation under an appropriate null reference distribution of the data. The estimated optimal number of clusters is then the value of k for which log(W_k) falls the farthest below this reference curve. One assumption of this procedure is that the data can be separated reasonably well.
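In the notation of the original paper, the statistic is

Gap_n(k) = E*_n[log(W_k)] − log(W_k),

and the estimated number of clusters is the smallest k such that Gap(k) ≥ Gap(k+1) − s_{k+1}, where s_{k+1} accounts for the simulation error of the reference distribution. The clusGap() call below reports a closely related criterion, "firstSEmax".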
clusGapRes <- clusGap(pol_ds_scaled, FUN = kmeans, nstart = 20, K.max = 20, B = 60)
The number of clusters proposed by this more sophisticated method is 13, reported as “Number of clusters (method ‘firstSEmax’, SE.factor=1): 13”. Mainly because of its more principled methodology, this number has been used in the further analysis.
chosen_size <- 13
Partition-based clustering treats each cluster as being represented by a center of its data points. The main advantages of this kind of algorithm are its favourable time complexity and the fact that it generally computes good results. Its main drawbacks are that the number of clusters must be specified in advance and that the results are sensitive to initialization, outliers, and non-convex cluster shapes.
K-means is probably the most widely used algorithm for unsupervised learning. The number of clusters must be specified up front. The goal is straightforward: minimize the differences within clusters and maximize the differences between clusters. The algorithm works as follows: each point is assigned to a cluster based on a chosen distance metric, the cluster centers are then recalculated, and the assignments are updated accordingly; the process repeats until no further improvement can be made. An improvement on this algorithm is k-medoids, which uses actual data points (medoids) as the representatives of the corresponding clusters. The objective is written out below.
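Formally (standard textbook notation, not taken from this report), k-means seeks a partition C_1, ..., C_K of the observations that minimizes the total within-cluster sum of squares

W(C) = sum over k = 1..K of sum over x_i in C_k of ||x_i − mu_k||^2,

where mu_k is the mean (centroid) of cluster C_k; k-medoids replaces mu_k with an actual observation (the medoid) and can work with arbitrary dissimilarities.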
pol_clusters <- kmeans(pol_ds_scaled %>% as.matrix(), chosen_size)
pol_ds$kmeans <- pol_clusters$cluster
fviz_cluster(pol_clusters, data = pol_ds_scaled)
One metric for k-means clustering is the ratio of the between-cluster sum of squares to the total sum of squares:
(pol_clusters$betweenss /pol_clusters$totss)*100
[1] 94.50958
94.50958% is the share of the total variance in the data that is explained by the clustering, i.e. the between-cluster sum of squares amounts to 94.5% of the total sum of squares.
sil <- silhouette(x = pol_ds$kmeans, dist(pol_ds_scaled))
fviz_silhouette(sil)
   cluster size ave.sil.width
1        1    7          0.73
2        2    4          0.22
3        3    4          0.25
4        4    3          0.19
5        5    4          0.63
6        6    9          0.51
7        7    8          0.41
8        8   12          0.28
9        9    1          0.00
10      10   10          0.57
11      11    4          0.60
12      12    7          0.69
13      13   16          0.37
PAM focuses on objects that are centrally located in a cluster (medoids). The procedure chooses a set of medoids and iteratively replaces one of the medoids with one of the non-medoids, checking whether the swap improves the total distance of the resulting clustering. The algorithm stops when no further improvement is possible.
c1<-pam(pol_ds_scaled, chosen_size)
pol_ds$pam <- c1$clustering
fviz_cluster(c1, geom="point", ellipse.type="norm")
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
CLARA draws multiple samples of the dataset and applies PAM to each sample, which avoids the main weakness of PAM, its poor scalability to larger datasets.
clara_flex <- eclust(pol_ds_scaled %>% as.matrix(), "clara", k=chosen_size)
pol_ds$clara <- clara_flex$clustering
After assigning the cluster labels we can visualize the resulting clusters.
fviz_cluster(clara_flex, geom="point", ellipse.type="norm")
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
The main advantage of hierarchical clustering over k-means is that the number of clusters does not have to be specified up front. The algorithm proceeds iteratively until only a single cluster remains and is usually visualized with a dendrogram, a tree structure. Two main types of hierarchical clustering are widely discussed: agglomerative (bottom-up) and divisive (top-down). These algorithms handle arbitrary cluster shapes and data types well and are often able to detect hierarchical relationships, which is useful in many disciplines such as biology or economics. Unfortunately, they do not scale efficiently to large datasets.
hc3 <- agnes(pol_ds_scaled, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "dendrogram - agnes")
To perform hierarchical clustering, the Euclidean distance matrix is calculated first and the dendrogram is then cut at k = chosen_size (13 clusters). The chosen linkage is Ward’s method, which is usually a sensible default.
d_pol_ds <- dist(pol_ds_scaled, method = "euclidean")
hc5 <- hclust(d_pol_ds, method = "ward.D2" )
sub_grp <- cutree(hc5, k = chosen_size)
pol_ds <-
pol_ds %>%
mutate(hier_ward_D2 = sub_grp)
fviz_cluster(list(data = pol_ds_scaled, cluster = pol_ds$hier_ward_D2))
hc4 <- diana(pol_ds_scaled)
pltree(hc4, cex = 0.6, hang = -1, main = "dendrogram - diana")
sub_grp <- cutree(hc4, k = chosen_size)
pol_ds <-
pol_ds %>%
mutate(hier_diana = sub_grp)
The divisive counterpart, DIANA, is fitted in the same way and its dendrogram is likewise cut at k = chosen_size.
Having discussed the best-known methods, other types of cluster validation can be considered as well, such as the “internal” and “stability” measures from the clValid package (Brock et al. 2008). The internal measures quantify the compactness, connectedness, and separation of the cluster partitions. Connectedness is evaluated from the number of nearest neighbours that are assigned to the same cluster. Compactness relates to within-cluster homogeneity and is often measured by the intra-cluster variance. Separation is typically measured by the distance between cluster centroids. The Dunn index and the silhouette width are non-linear combinations of compactness and separation. Stability is evaluated as the consistency of the clustering when it is compared with the clusterings obtained after removing each column, one at a time.
intern <- clValid(pol_ds_scaled %>% as.matrix(),
chosen_size,
clMethods =
c("hierarchical", "kmeans", "diana", "agnes", "pam",
"clara", "model"),
validation = "internal"
)
intern@measures %>%
as_tibble() %>%
pivot_longer(cols = everything()) %>%
separate(name, into = c("number", "model_name"), sep="[.]") %>%
select(model_name, value) %>%
mutate( measure_id = 0:20 %/% 7) %>%
inner_join(tibble(measure_id = 0:2, measure_name = c("Connectivity", "Dunn", "Silhouette"))) %>%
select(-measure_id) %>%
pivot_wider(names_from = measure_name, values_from = value) %>%
kable_head()
## Joining, by = "measure_id"
[1] "
| model_name | Connectivity | Dunn | Silhouette |
|---|---|---|---|
| hierarchical | 39.46508 | 0.4966301 | 0.4219525 |
| kmeans | 44.47103 | 0.1928479 | 0.4250111 |
| diana | 42.39206 | 0.4604724 | 0.4176306 |
| agnes | 39.46508 | 0.4966301 | 0.4219525 |
| pam | 50.07302 | 0.1536358 | 0.4156764 |
| clara | 54.58730 | 0.1536358 | 0.4130862 |
| model | 95.48611 | 0.0322389 | 0.1411220 |
"
Connectivity should be minimized; it is highest for PAM, CLARA, and the model-based algorithm, and lowest for hierarchical clustering and agnes, with diana close behind. The Dunn index, which should be maximized, also favours the agglomerative and divisive hierarchical methods. The silhouette width is clearly lowest for the model-based algorithm and comparable across the remaining methods. Based on these measures, diana is treated as the preferred algorithm in what follows.
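The stability measures described earlier were not computed in this report; a minimal sketch of how they could be obtained with clValid, assuming the same inputs (not run here):

stab <- clValid(pol_ds_scaled %>% as.matrix(),
                chosen_size,
                clMethods = c("hierarchical", "kmeans", "diana", "agnes", "pam", "clara"),
                validation = "stability")
optimalScores(stab)   # APN, AD, ADM and FOM with the best-scoring method for each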
Based on the per-cluster results we can try to name the formed clusters in terms of international political economy.
pol_ds_long <-
pol_ds %>%
select(-year, -kmeans, -pam,
-clara, -hier_ward_D2) %>%
select(starts_with("v2x_"), hier_diana) %>%
pivot_longer(cols = c(-hier_diana)) %>%
inner_join(ipe_mapping, by = c("name" = "variable"))
pol_ds_long %>%
ggplot(aes(x = factor(hier_diana),
y = value,
fill= factor(hier_diana))) +
geom_boxplot() +
facet_wrap(~pretty_name, ncol = 3, shrink = FALSE, scales = "free") +
labs(title = 'Boxplots per variable per cluster') +
xlab("Cluster") +
ylab("Value") +
theme(legend.position = "none")
Describing clusters across so many variables is complicated. The first easily visible insight is that cluster 9 has the highest IQR across most of the dimensions. Clusters 9 through 13 are very similar to one another, which might suggest that 13 clusters somewhat overfits the data.
pol_ds_long <-
pol_ds %>%
select(-year, -kmeans, -pam,
-clara, -hier_ward_D2) %>%
select(-contains("v2"), hier_diana) %>%
pivot_longer(cols = c(-hier_diana)) %>%
inner_join(ipe_mapping, by = c("name" = "variable"))
pol_ds_long %>%
ggplot(aes(x = factor(hier_diana),
y = value,
fill= factor(hier_diana))) +
geom_boxplot() +
facet_wrap(~pretty_name, ncol = 5, shrink = FALSE, scales = "free") +
labs(title = 'Boxplots per variable per cluster') +
xlab("Cluster") +
ylab("Value") +
theme(legend.position = "none")
The list of years in each cluster can be used to evaluate the clustering quality against political-economy knowledge.
pol_ds %>%
group_by(hier_diana) %>%
mutate(year = paste0(year, collapse=",")) %>%
slice(1) %>%
select(year) %>%
kable_head()
## Adding missing grouping variables: `hier_diana`
[1] "
| hier_diana | year |
|---|---|
| 1 | 1919,1920,1921,1922,1923,1924,1925 |
| 2 | 1926,1927,1928,1929,1930,1931,1932,1933,1934 |
| 3 | 1935,1936,1937,1938 |
| 4 | 1949,1950,1951,1953,1954,1955,1956 |
| 5 | 1952 |
| 6 | 1957,1961,1965,1969,1972,1976,1980,1985 |
| 7 | 1958,1959,1960,1962,1963,1964,1966,1967,1968,1970,1971,1973,1974,1975,1977,1978,1979,1981,1982,1983,1984,1986,1987,1988 |
| 8 | 1989 |
| 9 | 1990,1995 |
| 10 | 1991,1992,1993,1994,1996,1997,1998,1999,2001 |
| 11 | 2000,2005,2010,2015 |
| 12 | 2002,2003,2004,2006,2007,2008,2009,2011,2012,2013,2014 |
| 13 | 2016,2017 |
"
Each cluster can be briefly characterized by the years it contains and by the per-variable box plots shown above.
The main idea behind model-based (Gaussian mixture) clustering is that the points are assumed to come from a mixture of several Gaussian distributions. Its main advantage is a well-developed mathematical theory that yields realistic probabilities of cluster membership. A drawback we cannot omit is that many parameters need to be specified or estimated.
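As an illustration only (a hedged sketch, not part of the original analysis), such a mixture could be fitted with the mclust package:

library(mclust)
# Fit a Gaussian mixture with chosen_size components on the scaled data
gmm_fit <- Mclust(as.matrix(pol_ds_scaled), G = chosen_size)
summary(gmm_fit)
head(gmm_fit$z)   # soft membership probabilities for each observation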
The main idea behind density-based algorithms is that clusters are defined by regions of high point density. DBSCAN is the most widely discussed algorithm of this kind; it takes two parameters, the radius of the neighbourhood (eps) and the minimum number of points required in that neighbourhood.
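DBSCAN is not applied in this report; a minimal sketch with the dbscan package, where the eps and minPts values are illustrative guesses rather than tuned choices:

library(dbscan)
kNNdistplot(as.matrix(pol_ds_scaled), k = 4)            # inspect k-nearest-neighbour distances to pick eps
db <- dbscan(as.matrix(pol_ds_scaled), eps = 2, minPts = 4)
table(db$cluster)                                        # cluster 0 collects the points labelled as noise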
The key feature of model-based algorithms is that a different, best-fitting model is chosen for each cluster. Two main families can be distinguished: those based on statistical learning (COBWEB, GMM) and those based on neural-network learning (SOM and ART).
The idea behind dimensionality reduction is to represent a dataset with fewer dimensions in order to make it more interpretable, easier to visualize, and less computationally demanding. In a dataset with many variables, the variables may be correlated with each other, so a variable can often be expressed as a combination of other features. Importantly, the new features should not be correlated with one another.
The most common approach for reducing the number of features is the following:
1. Decide on data standardization.
2. Compute the covariance matrix over all pairs of variables.
3. From the covariance matrix, compute the eigenvectors and eigenvalues.
4. Select the top eigenvectors by eigenvalue.
5. The new principal components are linear combinations of the original variables with the highest variance.
6. Build the projection matrix with the selected eigenvectors as columns.
7. Recast the data along the new principal axes.
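A compact sketch of steps 2-7 on the already standardized data (step 1), using base R only; it mirrors what prcomp() does further below:

cov_mat <- cov(as.matrix(pol_ds_scaled))       # step 2: covariance matrix of the standardized data
eig     <- eigen(cov_mat)                      # step 3: eigenvectors and eigenvalues
top2    <- eig$vectors[, 1:2]                  # step 4: keep the two leading eigenvectors
scores  <- as.matrix(pol_ds_scaled) %*% top2   # steps 6-7: recast the data along the new axes
eig$values[1:2] / sum(eig$values)              # variance share carried by the first two components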
pol_ds_scaled_cor <- cor(pol_ds_scaled, method="pearson")
corrplot(pol_ds_scaled_cor, order ="alphabet", tl.cex=0.6)
Although there are a lot of variables on the plot, we can clearly see that many of them are correlated, either negatively or positively.
As our data relates to the social sciences, we can start with multidimensional scaling (MDS). By definition, MDS is a nonlinear statistical technique that originated in psychometrics. It reduces the number of dimensions while preserving the distances between points, which in turn preserves patterns and clusters. The measure used in this method, called a proximity, is a number that indicates how similar two objects are, or are perceived to be.
Scaling models can be broadly classified into metric versus non-metric, and strain-based (classical scaling) versus stress-based (distance scaling) MDS models.
As MDS is one of the most widely used multivariate techniques in data analysis, it is of course not possible to list all of its uses; among others, it is applied in image processing and as a pre-processing step for algorithms that rely on Euclidean distances.
dist_pol_ds_scaled <- dist(pol_ds_scaled)   # as input we need the distances between units (country-years)
mds3 <- cmdscale(dist_pol_ds_scaled, k = 2) # k - the maximum dimension of the space
plot(mds3, type = 'n')                      # empty plot; labels are added below
pointLabel(mds3, labels = pol_ds %>% mutate(year = as.character(year)) %>% pull(year), cex = 0.6, adj = 0.5)
Using the smacof package allows for different measurement scales of the data, such as ratio, interval, ordinal, and mspline.
dis2<-sim2diss(pol_ds_scaled_cor, method=1, to.dist=TRUE)
fit.data<-mds(dis2, ndim=2, type="ratio")
plot(fit.data, pch=21, cex=as.numeric(fit.data$spp), bg="red")
To measure how good the MDS solution is, the stress function can be used.
fit.data$stress
[1] 0.1027908
Reference values have been proposed for judging the goodness of fit; a stress of about 0.1, as in our case, indicates a fair result. Since stress should be minimized, different MDS types can be compared.
dis2<-sim2diss(pol_ds_scaled_cor, method=1, to.dist=TRUE)
fit.data<-mds(dis2, ndim=2, type="ordinal") # from smacof::
fit.data$stress
[1] 0.08331801
The smallest stress value is yielded by the ordinal method, a non-metric type of MDS in which only the order relations between the dissimilarities matter.
xxx.pca1<-prcomp(pol_ds_scaled, center=FALSE, scale.=FALSE)
Only the first two principal components are needed to explain 99% of the variance.
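This can be verified from the cumulative proportion of explained variance reported by summary() (a quick check, not part of the original output):

summary(xxx.pca1)$importance[3, 1:5]   # cumulative proportion of variance for the first five components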
xxx.pca2<-princomp(pol_ds_scaled) # stats::princomp()
plot(xxx.pca2)# the same will be plot(xxx.pca1)
fviz_pca_var(xxx.pca1, col.var="steelblue")
Based on this plot we can, for example, see that parreg_P4 is negatively correlated with v2x_divparctrl_VDEM.
fviz_eig(xxx.pca1, choice='eigenvalue') # eigenvalues on y-axis
fviz_eig(xxx.pca1) # percentage of explained variance on y-axis
var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
Rotated PCA makes the factors used in the analysis easier to interpret by changing the structure of the components: the components are rotated about the axes of the factors. The most commonly used option is varimax, which minimizes the number of variables needed to explain a given factor.
xxx.pca4<-principal(pol_ds_scaled, nfactors=2, rotate="varimax")
summary(xxx.pca4)
Factor analysis with Call: principal(r = pol_ds_scaled, nfactors = 2, rotate = “varimax”)
Test of the hypothesis that 2 factors are sufficient. The degrees of freedom for the model is 1709 and the objective function was 206.88 The number of observations was 89 with Chi Square = 13550.72 with prob < 0
The root mean square of the residuals (RMSA) is 0.04
plot(xxx.pca1)
plot(xxx.pca1, type = "l")
fviz_eig(xxx.pca1)
Uniqueness is the proportion of a variable’s variance that is not shared with the extracted components. For PCA we want it to be low, meaning the variable is well represented by the components and shares its variance with the other variables in the model; the notion is related to multicollinearity.
Complexity corresponds to the number of factors on which a variable has loadings noticeably greater than zero; if a variable loads strongly on only one factor, its complexity is close to 1. In other words, it summarizes how much all factors jointly contribute to a single variable. We want to keep it low, because higher complexity makes the factors more difficult to interpret.
set<-data.frame(complex=xxx.pca4$complexity, unique=xxx.pca4$uniqueness)
set.worst<-set[set$complex>1.8 & set$unique>0.78,]
set.worst %>% kable_head
[1] "
| complex | unique | |
|---|---|---|
| v2xel_elecpres_VDEM | 1.927795 | 0.8468625 |
| parreg_P4 | 1.954955 | 0.8248794 |
"
Combining complexity and uniqueness together we can spot variables that may lead to problems.
fviz_pca_ind(xxx.pca1, col.ind="cos2", geom="point", gradient.cols=c("white", "#2E9FDF", "#FC4E07" ))
Estimating the number of clusters in a dataset can be quite hard and may yield unexpected results. Domain experts may argue that 13 clusters in a dataset of 89 observations is surprisingly many. On the other hand, the gap statistic and the cluster-validation package used here can help with this task when no external knowledge is available.
The dataset was very useful for understanding problems of International Political Economy. The clustering results are promising: the groupings they propose suggest that Polish history, given its variability, is a very appealing topic to explore further.
Brock, Guy, Vasyl Pihur, Susmita Datta, and Somnath Datta. 2008. “ClValid: An R Package for Cluster Validation.” Journal of Statistical Software, Articles 25 (4): 1–22. https://doi.org/10.18637/jss.v025.i04.
Fidelis, Malgorzata. 2004. “Equality Through Protection: The Politics of Women’s Employment in Postwar Poland, 1945-1956.” Slavic Review 63 (2): 301–24. http://www.jstor.org/stable/3185730.
Graham, Benjamin A. T., and Jacob R. Tucker. 2016. “The International Political Economy Data Resource.” Harvard Dataverse. https://doi.org/10.7910/DVN/X093TV.
Mayhew, David R. 1991. “Divided Party Control: Does It Make a Difference?” PS: Political Science and Politics 24 (4): 637–40. http://www.jstor.org/stable/419392.
Sandberg, Mikael, and Per Lundberg. 2012. “Political Institutions and Their Historical Dynamics.” PLOS ONE 7 (10): 1–10. https://doi.org/10.1371/journal.pone.0045838.
Spence, J. E. 1995. “Two worlds of international relations: academics, practitioners and the trade in ideas.” International Affairs 71 (2): 359–60. https://doi.org/10.2307/2623442.
Staniak, Mateusz, and Przemysław Biecek. 2019. “The Landscape of R Packages for Automated Exploratory Data Analysis.” The R Journal 11 (2): 347. https://doi.org/10.32614/rj-2019-033.
Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2001. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society Series B 63 (February): 411–23. https://doi.org/10.1111/1467-9868.00293.