1. Introduction

In Polish media we often hear the phrase “Poland A and B” (Polska A i B), referring to the political, cultural and economic division between the western and eastern parts of Poland. Usually, the areas west of the Vistula river are considered to be more developed and are therefore called Poland A. The eastern areas, on the other hand, are considered poorer and backward and are called Poland B. However, the exact shape of the border between the two is not commonly agreed upon, as some propose other divisions, such as northern and southern Poland [1], or a division into Poland A, B and C, where Poland C would be more developed than Poland B but still underperforming compared to Poland A [2].

Investigating the division into “Poland A and B” is an interesting problem as it highlights internal differences between regions of Poland. It might suggest possible areas for improvement, for example through resource redistribution.

In this article I would like to examine the division of voivodships into Poland A and Poland B using various clustering and dimension reduction methods. Although voivodships are not entirely internally homogeneous, using them provides a clearer division which is easier to interpret than analyzing, for example, powiats. Additionally, data for voivodships is more accessible.

Firstly, I will attempt to cluster voivodships without reducing the number of variables taken into account. To do this, I will use hierarchical agglomerative, hierarchical divisive, k-means and k-medoids clustering. Then, I will apply dimension reduction to the dataset using principal component analysis and multidimensional scaling. Finally, I will perform clustering on the dataset with a reduced number of variables.

2. Data description, preliminary analysis and clusterization method selection

The data used in this project comes from Bank Danych Lokalnych (Local Data Bank) by Główny Urząd Statystyczny (Statistics Poland). The dataset consists of 20 variables for all 16 voivodships in Poland.

Variables measure different aspects of social, cultural and economic development. They can be grouped into 3 categories:

- Social care and health care – 6 variables marked by x, in the following order: Doctors for 10k people, Places in nurseries for 1k children aged 0-3, No. of people for 1 hospital bed, Nurses for 10k people, Percentage of children aged 0-3 under institutional care, No. of beds in hospices and nursing homes for 100k people
- Economy – 9 variables marked by y, in the following order: Average monthly salary in the enterprise sector, Unemployment rate, Average monthly salary, Retail sale of goods per capita, Production sold in the industry sector per capita, Number of economic entities per 1k people, Agricultural production per 1 ha of farming area, Investments per capita in the private and public sector, Expenditure on innovative activities in enterprises per 1 employed person
- Culture and education – 5 variables marked by z, in the following order: No. of people per 1 cinema seat, Percentage of children attending primary schools, Matura exam pass rate, No. of university graduates, Percentage of foreign students

Some variables needed to be transformed so that higher values represent higher levels of development.
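
A minimal sketch of what such a transformation could look like, assuming var is the data frame of raw variables and using the column names adopted later in the report (e.g. x3 = people per hospital bed, y2 = unemployment rate, z1 = people per cinema seat); one simple option is to negate the “lower is better” variables:

# Hypothetical example: flip the sign of variables where lower values mean
# higher development, so that all variables point in the same direction
lower_is_better <- c("x3", "y2", "z1")
var[lower_is_better] <- -var[lower_is_better]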

For preliminary analysis I generated a boxplot for each variable.

# Outliers - based on boxplots
par(mfrow=c(5,4), mar = c(1.5,2,1.5,2))
for (i in 1:20){
  boxplot(var[, i])
}

par(mfrow=c(1,1))

We see that in most cases the interquartile range is wide, and some variables contain significant outliers.

I also calculated and visualized the Pearson correlation matrix for all variables.

# Correlation matrix
corr <- cor(var, method="pearson") 
# print(corr, digits=2)
corrplot(corr, order ="alphabet", tl.cex=0.6)

We see that the vast majority of variables are highly positively correlated with each other. This was expected, as they all measure the development level of a particular region, just in different areas. However, two pairs of variables: x2 (places in nurseries for 1k children aged 0-3) and x5 (percentage of children aged 0-3 under institutional care), as well as y1 (average monthly salary in the enterprise sector) and y3 (average monthly salary), are very highly correlated, with correlation coefficients of 0.959 and 0.978 respectively. This is because those variables are directly related and measure only slightly different phenomena. At the same time, there is no reason to say that one variable from each pair is better than the other at measuring development. Therefore, in the later part of the project, I conduct dimension reduction and then clustering.

Some variables, for example x3 and x4 with y4, y5 and y6, are negatively correlated, but only to a limited extent.
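
As a quick check of the strongest relationships, the short sketch below (assuming the corr matrix computed above, with the variable names x1 to z5 used throughout) lists the most highly correlated unique pairs; the x2/x5 and y1/y3 pairs mentioned above should appear at the top.

# Illustrative check: unique variable pairs sorted by correlation
corr_pairs <- as.data.frame(as.table(corr))
corr_pairs <- subset(corr_pairs, as.character(Var1) < as.character(Var2))
head(corr_pairs[order(-corr_pairs$Freq), ], 5)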

The Hopkins statistic for this data is approximately 0.40. It is close to 0.5, which would suggest that the data is random. As this is just a theoretical approach to the problem of clustering voivodships, I nevertheless decided to continue with my analysis.

hopkins(var_scaled, n=nrow(var_scaled)-1)
## $H
## [1] 0.4044163

One of the main characteristics of the dataset is that the number of observations is very small. Thanks to that, we can efficiently apply hierarchical clustering methods: agglomerative and divisive. Hierarchical clustering enables us to better explore the associations between different voivodships. Apart from hierarchical clustering, I also conducted k-means and k-medoids clustering. Since the dataset is small, running both is inexpensive, and because it contains some outliers that could influence the results, k-medoids, which is less sensitive to them, provides a useful comparison to k-means.

3. Clustering without reducing data dimensions

Firstly, I attempted to cluster the data without reducing its dimensionality, i.e. taking all 20 variables into account.

3.1 Hierarchical clustering

Hierarchical clustering seeks to build a hierarchy of clusters. There are two types of hierarchical clustering: agglomerative, in which each observation starts in its own cluster and clusters are successively merged; and divisive, in which all observations start in one cluster, which is successively split until each observation forms its own cluster.

I began my analysis with agglomerative hierarchical clustering using the AGNES algorithm. Among the different linkage methods, I obtained the highest agglomerative coefficients for Ward's and complete linkage, 0.59 and 0.61 respectively. As both of them provided very similar results (as shown in the comparison tanglegram), I continued my analysis with complete linkage.

# Dissimilarity matrix
d <- dist(var_scaled, method = "euclidean")

# Linkage methods with agglomerative coefficients from highest to lowest
# hieragg_single_wdr <- agnes(d, method = "single")
# coef(hieragg_single_wdr)
# hieragg_average_wdr <- agnes(d, method = "average")
# coef(hieragg_average_wdr)
hieragg_ward_wdr <- agnes(d, method = "ward")
paste("Agglomerative coefficient, Ward's method: ", coef(hieragg_ward_wdr))
## [1] "Agglomerative coefficient, Ward's method:  0.592768516060217"
hieragg_complete_wdr <- agnes(d, method = "complete")
paste("Agglomerative coefficient, complete method: ", coef(hieragg_complete_wdr))
## [1] "Agglomerative coefficient, complete method:  0.610560941020849"
# Complete and Ward linkage methods are quite similar
dend1 <- as.dendrogram(hieragg_complete_wdr)
dend2 <- as.dendrogram(hieragg_ward_wdr)
tanglegram(dend1, dend2)

In the case of hierarchical clustering, the choice of the number of clusters is more subjective. The elbow method did not provide any clear suggestion, while the silhouette method suggests 2 clusters. I would say that voivodships could be split into 2, 3 or 4 clusters.

fviz_nbclust(var_scaled, FUN = hcut, method = "wss")

fviz_nbclust(var_scaled, FUN = hcut, method = "silhouette")

We see that Mazowieckie alone stands out from the rest of the voivodships. However, the remaining voivodships are roughly clustered according to their geography: for example, Dolnośląskie, Łódzkie, Śląskie and Małopolskie, which form one of the subclusters, all lie in the south of Poland, near each other.

# Dendrogram
pltree(hieragg_complete_wdr, cex = 0.6, hang = -1, labels = as.vector(t(df[, 1])))
# pltree(hieragg_ward_wdr, cex = 0.6, hang = -1, labels = as.vector(t(df[, 1])))
rect.hclust(hieragg_complete_wdr, k = 2, border = 2:5)
rect.hclust(hieragg_complete_wdr, k = 3, border = 2:5)
rect.hclust(hieragg_complete_wdr, k = 4, border = 2:5)

Results from divisive hierarchical clustering using the DIANA algorithm differ from the agglomerative approach. Here, Dolnośląskie and Mazowieckie stand out from the rest, with the other voivodships being quite similar. The divisive coefficient in this case is 0.63.

hierdiv_wdr <- diana(var_scaled)
hierdiv_wdr$dc
## [1] 0.6296519
# Dendrogram
pltree(hierdiv_wdr, cex = 0.6, hang = -1, labels = as.vector(t(df[, 1])))

3.2 K-means

K-means is a clustering approach which aims to partition observations into clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
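
To illustrate what “nearest mean” means in practice, below is a minimal sketch (not part of the actual analysis) of a single k-means update on var_scaled: assign each voivodship to its nearest centroid, then recompute the centroids as cluster means. The kmeans() function used later iterates these two steps until convergence.

# One illustrative Lloyd iteration, assuming var_scaled from above
set.seed(1)
X <- as.matrix(var_scaled)
centroids <- X[sample(nrow(X), 2), ]                         # two random starting centers
d2c <- sapply(1:nrow(centroids), function(j) colSums((t(X) - centroids[j, ])^2))
assignment <- max.col(-d2c)                                  # nearest centroid for each observation
centroids <- apply(X, 2, function(col) tapply(col, assignment, mean))  # recompute cluster means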

In order to run k-means, I needed to determine the optimal number of clusters. To do this, I used elbow, silhouette and AIC criteria.

opt <- Optimal_Clusters_KMeans(var_scaled, max_clusters=8, plot_clusters = TRUE)

opt <- Optimal_Clusters_KMeans(var_scaled, max_clusters=8, plot_clusters=TRUE, criterion="silhouette")

opt <-Optimal_Clusters_KMeans(var_scaled, max_clusters=8, plot_clusters=TRUE, criterion="AIC")

Based on the plots, we can see that both 2 and 3 clusters seem optimal and are very similar in terms of clustering quality, with 3 clusters having only a slightly higher AIC. This is why I analyzed both options.

kmeans2wdr <- kmeans(var_scaled, 2)
kmeans3wdr <- kmeans(var_scaled, 3)

# Visualization of clusters

fviz_cluster(list(data=var_scaled, cluster=kmeans2wdr$cluster), stand=TRUE, ggtheme=theme_classic(), labelsize = 7, ellipse.type="convex")

paste("Calinski-Harabasz statistics for 2 clusters: ", round(calinhara(var_scaled, kmeans2wdr$cluster),digits=2) )
## [1] "Calinski-Harabasz statistics for 2 clusters:  6.14"
groupBWplot(var_scaled, kmeans2wdr$cluster, alpha=0.05) 

fviz_cluster(list(data=var_scaled, cluster=kmeans3wdr$cluster), stand=TRUE, ggtheme=theme_classic(), labelsize = 7, ellipse.type="convex")

paste("Calinski-Harabasz statistics for 3 clusters: ", round(calinhara(var_scaled, kmeans3wdr$cluster),digits=2)) 
## [1] "Calinski-Harabasz statistics for 3 clusters:  5.04"
groupBWplot(var_scaled, kmeans3wdr$cluster, alpha=0.05) 

The Calinski-Harabasz statistic is higher for two clusters than for three, which suggests it is the better solution. In both cases the geographical division into Poland ‘A’ and ‘B’ is rather non-existent, as the same cluster contains voivodships that are far away from each other, such as Podlaskie, Opolskie and Lubuskie. However, we see that for 2 clusters, the first cluster generally has higher quality-of-life indicators in most categories than the second. Similarly, for 3 clusters, the third cluster has the highest values for most variables.
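
For reference, the Calinski-Harabasz statistic compares between-cluster to within-cluster dispersion. A minimal sketch of its computation (assuming var_scaled and kmeans2wdr from above) is shown below; it should agree with fpc::calinhara() up to rounding.

# Illustrative Calinski-Harabasz index: between- vs within-cluster dispersion
ch_index <- function(X, clusters) {
  X <- as.matrix(X)
  n <- nrow(X); k <- length(unique(clusters))
  grand_mean <- colMeans(X)
  W <- 0; B <- 0
  for (g in unique(clusters)) {
    Xg <- X[clusters == g, , drop = FALSE]
    cg <- colMeans(Xg)
    W <- W + sum(sweep(Xg, 2, cg)^2)               # within-cluster sum of squares
    B <- B + nrow(Xg) * sum((cg - grand_mean)^2)   # between-cluster sum of squares
  }
  (B / (k - 1)) / (W / (n - k))
}
ch_index(var_scaled, kmeans2wdr$cluster)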

3.3. K-medoids

The k-medoids approach is similar to k-means, but instead of cluster means it uses actual data points (medoids) as cluster centers.

Firstly, I determined the optimal number of clusters using the elbow and silhouette criteria. Here again, both 2 and 3 clusters seemed reasonable, and their Calinski-Harabasz statistics are very close (4.37 vs 4.53). I analyzed the division into 2 clusters using the Partitioning Around Medoids (PAM) algorithm.

pam2wrd <- pam(var_scaled, 2) 
paste("Calinski-Harabasz statistics for 2 clusters: ", round(calinhara(var_scaled, pam2wrd$cluster),digits=2))
## [1] "Calinski-Harabasz statistics for 2 clusters:  4.37"
pam3wrd <- pam(var_scaled, 3) 
paste("Calinski-Harabasz statistics for 3 clusters: ", round(calinhara(var_scaled, pam3wrd$cluster),digits=2))
## [1] "Calinski-Harabasz statistics for 3 clusters:  4.53"
fviz_cluster(pam2wrd, stand=TRUE, ggtheme=theme_classic(), labelsize = 7, ellipse.type="convex")

groupBWplot(var_scaled, pam2wrd$cluster, alpha=0.05)

Similarly to agglomerative hierarchical clustering, we see that Mazowieckie stands out from the other voivodships. It has the highest values for most of the characteristics.
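
To illustrate what a medoid is, the short sketch below (assuming var_scaled and pam2wrd from above) finds, within the first cluster, the observation with the smallest total distance to the other members; it should typically coincide with one of the medoids reported by pam().

# Illustrative medoid: the cluster member closest (in total distance) to all others
D <- as.matrix(dist(var_scaled))
members <- which(pam2wrd$clustering == 1)
medoid <- members[which.min(rowSums(D[members, members, drop = FALSE]))]
rownames(var_scaled)[medoid]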

4. Data dimension reduction

As previously mentioned, many variables are highly correlated with each other. This suggests that they might carry similar information, so they do not all have to be included in the analysis. Dimension reduction makes interpretation easier and cleaner. Reducing data dimensions usually also provides other advantages, such as faster and less memory-intensive computations or a smaller risk of overfitting, but for this small dataset containing all voivodships these are negligible.

4.1 Principal components analysis

In Principal Component Analysis (PCA) we search for directions in space along which the data has the highest variance and project the data onto the subspace they span.
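
The same directions can be obtained directly from the eigendecomposition of the covariance matrix: the eigenvectors are the principal directions and the eigenvalues are the variances along them. A minimal sketch (assuming var_scaled, which is already centred and scaled) that should agree with the prcomp() results shown below:

# PCA via eigendecomposition of the covariance matrix (illustrative)
eig <- eigen(cov(var_scaled))
sqrt(eig$values[1:3])                              # should match the first standard deviations from prcomp()
scores <- as.matrix(var_scaled) %*% eig$vectors    # coordinates in the principal directions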

As the data I collected was already grouped into three categories, I decided to first run PCA within each category to obtain indicators of the voivodships' standing in each of them.

pca_social <- prcomp(var_scaled_social, center=FALSE, scale=FALSE)
summary(pca_social)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     1.6232 1.5061 0.72027 0.65287 0.35937 0.15047
## Proportion of Variance 0.4391 0.3780 0.08646 0.07104 0.02152 0.00377
## Cumulative Proportion  0.4391 0.8172 0.90366 0.97470 0.99623 1.00000
pca_social$rotation
##           PC1        PC2        PC3         PC4        PC5         PC6
## x1 -0.4639497  0.1832963 -0.8182963  0.07197208 -0.2232899  0.16280621
## x2 -0.2507300 -0.5924143  0.1008469  0.18721661  0.2338656  0.69732813
## x3 -0.3998673  0.3399471  0.3849302  0.74710507 -0.1133346 -0.07321323
## x4 -0.4677642  0.3598608  0.1164706 -0.36838783  0.7085873 -0.01805092
## x5 -0.2603048 -0.5847297 -0.1674468  0.17996752  0.2286029 -0.69111974
## x6 -0.5248136 -0.1687456  0.3611730 -0.48322467 -0.5729298 -0.06241078
pca_economy <- prcomp(var_scaled_economy, center=FALSE, scale=FALSE)
summary(pca_economy) 
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.368 1.1922 0.9780 0.65143 0.57006 0.34946 0.33203
## Proportion of Variance 0.623 0.1579 0.1063 0.04715 0.03611 0.01357 0.01225
## Cumulative Proportion  0.623 0.7809 0.8872 0.93434 0.97045 0.98402 0.99627
##                            PC8     PC9
## Standard deviation     0.15233 0.10184
## Proportion of Variance 0.00258 0.00115
## Cumulative Proportion  0.99885 1.00000
pca_economy$rotation
##            PC1         PC2         PC3        PC4         PC5         PC6
## y1 -0.39309577 -0.18806518  0.09607487 -0.3405268  0.21328961 -0.18223420
## y2  0.04376911 -0.64730064 -0.59623744  0.2256234  0.24618047  0.07497877
## y3 -0.39048136 -0.24923434  0.04025475 -0.2742869  0.09582903 -0.18984720
## y4 -0.33013021  0.30936724 -0.22256441  0.6478494 -0.25499704 -0.07963150
## y5 -0.36875592  0.30150769  0.05002838 -0.1865923  0.19516737  0.58145742
## y6 -0.37027736 -0.04001874  0.34579549  0.3488567  0.24932700 -0.50016377
## y7 -0.21115408  0.45518313 -0.64303980 -0.2042653  0.28605762 -0.18473223
## y8 -0.37639485 -0.23838081  0.12451009  0.2816289  0.07329481  0.54177993
## y9 -0.35036118 -0.17686731 -0.18175495 -0.2462449 -0.79513563 -0.05364450
##            PC7         PC8         PC9
## y1  0.02291072  0.09737089  0.77214671
## y2  0.32539101 -0.01327918 -0.01997157
## y3 -0.21638669  0.58469511 -0.52405729
## y4  0.04574552  0.45495018  0.21360065
## y5  0.58162115  0.03662676 -0.14136812
## y6  0.25681810 -0.43614398 -0.22696430
## y7 -0.29358283 -0.28944319 -0.08411049
## y8 -0.58075673 -0.26496948  0.01729367
## y9  0.13027655 -0.30975031 -0.06525284
pca_education <- prcomp(var_scaled_education, center=FALSE, scale=FALSE)
summary(pca_education)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.6816 0.9914 0.8687 0.56771 0.33548
## Proportion of Variance 0.5655 0.1966 0.1509 0.06446 0.02251
## Cumulative Proportion  0.5655 0.7621 0.9130 0.97749 1.00000
pca_education$rotation
##           PC1        PC2        PC3        PC4        PC5
## z1 -0.3712591  0.7308668 -0.2359847  0.2449178  0.4607894
## z2 -0.5236988 -0.1075940  0.2578878  0.6551143 -0.4674209
## z3 -0.4416311 -0.1357241  0.7002036 -0.3591791  0.4089576
## z4 -0.5163461  0.1431957 -0.3256979 -0.6082577 -0.4866469
## z5 -0.3553238 -0.6444619 -0.5305101  0.1088742  0.4063487

On the one hand, we obtain relatively easy to interpret indicators of quality of life in these categories. The first principal component in each category serves as a general indicator, while the second component might carry some additional meaning. For example, among the Social variables, x1, x3 and x4 refer to the quality of medical care, while x2, x5 and x6 refer to other social care, such as places in nurseries and care for the elderly. Therefore, PC2 might be interpreted as an indicator of medical care. In the case of Economy, PC2 indicates retail, agriculture and production sales. However, in the case of Culture and Education it is hard to interpret PC2.

On the other hand, we would need at least two principal components in each category to reach 70% of explained variance. This is why I also analyzed all variables together.

pca <- prcomp(var_scaled, center=FALSE, scale=FALSE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6    PC7
## Standard deviation     2.9695 1.8758 1.5698 1.16470 1.0421 0.98065 0.7322
## Proportion of Variance 0.4409 0.1759 0.1232 0.06783 0.0543 0.04808 0.0268
## Cumulative Proportion  0.4409 0.6168 0.7400 0.80787 0.8622 0.91025 0.9371
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.62080 0.56451 0.50756 0.36352 0.31202 0.22547 0.11327
## Proportion of Variance 0.01927 0.01593 0.01288 0.00661 0.00487 0.00254 0.00064
## Cumulative Proportion  0.95632 0.97225 0.98514 0.99174 0.99661 0.99915 0.99979
##                           PC15      PC16
## Standard deviation     0.06407 3.062e-15
## Proportion of Variance 0.00021 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00
fviz_eig(pca)

fviz_pca_var(pca)

pca$rotation
##             PC1          PC2          PC3         PC4         PC5         PC6
## x1 -0.208161784  0.344257212  0.057352999  0.01933787  0.13863593  0.01554364
## x2 -0.160262549 -0.137712880 -0.496397839 -0.09740127 -0.10517734  0.17465212
## x3  0.016417654  0.404824431 -0.045104352  0.02503175  0.36239511  0.45578724
## x4 -0.039554142  0.483544762 -0.004333687 -0.23077905  0.09638823 -0.10329318
## x5 -0.213647644 -0.145934680 -0.447973358  0.02008703 -0.03919355  0.09542025
## x6 -0.132014553  0.239205840 -0.376093834 -0.30938373  0.14906479 -0.03461371
## y1 -0.319048287 -0.005941777 -0.046894913  0.09105726  0.11502035 -0.05548036
## y2  0.005479046  0.373708661  0.051164847  0.49264823 -0.19683756 -0.12778745
## y3 -0.318637205  0.054589882 -0.036559038  0.09529697  0.07632138 -0.03426290
## y4 -0.231313024 -0.202939639  0.298473285  0.02762314 -0.16070561  0.13300319
## y5 -0.267355139 -0.215118694  0.107723905 -0.08788970  0.26072739  0.17407522
## y6 -0.272183784 -0.221607461 -0.067375728  0.19740066  0.02768390 -0.06888108
## y7 -0.138741858 -0.035223352  0.485623585 -0.15017152  0.12622732  0.35777719
## y8 -0.300087521 -0.032226827 -0.098146846  0.22868143  0.04154977  0.05381589
## y9 -0.291503963  0.102223714  0.120342064 -0.06983264 -0.01860286 -0.26206207
## z1 -0.240756283 -0.017169146  0.089545337  0.16912719  0.44381966 -0.24499473
## z2 -0.292352250 -0.043657999  0.102188288 -0.29534341 -0.20120393  0.19319667
## z3 -0.187735765  0.132060280  0.100649351 -0.51430550 -0.31112852 -0.29700740
## z4 -0.255504545  0.097995973  0.076373562  0.19107592 -0.22659550 -0.27337410
## z5 -0.159437335  0.255128604 -0.019153535  0.18980691 -0.49935774  0.45535785
##            PC7         PC8         PC9         PC10          PC11        PC12
## x1  0.39090349  0.32867087 -0.17532881  0.071745156 -0.5104560261 -0.11149574
## x2 -0.17824219  0.01456796 -0.19239217 -0.229085900 -0.0907814159 -0.06999434
## x3 -0.07631013 -0.09564390 -0.37650931  0.174313153  0.2645657654  0.07234191
## x4  0.12126554 -0.34373526  0.02583914 -0.164699244 -0.0226270547  0.29690483
## x5 -0.01984884 -0.07594964 -0.12563598 -0.154341827 -0.0516706774 -0.03305150
## x6 -0.28500064  0.03493476  0.49833329  0.160880193 -0.1378611324  0.26450579
## y1  0.21252482  0.25973803  0.10810671 -0.183993271  0.1934873189 -0.06853894
## y2 -0.17751660 -0.20543268  0.31759959 -0.428087050  0.0007376602 -0.10326165
## y3  0.28936091  0.08977338  0.09923037 -0.049948855  0.3501588658  0.17183255
## y4 -0.12216074 -0.42916355 -0.07464810  0.238089646 -0.2195262120  0.40145329
## y5 -0.15037848  0.06371276  0.31954211 -0.004542771  0.2637609674 -0.03158698
## y6  0.36525973 -0.12278428  0.08531730 -0.029864570 -0.2816145129  0.32352048
## y7 -0.08930015  0.11454347  0.08442013 -0.428460258 -0.0356236528  0.04186506
## y8  0.10099166 -0.42536120 -0.18506865  0.007044213  0.2334391373 -0.18366178
## y9  0.01091859 -0.12944303  0.16821481  0.501429592  0.1204043778 -0.41274014
## z1 -0.47634472 -0.05285133 -0.18324360 -0.078171013 -0.3557977002 -0.22756722
## z2 -0.04078635  0.04691692  0.05035547 -0.034039108 -0.1143808418 -0.17072184
## z3  0.01155773 -0.12857071 -0.27652257 -0.259391796  0.1190400903 -0.11791590
## z4 -0.35544146  0.44041126 -0.30316392  0.107628626  0.2108014823  0.42965623
## z5 -0.09264404  0.10067579  0.11906104  0.191714752 -0.1155442153 -0.14571478
##           PC13         PC14         PC15         PC16
## x1 -0.01888558 -0.125596307 -0.396926173  0.063894534
## x2 -0.10613674  0.248338433  0.032449381 -0.122155446
## x3  0.24134575  0.084354228  0.240480089  0.108388900
## x4 -0.09134081  0.285644179 -0.144165407 -0.432103934
## x5 -0.22436770  0.165274534 -0.161936580 -0.037169216
## x6 -0.11478220 -0.338674349  0.068163210  0.263284864
## y1  0.01438416  0.132399894 -0.010966766 -0.066295210
## y2  0.15937477 -0.017466933 -0.031755561 -0.007826619
## y3 -0.02653536 -0.075059556  0.230649676  0.030492125
## y4 -0.02014329  0.048039254 -0.212149755  0.012304098
## y5  0.38671933  0.213209760 -0.525377724  0.057036987
## y6  0.24100776  0.160912021  0.401165775  0.184873271
## y7 -0.53240712  0.007273013  0.171714991  0.123762056
## y8 -0.20863841 -0.556272334 -0.188822688  0.065091691
## y9 -0.31052762  0.327571582  0.173617485 -0.089105285
## z1  0.12021428  0.049982792  0.178359800  0.036622437
## z2  0.35182126 -0.369376756  0.251847608 -0.580324654
## z3  0.23160565  0.036872130 -0.003946114  0.491068491
## z4 -0.08584593 -0.057091655 -0.062955621 -0.134090053
## z5  0.03553606  0.185338427  0.033465343  0.210057723
df_pca <- pca$x[, 1:6]

Now we can see that we can obtain a higher share of explained variance with even fewer principal components. Unfortunately, these principal components are also harder to interpret. Only the first principal component might roughly correspond to general quality of life, as almost all variables have loadings of the same (negative) sign there.
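
As an illustration of this interpretation (and assuming var_scaled carries voivodship names as row names, as in the MDS plot later on), one could rank voivodships by the first component; since the PC1 loadings are almost all negative, the score is negated so that higher values mean higher development.

# Illustrative ranking by a "general development" score based on PC1
development_score <- -pca$x[, 1]
head(sort(development_score, decreasing = TRUE))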

Based on the scree plot, I would say that six principal components would be enough for further analysis. Together they explain over 91% of the total variance. However, as they are hard to interpret, I have chosen to instead proceed with six variables based on the categorical division.

4.2. Multidimensional Scaling

Multidimensional Scaling (MDS) transforms the dataset into one with a lower number of dimensions while preserving, as far as possible, the distances representing dissimilarities between points. To perform MDS, we calculate distances between all pairs of observations (in this case using the Euclidean metric) and then, based on those distances, compute new artificial coordinates (in this case in a 2D plane) and optimize their values.

#cor(t(var_scaled))
dist <- dist(var_scaled)
mds1 <- cmdscale(dist, k=2)
summary(mds1)
##        V1                V2         
##  Min.   :-4.3391   Min.   :-3.5608  
##  1st Qu.:-1.7635   1st Qu.:-1.1610  
##  Median :-0.8862   Median :-0.2315  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5989   3rd Qu.: 0.7920  
##  Max.   : 7.8773   Max.   : 4.1164
names <- rownames(var_scaled)
plot(mds1, type = "n")
text(mds1, labels=names, cex=0.5, adj=0.5)
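
As a simple check of how faithfully the 2D configuration preserves the original distances, one can look at the goodness-of-fit value returned by cmdscale() and at the correlation between original and reconstructed distances (a sketch, assuming dist and mds1 from above):

# Goodness of fit of the 2D MDS configuration
mds_fit <- cmdscale(dist, k = 2, eig = TRUE)
mds_fit$GOF
cor(as.vector(dist), as.vector(dist(mds_fit$points)))   # distance preservation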

Based on the MDS visualization, we see that all voivodships except Mazowieckie are rather grouped together. Wielkopolskie and Lubelskie might be on the periphery of that group. Mazowieckie is far away from the rest of the voivodships; the same result was obtained for k-medoids and agglomerative hierarchical clustering. Statistics of both groups, Mazowieckie and the rest of the voivodships, were analyzed before and indeed they differ, which MDS confirms. As the MDS coordinates do not carry any substantive meaning, I treat this as an additional analysis showing the differences between voivodships.

5. Clustering on dimension reduced data

I have also conducted clustering on the dataset with its dimensions reduced using Principal Component Analysis. I have chosen the first two principal components for every category (Economy, Social, and Culture and Education), obtaining 6 variables in total. I have chosen these because they have clearer interpretations than the principal components derived from conducting PCA on all variables at the same time.

5.1. Hierarchical clustering

Firstly, I checked what number of clusters would be optimal; based on the silhouette method, and partially the elbow method, that would probably be 2.

I ran agglomerative hierarchical clustering to see the structure of the clusters. Indeed, we obtained two clear clusters. This time, however, it is hard to relate them to any geographical division of Poland, as voivodships lying far away from each other (such as Opolskie and Podlaskie) end up in the same cluster. The agglomerative coefficient of this clustering is slightly higher than before, 0.64.

df_pca <- cbind(pca_social$x[, 1:2], pca_economy$x[, 1:2], pca_education$x[, 1:2])
colnames(df_pca) <- c("social1", "social2", "economy1", "economy2", "education1", "education2")

# Dissimilarity matrix
df_pca_scaled <- scale(df_pca)
d2 <- dist(df_pca_scaled, method = "euclidean")

fviz_nbclust(df_pca_scaled, FUN = hcut, method = "wss")

fviz_nbclust(df_pca_scaled, FUN = hcut, method = "silhouette")

# Linkage methods with agglomerative coefficiece from highest to lowest
hieragg_complete_pca <- agnes(d2, method = "complete")

pltree(hieragg_complete_pca, cex = 0.6, hang = -1, labels = as.vector(t(df[, 1])))

coef(hieragg_complete_pca)
## [1] 0.6444353

In the case of divisive clustering, this time Wielkopolskie stands out from the rest of the voivodships, and once again it is hard to point to a clear geographical division.

hierdiv_complete_pca <- diana(d2)
coef(hierdiv_complete_pca)
## [1] 0.6108398
pltree(hierdiv_complete_pca, cex = 0.6, hang = -1, labels = as.vector(t(df[, 1])))

5.2. K-means

As with the unprocessed data, it was hard to decide between two and three clusters, so I compared both.

opt <- Optimal_Clusters_KMeans(df_pca_scaled, max_clusters=8, plot_clusters = TRUE)

opt <- Optimal_Clusters_KMeans(df_pca_scaled, max_clusters=8, plot_clusters=TRUE, criterion="silhouette")

opt <-Optimal_Clusters_KMeans(df_pca_scaled, max_clusters=8, plot_clusters=TRUE, criterion="AIC")

It is hard to draw a geographical division around these clusters. However, we see that voivodships in the first cluster generally have higher indicators of quality of life. Wielkopolskie stands out as an outlier and a possible third cluster.

# Refit k-means on the PCA-reduced data (the kmeans objects from section 3.2 were fitted on the full data)
kmeans2pca <- kmeans(df_pca_scaled, 2)
kmeans3pca <- kmeans(df_pca_scaled, 3)

fviz_cluster(list(data=df_pca_scaled, cluster=kmeans2pca$cluster), stand=TRUE, ggtheme=theme_classic(), labelsize = 7, ellipse.type="convex")

groupBWplot(df_pca_scaled, kmeans2pca$cluster, alpha=0.05) 

fviz_cluster(list(data=df_pca_scaled, cluster=kmeans3pca$cluster), stand=TRUE, ggtheme=theme_classic(), labelsize = 7, ellipse.type="convex")

5.3. K-medoids

For k-medoids using PAM, I went with two and three clusters. For two clusters, we obtained the same division as with the k-means algorithm.

#opt <- Optimal_Clusters_Medoids(df_pca_scaled, max_clusters=8, plot_clusters=TRUE, 'euclidean', criterion="silhouette")
pam2wrd <- pam(df_pca_scaled, 2) 
fviz_cluster(pam2wrd, stand=TRUE, ggtheme=theme_classic(), labelsize = 7, ellipse.type="convex")

For three clusters, we obtained an interesting division into three clusters with a similar number of elements. However, it is hard to pinpoint clusters that are “better” or “worse”, as different clusters have higher statistics in different categories.

pam3wrd <- pam(df_pca_scaled, 3) 
fviz_cluster(pam3wrd, stand=TRUE, ggtheme=theme_classic(), labelsize = 7, ellipse.type="convex")

groupBWplot(df_pca_scaled, pam3wrd$cluster, alpha=0.05) 

6. Summary and conclusions

Based on the clustering analysis, it is hard to draw a clear geographical division between Poland’s voivodships which would correspond to ‘Poland A and B’. Dimension reduction provided a different view of the data. Depending on the algorithm, I obtained different divisions of voivodships, but the main conclusions remained similar to those from the non-processed data.

In the case of raw data without dimension reduction, in most cases we can see that Mazowieckie stands out from the other voivodships. This is probably due to it including Warsaw, Poland’s capital city, which is considerably more developed than the rest of the country. After dimension reduction, Wielkopolskie becomes a bigger outlier than Mazowieckie, which unfortunately I am not able to explain. Mazowieckie and Wielkopolskie being outliers is also confirmed by MDS. We also see that Mazowieckie is often grouped with industrialized voivodships such as Śląskie and Dolnośląskie, and with Małopolskie, which contains Kraków, Poland’s second biggest city.

As far as the rest of the voivodships are concerned, different clustering methods and algorithms propose different divisions. Some regions are clearly better in terms of economic, social and cultural (educational) development, but the resulting clusters do not correspond to the typical idea of ‘Poland A and B’, i.e. a division based on west/east or even north/south geography.

One has to take into account that this analysis also has its limitations. As mentioned before, voivodships are not entirely homogeneous, so, as in the case of Mazowieckie and Warsaw, certain cities within a region might be much more developed than the rest. Moreover, the variables chosen for this analysis and their categorical assignment were subjective, which could also influence the outcomes.

References

[1] https://www.gazetaprawna.pl/wiadomosci/artykuly/1071168,wojewodztwa-polnocne-podzial-na-wschod-i-zachod-polski.html
[2] https://gospodarka.dziennik.pl/news/artykuly/484389,nie-tylko-polska-a-i-polska-b-sciana-zachodnia-to-polska-c.html