Introduction

COVID-19 has exposed a long-running shortage of nurses and midwives that is putting patients at risk. Although the number of nurses in Poland has increased by 20,000 over the last 4 years, it is important to convert this data per 1,000 citizens. The rate per 1,000 inhabitants is now 5.2 in Poland, while the EU average is 9.4.
Therefore I find it interesting to clustering the number of nurses and midwives per 10,000 inhabitants to know which groups of examples are closely related. The purpose of this article is to test the hypothesis that identified groups in the data are not related to a geographic location. In this study I am going to use the unsupervised learning approach.
library(jpeg)
library(rasterImage)
library(readxl)
library(data.table)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(gridExtra)
library(ggthemes)
library(plotly)
library(stringr)
library(kableExtra)

Dataset

The data comes from the Central Statistical Office in Poland from 2006-2019.

Source:https://bdl.stat.gov.pl/BDL/dane/podgrup/temat.

head(t_data)%>%kbl() %>%kable_paper("hover")%>%scroll_box(width = "910px")
DOLNOŚLĄSKIE KUJAWSKO-POMORSKIE LUBELSKIE LUBUSKIE ŁÓDZKIE MAŁOPOLSKIE MAZOWIECKIE OPOLSKIE PODKARPACKIE PODLASKIE POMORSKIE ŚLĄSKIE ŚWIĘTOKRZYSKIE WARMIŃSKO-MAZURSKIE WIELKOPOLSKIE ZACHODNIOPOMORSKIE
X2006 58.2 53.6 62.1 55.9 57.4 56.3 57.1 48.7 58.2 60.2 51.9 64.0 57.3 52.0 52.2 50.4
X2007 60.0 55.1 65.6 55.0 58.2 58.9 61.3 49.6 60.2 63.9 52.5 66.0 58.6 53.4 52.1 52.6
X2008 61.0 56.3 66.4 55.7 60.1 60.1 62.5 52.2 63.2 65.6 52.7 66.2 60.0 56.8 53.0 52.0
X2009 61.0 58.1 67.0 57.4 62.2 63.4 67.8 58.2 64.6 67.5 53.5 69.7 63.3 58.5 50.5 54.5
X2010 60.5 58.2 67.3 55.9 60.0 63.6 63.5 56.6 65.9 66.6 51.3 69.8 64.7 55.8 51.4 55.2
X2011 62.0 57.5 67.1 57.9 63.3 65.0 67.6 57.8 67.3 66.7 53.0 73.0 66.0 60.2 50.6 54.8
cat("Number of voivodeships in the dataset:", nrow(data))
## Number of voivodeships in the dataset: 16
cat("Number of years used in the analysis:", ncol(data))
## Number of years used in the analysis: 14
summary(t_data) %>%
  kbl() %>%
  kable_paper("hover")%>%
   scroll_box(width = "910px")
DOLNOŚLĄSKIE KUJAWSKO-POMORSKIE LUBELSKIE LUBUSKIE ŁÓDZKIE MAŁOPOLSKIE MAZOWIECKIE OPOLSKIE PODKARPACKIE PODLASKIE POMORSKIE ŚLĄSKIE ŚWIĘTOKRZYSKIE WARMIŃSKO-MAZURSKIE WIELKOPOLSKIE ZACHODNIOPOMORSKIE
Min. :58.20 Min. :53.60 Min. :62.10 Min. :55.00 Min. :57.40 Min. :56.30 Min. :57.10 Min. :48.70 Min. :58.20 Min. :60.20 Min. :51.30 Min. :64.00 Min. :57.30 Min. :52.00 Min. :50.30 Min. :50.40
1st Qu.:61.00 1st Qu.:57.65 1st Qu.:67.03 1st Qu.:56.27 1st Qu.:60.62 1st Qu.:63.45 1st Qu.:64.53 1st Qu.:56.90 1st Qu.:64.92 1st Qu.:66.62 1st Qu.:52.77 1st Qu.:69.72 1st Qu.:63.65 1st Qu.:57.23 1st Qu.:51.10 1st Qu.:54.58
Median :64.80 Median :61.90 Median :73.65 Median :61.45 Median :65.40 Median :67.60 Median :72.65 Median :61.70 Median :71.60 Median :69.70 Median :54.75 Median :72.55 Median :70.35 Median :60.60 Median :52.35 Median :57.65
Mean :64.34 Mean :61.41 Mean :72.10 Mean :60.29 Mean :64.80 Mean :68.31 Mean :71.11 Mean :60.77 Mean :70.85 Mean :68.89 Mean :55.96 Mean :73.11 Mean :70.43 Mean :59.99 Mean :52.66 Mean :57.75
3rd Qu.:67.22 3rd Qu.:64.25 3rd Qu.:77.15 3rd Qu.:62.50 3rd Qu.:68.20 3rd Qu.:73.33 3rd Qu.:76.17 3rd Qu.:66.92 3rd Qu.:76.00 3rd Qu.:71.25 3rd Qu.:57.15 3rd Qu.:76.58 3rd Qu.:75.42 3rd Qu.:63.17 3rd Qu.:53.00 3rd Qu.:59.60
Max. :71.60 Max. :70.50 Max. :82.80 Max. :65.80 Max. :75.00 Max. :79.30 Max. :85.90 Max. :70.10 Max. :84.40 Max. :75.10 Max. :68.60 Max. :82.40 Max. :86.00 Max. :66.50 Max. :57.70 Max. :67.50
Summary statistics show that the results do not differ markedly. The difference between the voivodeships is a maximum of 20 nurses per 10,000 citizens. The lowest average can be seen in the pomorskie voivodeship. On the other hand, the swiętokrzyskie, mazowieckie and podkarpackie voivodeships have achieved the highest maximum values.

Hopkins statistic - test for clusterability of data

H0: the dataset is uniformly distributed (i.e., no meaningful clusters)
If Hopkins statistic < 0.5, then it is unlikely that dataset has statistically significant clusters. In other words, if the value of Hopkins statistic is close to 1, then we can reject the null hypothesis and conclude that the dataset is significantly a clusterable data.
get_clust_tendency(data,2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123) 
## $hopkins_stat
## [1] 0.7214608
## 
## $plot

In my case Hopkins statistic is equal to 0.72, so I conclude that the dataset is significantly a clusterable data.
In the plot above, similar objects are close to one another. Red color corresponds to small distance and blue color indicates big distance between observation.

Determining the optimum number of cluster

a1 <-fviz_nbclust(data, FUN = hcut, method = "wss" , linecolor = "darkblue") + labs(title= "WSS") 
a2 <-fviz_nbclust(data, FUN = hcut, method = "silhouette", linecolor = "darkblue")+ labs(title= "SILHOUETTE") 
gap_stat <- clusGap(data, FUN = hcut, nstart = 25, K.max = 10, B = 50)
a3 <-fviz_gap_stat(gap_stat, linecolor = "darkblue")+ labs(title= "GAP STATISTIC") 
grid.arrange(a1, a2, a3,   ncol=3)

According to statistics: wss and silhouette, the optimal number of clusters is two.
Now I will use the same function but with 3 different algorithms and same statistic - silhouette.I am particularly interested in the result PAM, because is suitable for small datasets. In addition PAM is more robust to outliers than K-means since it uses median instead of mean. I reject Clara, because my dataset set is too small.
a <- fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette", linecolor = "darkblue") + theme_classic() + labs(title= "K-MEANS") 
b <- fviz_nbclust(data, FUNcluster = pam, method = "silhouette", linecolor = "darkblue") + theme_classic() + labs(title= "PAM") 
c <- fviz_nbclust(data, FUNcluster = hcut, method = "silhouette", linecolor = "darkblue") + theme_classic() + labs(title= "HCUT") 
grid.arrange(a, b, c,   ncol=3)

According to statistics above, the optimal number of clusters is also two.

Visualising Clusters

cl_kmeans <- eclust(data, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE) 
km=fviz_cluster(cl_kmeans, data = data, elipse.type = "convex", palette="jco") + theme_minimal()+ labs(title= "K-MEANS") + theme_dark() + theme(panel.background = element_rect(fill = "#2D2D2D"))

cl_pam <- eclust(data, k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
pa=fviz_cluster(cl_pam, data = data, elipse.type = "convex", palette="jco") + theme_minimal()+ labs(title= "PAM") + theme_dark() + theme(
  panel.background = element_rect(fill = "#2D2D2D"))

grid.arrange(km, pa, ncol=2)

Thanks to the visualization of the groups, we can see that the grouped data corresponds to the geographical location of Poland. Well, the groups show that there is a difference between the eastern and western parts of the country.

Checking the quality of partitioning - Silhouette width

 k=fviz_silhouette(cl_kmeans, palette="jco")+ labs(title= "K-MEANS")  + theme(
    panel.background = element_rect(fill = "white"))
##   cluster size ave.sil.width
## 1       1    9          0.47
## 2       2    7          0.67
  p=fviz_silhouette(cl_pam, palette="jco")+ labs(title= "PAM")  + theme(
    panel.background = element_rect(fill = "white"))
##   cluster size ave.sil.width
## 1       1    9          0.47
## 2       2    7          0.67
  grid.arrange(k, p, ncol=2)

Average silhouette width is 0,56. For first and second cluster silhouette is 0,47 and 0,67 respectively.These values are the closest to one, so I conclude that observations are part of the right cluster.

Shadow statistics

Shadow statistics are very close to silhouette. The main difference between silhouette values and shadow values is that we replace average dissimilarities to points in a cluster by dissimilarities to centroids.
  d1<-cclust(data, 2, dist="euclidean") #flexclust:: for k-means
  shadow(d1) 
##         1         2 
## 0.3538258 0.5036630
  plot(shadow(d1))

Low shadow values (<=0,5) indicate clusters with good separation.

Hierarchical clustering

 hcw <- eclust(data, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method="ward.D2")
  plot(hcw, cex=0.5, hang=-1)
  rect.hclust(hcw, k=2, border='#CC9900')

The obtained results using Ward’s method are very similar. These results fully confirm the previous ones obtained with K-means and PAM.

Description of each Cluster

In this part, the plot visualizes the obtained clusters over the years 2006-2019.
c2 <- eclust(data,FUNcluster = "kmeans" ,k = 2,graph = FALSE)
  data_1_long <- reshape2::melt(datawykres)
  data_1_long <- merge(data_1_long,data.frame('nm' = names(c2$cluster),'cl' = c2$cluster),by.x='Nazwa',by.y='nm')
  names(data_1_long)[2] <- "Date"
  names(data_1_long)[3] <- "Value"
  
  ggplotly(data_1_long %>% ggplot() + geom_line(aes(y=Value,x=Date,col=cl)) + theme_bw())

Increasing the number of clusters by one

The statistics used in the previous section indicated that the optimal number of clusters is two. However, I believe it is worth analyzing the three groups and assessing whether the results relate to a geographic location.
cl_kmeans3 <- eclust(data, k=3, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE) 
km3=fviz_cluster(cl_kmeans3, data = data, elipse.type = "convex", palette="jco") + theme_minimal()+ labs(title= "K-MEANS") + theme_dark() + theme(panel.background = element_rect(fill = "#2D2D2D"))

cl_pam3 <- eclust(data, k=3, FUNcluster="pam", hc_metric="pearson", graph=FALSE)
pa3=fviz_cluster(cl_pam3, data = data, elipse.type = "convex", palette="jco") + theme_minimal()+ labs(title= "PAM") + theme_dark() + theme(panel.background = element_rect(fill = "#2D2D2D"))

grid.arrange(km3, pa3, ncol=2)

Now let’s visualize the data

Visualization of clusters on the map

The first image shows a map with two groups obtained, while the second one shows three groups that I wanted to visualize despite the statistics (wss,silthouette, k-means, pam).
Thanks to the visualization of the clusters, we can see that the grouped data corresponds to the geographical location of Poland.

Summary