The aim of the study

My research is dedicated to identify significant groups of UK citizens by their political and economic preferences on the eve of Brexit. For this purpose, we will carry out cluster analysis. Were chosen variables connected with feeling closer to a particular party, voting in the last election, satisfaction with economy and democracy in country. I suppose that the commitment to the party will be connected with voting (or not) in the election and, moreover, it is interesting to know how satisfied British people with present state (economic and political situation) in their own country.

df <- read_spss("ESS7GB.sav") #2264 obs. of 601 variables

df1 <- df[, c(26, 58, 83, 85, 158)] #choose 5 variables

#It's interesting that we have NAs mostly in variables about the satisfacion with economy and democracy in country. I think, such questions may be sensitive what caused the lack of response. Also, there are few NAs in variable about voting (I guess, the same reason)
df1 <- na.omit(df1) #It becomes 2145 obs. instead of 2264. 
any(is.na(df1))

## [1] FALSE

df1 = dplyr::filter(df, ctzcntr == "1") #Now, we have only British citizens (2135 obs.)
df1 = dplyr::select(df1,"clsprty", "vote", "stfeco", "stfdem" )
feature_name <- c("close.party", "vote", "sat.econom", "sat.dem")
colnames(df1) <- feature_name
str(df1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2135 obs. of  4 variables:
##  $ close.party:Class 'haven_labelled'  atomic [1:2135] 2 2 1 2 2 2 1 2 2 1 ...
##   .. ..- attr(*, "label")= chr "Feel closer to a particular party than all other parties"
##   .. ..- attr(*, "labels")= Named num [1:5] 1 2 7 8 9
##   .. .. ..- attr(*, "names")= chr [1:5] "Yes" "No" "Refusal" "Don't know" ...
##  $ vote       :Class 'haven_labelled'  atomic [1:2135] 1 1 1 1 2 2 1 2 2 2 ...
##   .. ..- attr(*, "label")= chr "Voted last national election"
##   .. ..- attr(*, "labels")= Named num [1:6] 1 2 3 7 8 9
##   .. .. ..- attr(*, "names")= chr [1:6] "Yes" "No" "Not eligible to vote" "Refusal" ...
##  $ sat.econom :Class 'haven_labelled'  atomic [1:2135] 5 2 0 6 1 1 6 7 7 7 ...
##   .. ..- attr(*, "label")= chr "How satisfied with present state of economy in country"
##   .. ..- attr(*, "labels")= Named num [1:14] 0 1 2 3 4 5 6 7 8 9 ...
##   .. .. ..- attr(*, "names")= chr [1:14] "Extremely dissatisfied" "1" "2" "3" ...
##  $ sat.dem    :Class 'haven_labelled'  atomic [1:2135] 7 6 3 7 1 1 5 7 4 7 ...
##   .. ..- attr(*, "label")= chr "How satisfied with the way democracy works in country"
##   .. ..- attr(*, "labels")= Named num [1:14] 0 1 2 3 4 5 6 7 8 9 ...
##   .. .. ..- attr(*, "names")= chr [1:14] "Extremely dissatisfied" "1" "2" "3" ...

summary(df1)

##   close.party         vote         sat.econom        sat.dem      
##  Min.   :1.000   Min.   :1.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 3.000   1st Qu.: 3.000  
##  Median :1.000   Median :1.000   Median : 5.000   Median : 5.000  
##  Mean   :1.455   Mean   :1.333   Mean   : 4.626   Mean   : 5.081  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.: 6.000   3rd Qu.: 7.000  
##  Max.   :2.000   Max.   :3.000   Max.   :10.000   Max.   :10.000  
##  NA's   :8       NA's   :7       NA's   :36       NA's   :72

Variables

clsprty - Feel closer to a particular party than all other parties - nominal; 1 - Yes, 2 - No

vote - Voted last national election - nominal; 1 - Yes, 2 - No, 3 - Not eligible to vote

stfeco - How satisfied with present state of economy in country - ordinal; from 0 to 10 (from less satisfies to more).

stfdem - How satisfied with the way democracy works in country - ordinal; from 0 to 10 (from less satisfies to more).

df1$close.party = as.factor(df1$close.party)
df1$vote = as.factor(df1$vote)
df1$sat.dem = as.numeric(df1$sat.dem)
df1$sat.econom = as.numeric(df1$sat.econom)
df1 <- na.omit(df1)
str(df1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2030 obs. of  4 variables:
##  $ close.party: Factor w/ 2 levels "1","2": 2 2 1 2 2 2 1 2 2 1 ...
##  $ vote       : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 1 2 2 2 ...
##  $ sat.econom : num  5 2 0 6 1 1 6 7 7 7 ...
##  $ sat.dem    : num  7 6 3 7 1 1 5 7 4 7 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:105] 21 27 31 87 89 98 108 134 160 162 ...
##   .. ..- attr(*, "names")= chr [1:105] "21" "27" "31" "87" ...

summary(df1)

##  close.party vote       sat.econom        sat.dem      
##  1:1127      1:1456   Min.   : 0.000   Min.   : 0.000  
##  2: 903      2: 506   1st Qu.: 3.000   1st Qu.: 3.000  
##              3:  68   Median : 5.000   Median : 5.000  
##                       Mean   : 4.633   Mean   : 5.077  
##                       3rd Qu.: 6.000   3rd Qu.: 7.000  
##                       Max.   :10.000   Max.   :10.000

We have mixed type: 2 factor variables (close.party and vote) and 2 numeric (sat.econom and sat.dem).

Distance

According to our mixed type data, we use Gower distance and then -> PAM (partitioning around medoids)

gower_dist <- daisy(df1, 
                    metric = "gower",
                    type = list(logratio = 3))
summary(gower_dist)

## 2059435 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1667  0.4000  0.3999  0.5667  1.0000 
## Metric :  mixed ;  Types = N, N, I, I 
## Number of objects : 2030

Non-hierarchical clustering - PAM

Firstly, we determine the number of clusters usyng silhouette width.

gower_mat <- as.matrix(gower_dist)

sil_width <- c(NA)

for(i in 2:10){
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

plot(1:10, sil_width,
     xlab = "Number of clusters", xaxt='n',
     ylab = "Silhouette Width")
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)

According to the graph, to choose 4 cluster would be the best solution, it has the highest point of width.

Interpretation of clusters

pam_fit <- pam(gower_dist, diss = TRUE, k = 4)
pam_results <- df1 %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
pam_results$the_summary

## [[1]]
##  close.party vote      sat.econom        sat.dem          cluster 
##  1:  0       1:525   Min.   : 0.000   Min.   : 0.000   Min.   :1  
##  2:555       2:  0   1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.:1  
##              3: 30   Median : 5.000   Median : 5.000   Median :1  
##                      Mean   : 4.706   Mean   : 5.166   Mean   :1  
##                      3rd Qu.: 6.000   3rd Qu.: 7.000   3rd Qu.:1  
##                      Max.   :10.000   Max.   :10.000   Max.   :1  
## 
## [[2]]
##  close.party vote      sat.econom        sat.dem          cluster 
##  1:952       1:931   Min.   : 0.000   Min.   : 0.000   Min.   :2  
##  2:  0       2:  0   1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.:2  
##              3: 21   Median : 5.000   Median : 6.000   Median :2  
##                      Mean   : 4.888   Mean   : 5.513   Mean   :2  
##                      3rd Qu.: 7.000   3rd Qu.: 7.000   3rd Qu.:2  
##                      Max.   :10.000   Max.   :10.000   Max.   :2  
## 
## [[3]]
##  close.party vote      sat.econom        sat.dem          cluster 
##  1:  0       1:  0   Min.   : 0.000   Min.   : 0.000   Min.   :3  
##  2:348       2:338   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:3  
##              3: 10   Median : 4.000   Median : 4.000   Median :3  
##                      Mean   : 4.057   Mean   : 4.086   Mean   :3  
##                      3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.:3  
##                      Max.   :10.000   Max.   :10.000   Max.   :3  
## 
## [[4]]
##  close.party vote      sat.econom       sat.dem          cluster 
##  1:175       1:  0   Min.   :0.000   Min.   : 0.000   Min.   :4  
##  2:  0       2:168   1st Qu.:2.000   1st Qu.: 3.000   1st Qu.:4  
##              3:  7   Median :4.000   Median : 4.000   Median :4  
##                      Mean   :4.154   Mean   : 4.394   Mean   :4  
##                      3rd Qu.:6.000   3rd Qu.: 7.000   3rd Qu.:4  
##                      Max.   :9.000   Max.   :10.000   Max.   :4

Let’s look inside our clusters!

CLUSTER 1 (555 obj.): there are respondents not feeling closer to particular party and voting in the last election with the average levels of satisfaction with economy and democracy. However, comparing with others cluster these respondents seems more satisfied than others (esp. democracy).

CLUSTER 2 (952 obj.): it is the largest cluster including respondents feeling closer to particular party and voting in the last election. Here is observed the highest level of satisfaction with democracy and economy among others (esp. democracy).

CLUSTER 3 (348 obj.): it consists of respondents not feeling closer to particular party and not voting in the last election with below the average levels of satisfaction with economy and democracy (approx. at the same level). Among other clusters these respondents least satisfied both with ecomy amd democracy.

CLUSTER 4 (175 obj.): it is the smallest cluster inclusive respondents feeling closer to particular party and not voting in the last election. Their levels of satisfaction with economy and democracy is close to Cluster 3 - below the average.

According to that descriprion, I can suppose that Cluster 1, 2 contrasted with Cluster 3,4 by voting in the last election and levels of satisfaction with economy amd democracy. It is worth noting that I decide not removing from “vote” variable those respondents who are not eligible to vote (but they are still citizens of GB). It was interesting to me looking which cluster contains more such respondents. As a result, Cluster 1 and Cluster 2 (where the majority voted and levels of satisfaction higher than in other clusters) include more respondents not eligible to vote than Cluster 3 and Cluster 4.

Visualisation of clusters

library(Rtsne)
tsne_obj <- Rtsne::Rtsne(gower_dist, is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering))

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

According to the plot, there is no clear distinction on clusters. Even so, I can see that Cluster 2 is the biggest (as we said above) and scattered in large “dots” groups (around 5) locating in center. And the smallest Cluster 4 “linearly” located relatively close to each other.

Hierarchical clustering

Let’s start again with the identification of clusters’ number by creating a heatmap.

heatmap(as.matrix(gower_dist), symm = T,
        distfun = function(x) as.dist(x))

At first glance, it seems we have 4 big cluster like in PAM, however, there is one more small cluster which I think can also be included. Overall, we have 5 clusters by looking closer at the heatmap.

Before clustering we will find more optimal method by compute the Mantel criteria.

library(vegan)
hc_list <- list(hc1 <- hclust(gower_dist,"complete"), hc2 <- hclust(gower_dist,"average"), hc3 <- hclust(gower_dist, "ward.D2"))

Coph <- rbind(MantelStat <- unlist(lapply(hc_list, function (hc) mantel(gower_dist, cophenetic(hc))$statistic)), MantelP <- unlist(lapply(hc_list, function (hc) mantel(gower_dist, cophenetic(hc))$signif)))


colnames(Coph) <- c("Complete","Average","Ward.D2")

rownames(Coph) <- c("W","Р-value")

round(Coph, 3)

##         Complete Average Ward.D2
## W          0.864   0.870   0.845
## Р-value    0.001   0.001   0.001

An hour later… We got the table comparing various clustering methods. All methods are significant by p-value and according to the Mantel criteria (W) - “average” method is better for our clustering, its value more than others (0.87).

Visualisation via dendogram

hclust_avg <- hclust(gower_dist, method = 'average')
plot(as.dendrogram(hclust_avg), xlab = "", sub="", ylab = ,
     main = "Dendrogram")

cut_avg <- cutree(hclust_avg, k = 5)

Let’s see if it was a good decisions to make 5 clusters…

Interpretation of clusters

cl <- cutree(hclust_avg, k = 5)
df2 <- mutate(df1, cluster = cl) 
describeBy(df2, df2$cluster)

## 
##  Descriptive statistics by group 
## group: 1
##              vars   n mean   sd median trimmed  mad min max range  skew
## close.party*    1 525 2.00 0.00      2    2.00 0.00   2   2     0   NaN
## vote*           2 525 1.00 0.00      1    1.00 0.00   1   1     0   NaN
## sat.econom      3 525 4.66 2.07      5    4.71 2.97   0  10    10 -0.16
## sat.dem         4 525 5.09 2.34      5    5.14 2.97   0  10    10 -0.16
## cluster         5 525 1.00 0.00      1    1.00 0.00   1   1     0   NaN
##              kurtosis   se
## close.party*      NaN 0.00
## vote*             NaN 0.00
## sat.econom      -0.42 0.09
## sat.dem         -0.48 0.10
## cluster           NaN 0.00
## -------------------------------------------------------- 
## group: 2
##              vars   n mean   sd median trimmed  mad min max range  skew
## close.party*    1 931 1.00 0.00      1    1.00 0.00   1   1     0   NaN
## vote*           2 931 1.00 0.00      1    1.00 0.00   1   1     0   NaN
## sat.econom      3 931 4.89 2.18      5    4.98 2.97   0  10    10 -0.33
## sat.dem         4 931 5.49 2.48      6    5.59 2.97   0  10    10 -0.33
## cluster         5 931 2.00 0.00      2    2.00 0.00   2   2     0   NaN
##              kurtosis   se
## close.party*      NaN 0.00
## vote*             NaN 0.00
## sat.econom      -0.48 0.07
## sat.dem         -0.60 0.08
## cluster           NaN 0.00
## -------------------------------------------------------- 
## group: 3
##              vars   n mean   sd median trimmed  mad min max range  skew
## close.party*    1 338 2.00 0.00      2     2.0 0.00   2   2     0   NaN
## vote*           2 338 2.00 0.00      2     2.0 0.00   2   2     0   NaN
## sat.econom      3 338 4.06 2.24      4     4.1 2.97   0  10    10 -0.11
## sat.dem         4 338 4.12 2.49      4     4.1 2.97   0  10    10  0.02
## cluster         5 338 3.00 0.00      3     3.0 0.00   3   3     0   NaN
##              kurtosis   se
## close.party*      NaN 0.00
## vote*             NaN 0.00
## sat.econom      -0.66 0.12
## sat.dem         -0.63 0.14
## cluster           NaN 0.00
## -------------------------------------------------------- 
## group: 4
##              vars   n mean   sd median trimmed  mad min max range  skew
## close.party*    1 168 1.00 0.00      1    1.00 0.00   1   1     0   NaN
## vote*           2 168 2.00 0.00      2    2.00 0.00   2   2     0   NaN
## sat.econom      3 168 4.21 2.36      4    4.24 2.97   0   9     9 -0.04
## sat.dem         4 168 4.48 2.58      4    4.53 2.97   0  10    10 -0.06
## cluster         5 168 4.00 0.00      4    4.00 0.00   4   4     0   NaN
##              kurtosis   se
## close.party*      NaN 0.00
## vote*             NaN 0.00
## sat.econom      -0.67 0.18
## sat.dem         -0.89 0.20
## cluster           NaN 0.00
## -------------------------------------------------------- 
## group: 5
##              vars  n mean   sd median trimmed  mad min max range  skew
## close.party*    1 68 1.59 0.50      2    1.61 0.00   1   2     1 -0.35
## vote*           2 68 3.00 0.00      3    3.00 0.00   3   3     0   NaN
## sat.econom      3 68 4.84 2.02      5    4.86 2.22   0  10    10 -0.06
## sat.dem         4 68 5.65 2.14      6    5.75 1.48   1   9     8 -0.44
## cluster         5 68 5.00 0.00      5    5.00 0.00   5   5     0   NaN
##              kurtosis   se
## close.party*    -1.90 0.06
## vote*             NaN 0.00
## sat.econom      -0.48 0.24
## sat.dem         -0.43 0.26
## cluster           NaN 0.00

By hierarchical method were formed 5 clusters. Cluster 1-4 is the same with clusters resulting by PAM (even amount of objectives). Here I’m interesting in Cluster 5, let’s take a closer look at it:

CLUSTER 5 consists of 68 obj., so-called, private party :)

This cluster include respondents who are not eligible to vote - that’s great, because I want to find out more information about this group. Most of them don’t feel closer to a particular party than all other parties but there are some who does. Levels of their satisfaction with economy and democracy is higher than in other clusters (similar result only in Cluster 2) - average and above.

Visualisation of clusters

Let’s make our clustering more visual and colorful!

nodePar <- list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.7, col = "blue")
hclust_dend <- as.dendrogram(hclust_avg)
plot(hclust_dend, xlab = "", sub="", ylab = "Euclidean distance",
     main = "Dendrogram", nodePar = nodePar)
rect.dendrogram(hclust_dend , k=5, border="red")

avg_dend_obj <- as.dendrogram(hclust_avg)
hclust_dend2 <- color_branches(hclust_dend, h = 0.3)
plot(hclust_dend2)

Conclusion

Hierarchical vs Non-hierarchical

Having considered the clustered obtained using these methods, I can conclude that hierarchical clustering gives us more divided clusters. First of all, it is connected with groups of respondents who are not eligible to vote - in non-hierarchical were formed 4 clusters without a separate group of non-eligible to vote people. This could lead to “a mishmash” in clusters. Accordingly, I would suggest to choose Hierarchical method. Maybe, if we excluded its group from dataset, we would get 4 clusters by both methods and could use them equally well.

Project2

Sofia Andrienko

10 03 2019

The aim of the study

Variables

Distance

Non-hierarchical clustering - PAM

Interpretation of clusters

Visualisation of clusters

Hierarchical clustering

Visualisation via dendogram

Interpretation of clusters

Visualisation of clusters

Conclusion