Vaseva Cluster Analysis

On 23 June 2016 the UK public voted to leave the EU by a majority of about 52%. What factors influenced the Brexit vote? Public figures and reporters speculated about its drivers: that people felt “left behind”, that people felt uncomfortable with the social changes in Britain, and that people perceived an “us vs. them” divide between British people and immigrants (Vyver J., 2018). Research also shows that conservatism is associated with intolerance of ambiguity, need for closure, and uncertainty avoidance, which may also influence voting decisions (Webster, Kruglanski, 1994).

In this work I would like to identify groups of UK citizens by their political and economic preferences in 2014. What kind of people were likely to vote for Brexit, and what kind were likely to vote for “Bremain”? I would like to see how UK citizens group into clusters according to their position on the left–right political spectrum (left-wing, center, right-wing politics), their attitudes towards the prospect of further European integration, and their satisfaction with the way democracy works and with the present state of the economy in the UK.

library(haven)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(corrplot)

## corrplot 0.84 loaded

library(NbClust)
library(cluster)
library(magrittr)
library(factoextra)

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

library(reshape2)
library(ca)
library(FactoMineR)

data0 <- read_sav("ESS7GB.sav")
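data <- data0[, c("lrscale", "stfeco", "stfdem", "euftf")] # selecting the four variables (columns 81, 83, 85, 90)
data <- na.omit(data) # keep only respondents who answered all four questions
summary(data)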

##     lrscale           stfeco           stfdem           euftf       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.000   1st Qu.: 3.000   1st Qu.: 3.000   1st Qu.: 2.000  
##  Median : 5.000   Median : 5.000   Median : 5.000   Median : 4.000  
##  Mean   : 5.029   Mean   : 4.736   Mean   : 5.231   Mean   : 3.774  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.: 7.000   3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000

Selected variables:

1. lrscale - “Placement on left right scale” (ordinal)
2. stfeco - “How satisfied with present state of economy in country” (ordinal)
3. stfdem - “How satisfied with the way democracy works in country” (ordinal)
4. euftf - “European Union: European unification go further or gone too far” (ordinal)

All four questions use the same scale from 0 to 10, so no normalization is needed before clustering. There are also no outliers, because every value lies within this range.
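The claim about a common scale can be checked directly. The sketch below uses a small made-up data frame (the name `toy` and its values are hypothetical, not ESS data) and tests whether every column stays within 0 to 10; a variable measured on a different scale would call for something like `scale()` before computing distances.

```r
# Hypothetical toy responses on 0-10 scales (not ESS data)
toy <- data.frame(
  lrscale = c(0, 5, 10, 4),
  stfeco  = c(3, 5, 6, 2),
  stfdem  = c(7, 5, 3, 1),
  euftf   = c(0, 4, 5, 10)
)

# Range of each variable: row 1 holds the minimum, row 2 the maximum
ranges <- sapply(toy, range)
print(ranges)

# TRUE only if every variable lies within the common 0-10 scale
same_scale <- all(ranges[1, ] >= 0 & ranges[2, ] <= 10)
print(same_scale)
```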

When building clusters for the first time, I also included variables about closeness to a particular political party and attitudes towards migrants:

5. prtclbgb - “Which party feel closer to, United Kingdom”
6. imdfetn - “Allow many/few immigrants of different race/ethnic group from majority”

However, the resulting clusters overlapped strongly, and I wanted the clusters to be separated as much as possible on our data, so these two variables are not used below.

ggplot() + geom_bar(data = data, aes(x = lrscale))

## Don't know how to automatically pick scale for object of type haven_labelled. Defaulting to continuous.

According to the graph above, the majority of the UK population identify themselves as centrist (or they simply do not identify with left- or right-wing politics and therefore choose the neutral option).

ggplot() + geom_bar(data = data, aes(x = stfeco))

## Don't know how to automatically pick scale for object of type haven_labelled. Defaulting to continuous.

ggplot() + geom_bar(data = data, aes(x = stfdem))

## Don't know how to automatically pick scale for object of type haven_labelled. Defaulting to continuous.

Approximately equal shares of the population are satisfied and dissatisfied with the way democracy works and with the economy in the country.

ggplot() + geom_bar(data = data, aes(x = euftf))

## Don't know how to automatically pick scale for object of type haven_labelled. Defaulting to continuous.

We can see from the plot above that more UK citizens think that European unification has gone too far than think it should go further. A large proportion of respondents answered “0”, the extreme end of the scale, which means they are firmly convinced that unification should not continue.

cor_mat <- cor(data) # correlation matrix (named cor_mat so that it does not mask base::c)
corrplot(cor_mat, order = "hclust")

Given the wording of the questions, some of the variables could be expected to correlate. Indeed, “stfdem” and “stfeco” are correlated, but the coefficient is no more than 0.7, so it is acceptable. The correlation among our variables is therefore not strong enough to be a problem for building clusters.
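Because the items are ordinal, a rank-based coefficient is one possible robustness check alongside the default Pearson correlation. The sketch below runs on simulated 0 to 10 responses (the data frame `sim` and its columns are hypothetical, not ESS data) and lists any pair whose absolute Spearman correlation exceeds the 0.7 threshold used above.

```r
set.seed(1)
# Simulated 0-10 responses (hypothetical, not ESS data); y is built to follow x
x <- sample(0:10, 200, replace = TRUE)
y <- pmin(pmax(x + sample(-3:3, 200, replace = TRUE), 0), 10)
sim <- data.frame(x = x, y = y)

pearson  <- cor(sim, method = "pearson")
spearman <- cor(sim, method = "spearman")  # rank-based, suited to ordinal items

# Variable pairs (upper triangle only) above the 0.7 threshold
high_pairs <- which(abs(spearman) > 0.7 & upper.tri(spearman), arr.ind = TRUE)

print(round(pearson, 2))
print(round(spearman, 2))
print(high_pairs)
```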

All our variables are ordinal 0–10 scales, which I treat as numeric, so I use a Euclidean distance matrix.

eucl <- dist(data, method = "euclidean")
fviz_dist(eucl)

Euclidean distance is a measure of the distance between objects. With the default fviz_dist color gradient, red cells mark small distances between pairs of observations, while blue cells mark the most distant pairs of citizens. The matrix is very big, so it is hard to say anything about distances between observations or about possible future clusters.
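To make the distance measure concrete, here is a minimal worked example for two respondents. The two answer profiles `a` and `b` are made up for illustration; the point is only that `dist()` reproduces the textbook formula, the square root of the summed squared differences.

```r
# Two hypothetical respondents (values are made up, not ESS data)
a <- c(lrscale = 4, stfeco = 3, stfdem = 5, euftf = 2)
b <- c(lrscale = 6, stfeco = 7, stfdem = 5, euftf = 0)

# Euclidean distance by definition
by_hand <- sqrt(sum((a - b)^2))

# The same distance via dist(), which compares the rows of a matrix
via_dist <- as.numeric(dist(rbind(a, b), method = "euclidean"))

print(by_hand)
print(via_dist)
```

The differences are -2, -4, 0 and 2, so the squared differences sum to 24 and both results equal sqrt(24).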

heatmap(as.matrix(eucl), symm = T, distfun = function(x) as.dist(x))

It looks like we could retain 5 clusters here, but the next method says that the best number of clusters is 2 (for hierarchical clustering).

set.seed(11)
nbclust <- data %>% NbClust(distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index ="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 4 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 6 proposed 7 as the best number of clusters 
## * 3 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

According to the majority rule, the best number of clusters is 2, although 7 clusters received the same number of votes (six indices each).
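The vote counts in the NbClust summary can be tallied by hand. The sketch below copies the counts from the output above into a named vector (the names `votes`, `winner` and `tied` are mine); it only illustrates the majority rule and is not NbClust's internal code.

```r
# Votes copied from the NbClust summary: name = number of clusters, value = indices voting for it
votes <- c("2" = 6, "3" = 4, "5" = 4, "6" = 1, "7" = 6, "10" = 3)

# which.max() returns the first maximum, so a tie resolves to the smaller k here
winner <- names(which.max(votes))

# All candidates that share the top vote count
tied <- names(votes)[votes == max(votes)]

print(winner)
print(tied)
```

Both 2 and 7 clusters collect six votes, so the choice of 2 is less clear-cut than the headline conclusion suggests.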

Hierarchical clustering

# First, calculate the distance matrix with dist(), treating the 0-10 ordinal scales as numeric.
# Then compute hierarchical clustering. Ward’s minimum variance criterion minimizes the total within-cluster variance.
hier <- data %>%
  dist(method = "euclidean") %>% 
  hclust(method = "ward.D2") 
cl <- cutree(hier, k = 2)
data3 <- mutate(data, cluster = cl) # added cluster membership as a column to the data frame
data3 %>%
  group_by(cluster) %>% 
  summarise_all(mean) # mean values of all variables by cluster (funs() is deprecated in dplyr)

## # A tibble: 2 x 5
##   cluster lrscale stfeco stfdem euftf
##     <int>   <dbl>  <dbl>  <dbl> <dbl>
## 1       1    3.93   3.43   3.83  3.82
## 2       2    5.88   5.75   6.32  3.74

fviz_dend(hier, k = 2, # Cut in 2 groups
          cex = 0.5, # label size
          k_colors = c("#009999", "#0000FF"),
          color_labels_by_k = TRUE, # color labels 
          rect = TRUE) # rectangle around

## Warning in data.frame(xmin = unlist(xleft), ymin = unlist(ybottom), xmax =
## unlist(xright), : row names were found from a short variable and have been
## discarded

We have got 2 clusters (drawn in teal and blue in the dendrogram). The first cluster consists of people who consider themselves more to the left in politics (mean placement 3.93). They are not very satisfied with the present state of the economy or with the way democracy works in the UK. As for attitudes towards political Europe, people in this cluster lean towards the view that European unification has already gone too far (mean 3.82, below the midpoint of 5). The second cluster consists of citizens further to the right (mean placement 5.88). In British politics the party considered center-right is the Conservatives, which led the government in 2014, and this is possibly the reason why people in this cluster are quite satisfied with both the present state of the economy and the way democracy works. As in the first cluster, these citizens tend to think that European unification has already gone too far rather than that it should go further (mean 3.74), so the two clusters barely differ on this question.

fviz_nbclust(data, kmeans, method = "gap_stat")

The optimal number of clusters for k-means is 3.
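The gap statistic compares the clustering against randomly generated reference data, so the suggested number of clusters can vary between runs unless a seed is set. As a rough sketch of the underlying computation, the example below calls `clusGap()` and `maxSE()` from the cluster package on simulated data; the data, `K.max`, `B` and the selection rule `"firstSEmax"` are my assumptions and not necessarily the settings used in the plot above.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(11)  # fixes the random reference data sets
# Simulated data with two well-separated groups (hypothetical, not ESS data)
sim <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

# Gap statistic for k = 1..5 using B = 20 reference data sets
gap <- clusGap(sim, FUNcluster = kmeans, K.max = 5, B = 20,
               nstart = 10, verbose = FALSE)

# Choose k with one of the rules implemented in maxSE()
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

print(gap$Tab)
print(best_k)
```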

set.seed(11)
km <- kmeans(data, 3, nstart = 25)
# Visualize
fviz_cluster(km, data = data,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())

km$cluster

##    [1] 3 2 2 1 2 3 1 1 3 1 2 3 1 2 1 1 1 1 2 1 1 1 2 3 3 1 3 2 1 1 2 1 2 1
##   [35] 3 3 2 1 2 2 2 3 3 1 2 1 2 1 1 3 3 3 3 2 2 3 1 1 1 1 1 2 3 2 3 1 1 2
##   [69] 3 3 3 1 3 1 1 3 3 1 1 2 3 1 1 2 2 1 2 3 2 2 1 1 3 3 1 3 1 3 1 3 2 1
##  [103] 2 1 1 3 1 2 3 2 3 2 2 3 2 2 1 2 3 3 3 2 2 2 2 1 1 3 3 2 1 1 3 2 3 1
##  [137] 3 2 1 3 1 2 3 3 2 3 1 3 1 3 1 3 3 3 2 1 2 3 1 1 2 1 2 1 1 2 1 3 3 2
##  [171] 3 3 2 1 3 2 1 1 1 1 2 3 2 3 1 3 3 3 1 1 2 3 3 2 3 1 1 1 2 2 3 2 1 3
##  [205] 3 1 2 1 1 2 2 2 1 3 1 3 3 1 3 3 3 3 1 1 2 1 3 1 2 2 1 2 1 1 1 1 2 2
##  [239] 3 1 1 1 2 2 1 2 2 3 3 2 1 1 1 1 2 1 1 2 2 1 2 3 1 3 2 1 1 3 2 1 1 2
##  [273] 1 3 2 1 2 3 2 1 2 3 2 3 1 1 2 1 1 1 3 2 3 1 3 1 1 1 2 3 2 3 3 2 2 2
##  [307] 3 2 3 3 3 1 3 1 1 1 1 3 2 1 3 1 2 3 3 1 1 3 2 2 3 2 2 2 3 3 2 3 3 1
##  [341] 3 2 1 2 3 1 1 3 1 2 2 2 3 1 2 3 1 2 3 2 2 1 1 1 1 3 1 3 1 3 3 3 2 1
##  [375] 3 3 2 1 1 3 2 1 1 2 1 3 2 3 3 2 3 2 3 3 3 3 2 2 1 1 3 1 1 1 2 2 2 1
##  [409] 1 3 1 3 3 1 1 1 2 2 2 1 2 1 2 1 1 3 3 1 1 3 3 3 1 1 1 1 3 3 1 1 1 1
##  [443] 1 3 2 2 1 2 2 3 3 3 1 2 1 1 3 1 2 2 3 1 2 3 1 1 3 3 2 1 1 1 3 1 1 1
##  [477] 1 1 1 1 3 3 1 2 1 2 1 2 2 2 1 1 3 2 1 2 1 1 3 1 1 3 1 3 2 2 2 3 1 2
##  [511] 1 2 1 1 2 1 2 1 1 3 2 3 1 2 2 3 1 3 1 1 1 2 3 2 2 3 3 3 3 1 2 2 1 3
##  [545] 1 2 2 3 3 2 2 1 2 1 3 3 1 3 3 1 2 3 1 2 3 1 3 3 1 1 1 2 3 3 2 2 2 1
##  [579] 3 2 2 1 2 1 3 1 1 2 2 2 1 2 1 1 2 1 2 3 1 1 3 1 1 3 2 3 3 3 3 3 1 2
##  [613] 1 3 1 1 2 1 2 2 2 1 2 3 1 1 3 2 2 3 1 1 1 3 1 3 1 1 2 2 1 3 3 1 3 1
##  [647] 2 1 1 2 3 1 2 2 1 1 1 1 2 1 3 3 2 1 1 2 2 1 2 1 2 1 3 2 2 2 2 2 2 1
##  [681] 1 2 1 2 1 2 3 1 1 2 2 3 3 3 3 1 1 1 3 1 3 3 2 2 1 1 3 1 2 3 2 1 2 2
##  [715] 2 1 3 3 1 1 1 2 1 2 3 2 3 1 2 3 2 2 1 1 1 1 2 1 2 2 2 3 2 3 1 2 3 1
##  [749] 1 2 3 2 1 1 1 2 1 2 1 2 3 3 2 2 1 2 2 3 2 1 1 1 1 2 2 2 1 3 3 1 3 1
##  [783] 2 1 1 3 2 2 2 2 2 3 3 1 2 3 1 3 2 3 2 3 1 2 2 3 2 1 1 2 2 1 3 1 2 1
##  [817] 1 3 3 3 2 1 2 2 1 2 3 3 3 3 2 2 1 3 3 1 1 1 2 3 1 2 3 1 1 1 3 2 1 2
##  [851] 2 2 3 1 3 2 2 2 3 2 1 2 1 2 1 1 2 1 3 1 3 1 1 1 3 2 1 1 3 2 1 1 2 2
##  [885] 3 3 1 1 1 3 1 2 3 3 1 1 1 2 2 3 1 3 3 1 2 1 2 1 2 1 2 2 3 3 3 1 3 3
##  [919] 1 2 2 1 2 1 1 1 3 1 1 3 1 1 2 2 1 3 3 1 1 1 3 1 2 3 2 1 2 2 3 1 2 2
##  [953] 2 1 3 1 3 2 2 1 3 3 3 3 2 2 2 2 2 2 1 3 1 1 2 3 1 2 1 3 1 1 3 3 1 2
##  [987] 3 3 1 3 3 1 1 3 1 1 3 2 2 3 3 1 1 3 1 3 1 2 2 1 2 1 2 3 3 1 1 3 3 1
## [1021] 3 3 1 1 2 1 1 2 1 3 1 3 1 2 1 2 2 3 2 1 2 2 3 2 1 1 1 3 1 2 2 1 2 3
## [1055] 3 2 1 1 2 3 3 2 1 3 2 3 2 1 3 1 1 2 2 2 2 2 1 2 1 1 1 2 3 3 1 2 1 2
## [1089] 1 2 1 1 1 1 3 2 1 1 3 1 1 2 1 3 2 2 3 1 2 1 3 3 1 1 3 2 3 1 1 3 3 1
## [1123] 3 2 3 1 1 2 3 1 3 3 3 3 3 3 1 3 1 3 3 3 3 1 1 2 2 2 3 1 1 1 3 3 3 1
## [1157] 2 1 1 3 3 1 2 1 1 3 3 1 1 1 1 3 1 1 2 3 2 3 3 2 3 3 1 2 3 3 3 2 2 2
## [1191] 3 3 1 1 2 2 3 1 1 3 2 1 1 3 2 2 2 1 3 3 3 3 3 2 1 3 1 3 1 1 2 1 3 2
## [1225] 1 2 3 2 2 2 2 3 1 2 1 1 1 1 1 2 3 2 3 3 3 3 3 1 2 2 3 3 2 3 3 2 1 2
## [1259] 3 1 1 1 2 3 1 2 3 2 1 2 1 3 3 1 1 2 1 2 3 2 3 1 2 1 2 2 1 3 3 3 1 3
## [1293] 3 3 3 3 3 1 1 3 2 3 3 3 1 3 2 2 1 2 2 2 1 3 2 1 3 1 2 1 3 1 2 3 1 3
## [1327] 2 3 2 2 1 3 2 1 3 1 1 1 3 2 2 2 1 1 2 1 1 3 1 1 1 2 1 3 2 3 2 2 1 1
## [1361] 2 1 2 2 1 3 3 1 1 1 3 3 1 2 1 2 1 1 1 3 2 3 1 1 3 1 1 2 1 2 3 1 2 3
## [1395] 3 1 1 2 2 1 1 1 1 3 2 2 3 2 3 3 1 3 2 1 3 1 1 2 1 1 1 1 1 3 2 3 3 2
## [1429] 2 3 2 1 3 1 2 2 1 2 1 1 2 2 3 1 2 1 1 2 3 3 2 1 1 1 3 3 1 2 2 1 1 1
## [1463] 2 3 1 2 1 3 1 1 1 3 3 2 1 3 1 1 1 1 1 1 2 2 3 2 1 2 1 2 1 1 3 2 2 3
## [1497] 1 1 1 3 2 2 3 3 3 2 2 3 1 3 1 3 3 1 2 2 2 2 1 1 3 1 1 3 3 1 2 1 3 1
## [1531] 3 3 3 2 1 1 3 2 2 3 1 3 1 2 3 1 2 1 2 3 3 2 2 2 3 3 3 3 3 3 2 3 2 3
## [1565] 3 1 1 3 1 2 1 3 1 3 1 1 2 1 3 3 2 1 3 1 1 1 3 1 2 3 3 2 2 3 1 2 2 2
## [1599] 3 1 1 1 2 3 3 1 3 3 2 2 2 1 1 3 3 1 1 3 1 3 1 1 3 3 3 1 1 2 2 2 2 3
## [1633] 2 2 2 1 3 3 3 3 1 2 2 1 2 2 3 1 1 2 1 1 2 3 2 2 3 2 1 2 1 1 1 2 2 3
## [1667] 1 3 1 3 1 1 1 2 1 2 1 1 1 1 2 1 1 2 2 2 2 3 3 1 2 1 1 3 1 1 2 3 2 2
## [1701] 2 1 1 2 3 1 2 2 2 2 1 3 3 2 2 1 1 1 2 2 2 2 1 2 1 2 3 1 2 2 1 3 2 2
## [1735] 1 2 1 2 1 3 2 1 1 1 1 3 3 1 2 3 3 1 1 3 3 3 1 1 2 3 2 1 1 1 1 1 1 1
## [1769] 2 1 2 3 2 3 1 2 2 1 3 1 1 1 1 2 3 2 3 1 2 1 1 1 1 1 1 1 1 2 3 3 1 1
## [1803] 1 3 2 2 1 3 1 3 1 2 1 1 2 3 1 2 3 1 2 2 3 1 2 2 3 3 1 3 1 2 1 1 2 1
## [1837] 1 3 2 2 1 2 2 3 3 2 1 1 1 1 1 2 2 1 3 3 2 2 2 1 2 3 2 2 1 2 3 2 1 2
## [1871] 1 1 3 1 2 2 3 2 2 2 3 3 3 3 3 3 1 1 3 1 1 1 1 1 1 2 3 3 1 1 1 2 2 3
## [1905] 2 1 2 1 3 3 3 2 1 1 3 2 3

table(km$cluster) #How many observations are in each cluster

## 
##   1   2   3 
## 752 594 571

aggregate(data,by=list(km$cluster),FUN=mean) # get cluster means

##   Group.1  lrscale   stfeco   stfdem    euftf
## 1       1 4.980053 5.597074 6.429521 5.720745
## 2       2 3.893939 2.644781 2.622896 3.346801
## 3       3 6.274956 5.777583 6.364273 1.654991

clusplot(data, km$cluster, main='2D representation of the Cluster solution',
         color=TRUE, shade=TRUE,
         labels=2, lines=0)

Cluster definition

In k-means we have got 3 clusters. The blue cluster (the biggest one) consists of people with center political views. They are quite satisfied with how democracy works in the country and with the economic situation, and on balance they favor further European unification (mean 5.72).

The second, yellow group holds the opposite views about democracy and the present state of the economy in the country: they are not satisfied with how democracy works or with the UK economy. This group is on the left wing, and they have rather negative attitudes towards the prospect of further European integration.

The third, grey cluster is characterized by a firm conviction that European unification has gone too far. As for the political and economic situation in the country, they are quite satisfied with how democracy works and with the state of the economy in the UK. People in this cluster have more right-wing views in politics.

In my opinion, k-means divided the respondents into clusters better. Its clusters are more distinct and make more sense than those from hierarchical clustering, which proposed two generalized clusters. Therefore, I settled on the second method and will use it in the conclusions.
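The preference for k-means above rests on interpretability. One possible numeric complement is the average silhouette width, which can be computed for both solutions with `silhouette()` from the cluster package. The sketch below runs on simulated data (hypothetical, not ESS data), and the helper `avg_sil` is a name introduced only for this illustration.

```r
library(cluster)  # provides silhouette()

set.seed(11)
# Simulated data with three groups (hypothetical, not ESS data)
sim <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2),
             matrix(rnorm(60, mean = 8), ncol = 2))
d <- dist(sim, method = "euclidean")

# Two candidate solutions: Ward hierarchical (k = 2) and k-means (k = 3)
hc_cl <- cutree(hclust(d, method = "ward.D2"), k = 2)
km_cl <- kmeans(sim, centers = 3, nstart = 25)$cluster

# Average silhouette width: values closer to 1 indicate better-separated clusters
avg_sil <- function(cl) mean(silhouette(cl, d)[, "sil_width"])
result <- c(hierarchical_k2 = avg_sil(hc_cl), kmeans_k3 = avg_sil(km_cl))

print(round(result, 3))
```

On the real data the same two calls would take `cl` from the hierarchical solution and `km$cluster` from k-means, with the distance matrix `eucl`.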

Correspondence analysis

Correspondence analysis works only with categorical variables, which is why I decided to recode my variables. I select 2 variables: placement on the left–right political scale and position on European unification.

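data2 <- data0[, c(81,  90)] # lrscale and euftf
# Recode the left-right scale into three groups.
# Note: with the defaults of cut() the intervals are (0, 4], (4, 6], (6, 10],
# so an answer of exactly 0 falls outside every interval and is coded NA.
data2$lrscale  <- cut(data2$lrscale,
                     breaks=c(0, 4, 6, 10),
                     labels=c("left","center", "right"))
summary(data2$lrscale)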

##   left center  right   NA's 
##    545    985    426    308

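# Recode the EU item: below 5 = "EU gone too far", exactly 5 = the scale midpoint
# (labelled "don't know"), above 5 = "EU should go further".
# As with lrscale, an answer of exactly 0 is coded NA with these breaks.
data2$euftf <- cut(data2$euftf,
                   breaks=c(0, 4.99, 5.01, 10),
                   labels=c("EU gone too far","don't know", "EU should go further"))
summary(data2$euftf)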

##      EU gone too far           don't know EU should go further 
##                  891                  530                  402 
##                 NA's 
##                  441

data2 <- na.omit(data2)
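One detail of `cut()` matters for the recoding above: by default the intervals are open on the left, so the lowest break itself is not covered. The sketch below applies the same breaks to every possible answer from 0 to 10 (the vector `scores` is made up for illustration) and compares the default with `include.lowest = TRUE`.

```r
scores <- 0:10  # every possible answer on the 0-10 scale

# Default: intervals are (0, 4], (4, 6], (6, 10], so 0 is not covered
default_cut <- cut(scores, breaks = c(0, 4, 6, 10),
                   labels = c("left", "center", "right"))

# include.lowest = TRUE closes the first interval, giving [0, 4]
closed_cut <- cut(scores, breaks = c(0, 4, 6, 10),
                  labels = c("left", "center", "right"),
                  include.lowest = TRUE)

print(data.frame(scores, default_cut, closed_cut))
```

With the default settings, any respondent who answered 0 is coded NA and then dropped by `na.omit()`, which matters here because many respondents chose 0 on the EU question.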

#Create contingency table for the further analysis
ctable<-t(table(data2))
ctable

##                       lrscale
## euftf                  left center right
##   EU gone too far       198    409   194
##   don't know            144    264    71
##   EU should go further  141    162    57

# Chi-Square Test 
chisqresults <- chisq.test(ctable) 
chisqresults

## 
##  Pearson's Chi-squared test
## 
## data:  ctable
## X-squared = 38.701, df = 4, p-value = 8.031e-08

The chi-squared test is significant: p-value = 8.031e-08, which is well below 0.05. The distribution across these categories differs significantly, so the row and the column variables are statistically significantly associated.
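The test statistic can be reproduced by hand from the contingency table printed above. The sketch below re-enters those counts (the object names `ctab`, `expected`, `chi2` are mine) and computes the expected counts under independence, the Pearson statistic, the degrees of freedom and the p-value.

```r
# Contingency table copied from the output above
ctab <- matrix(c(198, 409, 194,
                 144, 264,  71,
                 141, 162,  57),
               nrow = 3, byrow = TRUE,
               dimnames = list(
                 euftf   = c("EU gone too far", "don't know", "EU should go further"),
                 lrscale = c("left", "center", "right")))

n <- sum(ctab)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(ctab), colSums(ctab)) / n

# Pearson chi-squared statistic, degrees of freedom and upper-tail p-value
chi2 <- sum((ctab - expected)^2 / expected)
df   <- (nrow(ctab) - 1) * (ncol(ctab) - 1)
p    <- pchisq(chi2, df = df, lower.tail = FALSE)

print(round(expected, 1))
print(c(chi2 = chi2, df = df, p = p))
```

The largest deviations from the expected counts are in the corner cells, for example more left-wing respondents than expected in “EU should go further”.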

library(ca)
library(FactoMineR) 
ca <- ca(ctable) 
names(ca)

##  [1] "sv"         "nd"         "rownames"   "rowmass"    "rowdist"   
##  [6] "rowinertia" "rowcoord"   "rowsup"     "colnames"   "colmass"   
## [11] "coldist"    "colinertia" "colcoord"   "colsup"     "N"         
## [16] "call"

summary(ca)

## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.018957  80.3  80.3  ********************     
##  2      0.004642  19.7 100.0  *****                    
##         -------- -----                                 
##  Total: 0.023598 100.0                                 
## 
## 
## Rows:
##     name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
## 1 | EUgn |  488 1000  379 | -133 973 459 |   22  27  52 |
## 2 | dntk |  292 1000  190 |   74 356  84 | -100 644 624 |
## 3 | EUsh |  220 1000  430 |  199 852 457 |   83 148 324 |
## 
## Columns:
##     name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
## 1 | left |  295 1000  457 |  184 922 525 |   53  78 181 |
## 2 | cntr |  509 1000  107 |  -25 128  17 |  -66 872 474 |
## 3 | rght |  196 1000  436 | -210 844 458 |   90 156 345 |

ca2 <- CA(ctable)

summary.CA(ca2)

## 
## Call:
## CA(X = ctable) 
## 
## The chi square of independence between the two variables is equal to 38.70089 (p-value =  8.031234e-08 ).
## 
## Eigenvalues
##                        Dim.1   Dim.2
## Variance               0.019   0.005
## % of var.             80.331  19.669
## Cumulative % of var.  80.331 100.000
## 
## Rows
##                        Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr
## EU gone too far      |     8.946 |  0.133 45.909  0.973 |  0.022  5.250
## don't know           |     4.493 | -0.074  8.433  0.356 | -0.100 62.360
## EU should go further |    10.159 | -0.199 45.658  0.852 |  0.083 32.391
##                        cos2  
## EU gone too far       0.027 |
## don't know            0.644 |
## EU should go further  0.148 |
## 
## Columns
##                        Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr
## left                 |    10.786 | -0.184 52.472  0.922 |  0.053 18.077
## center               |     2.522 |  0.025  1.705  0.128 | -0.066 47.381
## right                |    10.290 |  0.210 45.823  0.844 |  0.090 34.543
##                        cos2  
## left                  0.078 |
## center                0.872 |
## right                 0.156 |

On the map, we see that the left is closest to the view that European unification should go further. The right is closer to the answer “EU gone too far”. People with moderate political views are closest to the middle category, that is, they have not decided on their position.

fviz_screeplot(ca2)

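We have got only 2 dimensions, and together they explain all of the variance. This is expected for a 3 × 3 table, which has at most min(3 − 1, 3 − 1) = 2 dimensions.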
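The principal inertias reported above can be recovered with base R alone. The following sketch assumes the standard formulation of correspondence analysis as a singular value decomposition of the standardized residuals; the object names are mine, and the counts are copied from the contingency table printed earlier.

```r
# Contingency table copied from the output above
ctab <- matrix(c(198, 409, 194,
                 144, 264,  71,
                 141, 162,  57),
               nrow = 3, byrow = TRUE)

P      <- ctab / sum(ctab)  # correspondence matrix
r_mass <- rowSums(P)        # row masses
c_mass <- colSums(P)        # column masses

# Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j)
S <- (P - outer(r_mass, c_mass)) / sqrt(outer(r_mass, c_mass))

# Squared singular values are the principal inertias (eigenvalues)
inertia <- svd(S)$d^2
n_dims  <- min(nrow(ctab) - 1, ncol(ctab) - 1)

print(round(inertia[seq_len(n_dims)], 6))
print(round(100 * inertia[seq_len(n_dims)] / sum(inertia), 1))
```

The total inertia equals the chi-squared statistic divided by the sample size, which ties the map back to the test of independence.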

Conclusion

Considering which of the clustering methods showed the better results, I chose k-means. After conducting the cluster analysis, I identified three groups of UK citizens.

In the first group, people want European unification to continue, and they are satisfied with the way democracy works in the country and with the state of the economy. These are people with moderate political views. They most likely voted against Brexit.

The second group consists of people with a center-left perspective (in British politics). These people express dissatisfaction with the way democracy works and with the current state of the economy. They have a slightly negative attitude towards European unification. Most likely they would have voted for Brexit.

The third group holds center-right political views (I assume that these are Conservative supporters). This group is strongly convinced that European integration has gone too far. I think this group almost certainly voted for Brexit.