My research is dedicated to identify significant groups of UK citizens by their political and economic preferences on the eve of Brexit. For this purpose, we will carry out cluster analysis. Were chosen variables connected with feeling closer to a particular party, voting in the last election, satisfaction with economy and democracy in country. I suppose that the commitment to the party will be connected with voting (or not) in the election and, moreover, it is interesting to know how satisfied British people with present state (economic and political situation) in their own country.
df <- read_spss("ESS7GB.sav") #2264 obs. of 601 variables
df1 <- df[, c(26, 58, 83, 85, 158)] #choose 5 variables
#It's interesting that we have NAs mostly in variables about the satisfacion with economy and democracy in country. I think, such questions may be sensitive what caused the lack of response. Also, there are few NAs in variable about voting (I guess, the same reason)
df1 <- na.omit(df1) #It becomes 2145 obs. instead of 2264.
any(is.na(df1))
## [1] FALSE
df1 = dplyr::filter(df, ctzcntr == "1") #Now, we have only British citizens (2135 obs.)
df1 = dplyr::select(df1,"clsprty", "vote", "stfeco", "stfdem" )
feature_name <- c("close.party", "vote", "sat.econom", "sat.dem")
colnames(df1) <- feature_name
str(df1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2135 obs. of 4 variables:
## $ close.party:Class 'haven_labelled' atomic [1:2135] 2 2 1 2 2 2 1 2 2 1 ...
## .. ..- attr(*, "label")= chr "Feel closer to a particular party than all other parties"
## .. ..- attr(*, "labels")= Named num [1:5] 1 2 7 8 9
## .. .. ..- attr(*, "names")= chr [1:5] "Yes" "No" "Refusal" "Don't know" ...
## $ vote :Class 'haven_labelled' atomic [1:2135] 1 1 1 1 2 2 1 2 2 2 ...
## .. ..- attr(*, "label")= chr "Voted last national election"
## .. ..- attr(*, "labels")= Named num [1:6] 1 2 3 7 8 9
## .. .. ..- attr(*, "names")= chr [1:6] "Yes" "No" "Not eligible to vote" "Refusal" ...
## $ sat.econom :Class 'haven_labelled' atomic [1:2135] 5 2 0 6 1 1 6 7 7 7 ...
## .. ..- attr(*, "label")= chr "How satisfied with present state of economy in country"
## .. ..- attr(*, "labels")= Named num [1:14] 0 1 2 3 4 5 6 7 8 9 ...
## .. .. ..- attr(*, "names")= chr [1:14] "Extremely dissatisfied" "1" "2" "3" ...
## $ sat.dem :Class 'haven_labelled' atomic [1:2135] 7 6 3 7 1 1 5 7 4 7 ...
## .. ..- attr(*, "label")= chr "How satisfied with the way democracy works in country"
## .. ..- attr(*, "labels")= Named num [1:14] 0 1 2 3 4 5 6 7 8 9 ...
## .. .. ..- attr(*, "names")= chr [1:14] "Extremely dissatisfied" "1" "2" "3" ...
summary(df1)
## close.party vote sat.econom sat.dem
## Min. :1.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median :1.000 Median :1.000 Median : 5.000 Median : 5.000
## Mean :1.455 Mean :1.333 Mean : 4.626 Mean : 5.081
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.: 6.000 3rd Qu.: 7.000
## Max. :2.000 Max. :3.000 Max. :10.000 Max. :10.000
## NA's :8 NA's :7 NA's :36 NA's :72
clsprty - Feel closer to a particular party than all other parties - nominal; 1 - Yes, 2 - No
vote - Voted last national election - nominal; 1 - Yes, 2 - No, 3 - Not eligible to vote
stfeco - How satisfied with present state of economy in country - ordinal; from 0 to 10 (from less satisfies to more).
stfdem - How satisfied with the way democracy works in country - ordinal; from 0 to 10 (from less satisfies to more).
df1$close.party = as.factor(df1$close.party)
df1$vote = as.factor(df1$vote)
df1$sat.dem = as.numeric(df1$sat.dem)
df1$sat.econom = as.numeric(df1$sat.econom)
df1 <- na.omit(df1)
str(df1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2030 obs. of 4 variables:
## $ close.party: Factor w/ 2 levels "1","2": 2 2 1 2 2 2 1 2 2 1 ...
## $ vote : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 1 2 2 2 ...
## $ sat.econom : num 5 2 0 6 1 1 6 7 7 7 ...
## $ sat.dem : num 7 6 3 7 1 1 5 7 4 7 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:105] 21 27 31 87 89 98 108 134 160 162 ...
## .. ..- attr(*, "names")= chr [1:105] "21" "27" "31" "87" ...
summary(df1)
## close.party vote sat.econom sat.dem
## 1:1127 1:1456 Min. : 0.000 Min. : 0.000
## 2: 903 2: 506 1st Qu.: 3.000 1st Qu.: 3.000
## 3: 68 Median : 5.000 Median : 5.000
## Mean : 4.633 Mean : 5.077
## 3rd Qu.: 6.000 3rd Qu.: 7.000
## Max. :10.000 Max. :10.000
We have mixed type: 2 factor variables (close.party and vote) and 2 numeric (sat.econom and sat.dem).
According to our mixed type data, we use Gower distance and then -> PAM (partitioning around medoids)
gower_dist <- daisy(df1,
metric = "gower",
type = list(logratio = 3))
summary(gower_dist)
## 2059435 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1667 0.4000 0.3999 0.5667 1.0000
## Metric : mixed ; Types = N, N, I, I
## Number of objects : 2030
Firstly, we determine the number of clusters usyng silhouette width.
gower_mat <- as.matrix(gower_dist)
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(gower_dist,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
plot(1:10, sil_width,
xlab = "Number of clusters", xaxt='n',
ylab = "Silhouette Width")
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)
According to the graph, to choose 4 cluster would be the best solution, it has the highest point of width.
pam_fit <- pam(gower_dist, diss = TRUE, k = 4)
pam_results <- df1 %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
## close.party vote sat.econom sat.dem cluster
## 1: 0 1:525 Min. : 0.000 Min. : 0.000 Min. :1
## 2:555 2: 0 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.:1
## 3: 30 Median : 5.000 Median : 5.000 Median :1
## Mean : 4.706 Mean : 5.166 Mean :1
## 3rd Qu.: 6.000 3rd Qu.: 7.000 3rd Qu.:1
## Max. :10.000 Max. :10.000 Max. :1
##
## [[2]]
## close.party vote sat.econom sat.dem cluster
## 1:952 1:931 Min. : 0.000 Min. : 0.000 Min. :2
## 2: 0 2: 0 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.:2
## 3: 21 Median : 5.000 Median : 6.000 Median :2
## Mean : 4.888 Mean : 5.513 Mean :2
## 3rd Qu.: 7.000 3rd Qu.: 7.000 3rd Qu.:2
## Max. :10.000 Max. :10.000 Max. :2
##
## [[3]]
## close.party vote sat.econom sat.dem cluster
## 1: 0 1: 0 Min. : 0.000 Min. : 0.000 Min. :3
## 2:348 2:338 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.:3
## 3: 10 Median : 4.000 Median : 4.000 Median :3
## Mean : 4.057 Mean : 4.086 Mean :3
## 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.:3
## Max. :10.000 Max. :10.000 Max. :3
##
## [[4]]
## close.party vote sat.econom sat.dem cluster
## 1:175 1: 0 Min. :0.000 Min. : 0.000 Min. :4
## 2: 0 2:168 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.:4
## 3: 7 Median :4.000 Median : 4.000 Median :4
## Mean :4.154 Mean : 4.394 Mean :4
## 3rd Qu.:6.000 3rd Qu.: 7.000 3rd Qu.:4
## Max. :9.000 Max. :10.000 Max. :4
Let’s look inside our clusters!
CLUSTER 1 (555 obj.): there are respondents not feeling closer to particular party and voting in the last election with the average levels of satisfaction with economy and democracy. However, comparing with others cluster these respondents seems more satisfied than others (esp. democracy).
CLUSTER 2 (952 obj.): it is the largest cluster including respondents feeling closer to particular party and voting in the last election. Here is observed the highest level of satisfaction with democracy and economy among others (esp. democracy).
CLUSTER 3 (348 obj.): it consists of respondents not feeling closer to particular party and not voting in the last election with below the average levels of satisfaction with economy and democracy (approx. at the same level). Among other clusters these respondents least satisfied both with ecomy amd democracy.
CLUSTER 4 (175 obj.): it is the smallest cluster inclusive respondents feeling closer to particular party and not voting in the last election. Their levels of satisfaction with economy and democracy is close to Cluster 3 - below the average.
According to that descriprion, I can suppose that Cluster 1, 2 contrasted with Cluster 3,4 by voting in the last election and levels of satisfaction with economy amd democracy. It is worth noting that I decide not removing from “vote” variable those respondents who are not eligible to vote (but they are still citizens of GB). It was interesting to me looking which cluster contains more such respondents. As a result, Cluster 1 and Cluster 2 (where the majority voted and levels of satisfaction higher than in other clusters) include more respondents not eligible to vote than Cluster 3 and Cluster 4.
library(Rtsne)
tsne_obj <- Rtsne::Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering))
ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = cluster))
According to the plot, there is no clear distinction on clusters. Even so, I can see that Cluster 2 is the biggest (as we said above) and scattered in large “dots” groups (around 5) locating in center. And the smallest Cluster 4 “linearly” located relatively close to each other.
Let’s start again with the identification of clusters’ number by creating a heatmap.
heatmap(as.matrix(gower_dist), symm = T,
distfun = function(x) as.dist(x))
At first glance, it seems we have 4 big cluster like in PAM, however, there is one more small cluster which I think can also be included. Overall, we have 5 clusters by looking closer at the heatmap.
Before clustering we will find more optimal method by compute the Mantel criteria.
library(vegan)
hc_list <- list(hc1 <- hclust(gower_dist,"complete"), hc2 <- hclust(gower_dist,"average"), hc3 <- hclust(gower_dist, "ward.D2"))
Coph <- rbind(MantelStat <- unlist(lapply(hc_list, function (hc) mantel(gower_dist, cophenetic(hc))$statistic)), MantelP <- unlist(lapply(hc_list, function (hc) mantel(gower_dist, cophenetic(hc))$signif)))
colnames(Coph) <- c("Complete","Average","Ward.D2")
rownames(Coph) <- c("W","Р-value")
round(Coph, 3)
## Complete Average Ward.D2
## W 0.864 0.870 0.845
## Р-value 0.001 0.001 0.001
An hour later… We got the table comparing various clustering methods. All methods are significant by p-value and according to the Mantel criteria (W) - “average” method is better for our clustering, its value more than others (0.87).
hclust_avg <- hclust(gower_dist, method = 'average')
plot(as.dendrogram(hclust_avg), xlab = "", sub="", ylab = ,
main = "Dendrogram")
cut_avg <- cutree(hclust_avg, k = 5)
Let’s see if it was a good decisions to make 5 clusters…
cl <- cutree(hclust_avg, k = 5)
df2 <- mutate(df1, cluster = cl)
describeBy(df2, df2$cluster)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew
## close.party* 1 525 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## vote* 2 525 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## sat.econom 3 525 4.66 2.07 5 4.71 2.97 0 10 10 -0.16
## sat.dem 4 525 5.09 2.34 5 5.14 2.97 0 10 10 -0.16
## cluster 5 525 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## kurtosis se
## close.party* NaN 0.00
## vote* NaN 0.00
## sat.econom -0.42 0.09
## sat.dem -0.48 0.10
## cluster NaN 0.00
## --------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew
## close.party* 1 931 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## vote* 2 931 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## sat.econom 3 931 4.89 2.18 5 4.98 2.97 0 10 10 -0.33
## sat.dem 4 931 5.49 2.48 6 5.59 2.97 0 10 10 -0.33
## cluster 5 931 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## kurtosis se
## close.party* NaN 0.00
## vote* NaN 0.00
## sat.econom -0.48 0.07
## sat.dem -0.60 0.08
## cluster NaN 0.00
## --------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew
## close.party* 1 338 2.00 0.00 2 2.0 0.00 2 2 0 NaN
## vote* 2 338 2.00 0.00 2 2.0 0.00 2 2 0 NaN
## sat.econom 3 338 4.06 2.24 4 4.1 2.97 0 10 10 -0.11
## sat.dem 4 338 4.12 2.49 4 4.1 2.97 0 10 10 0.02
## cluster 5 338 3.00 0.00 3 3.0 0.00 3 3 0 NaN
## kurtosis se
## close.party* NaN 0.00
## vote* NaN 0.00
## sat.econom -0.66 0.12
## sat.dem -0.63 0.14
## cluster NaN 0.00
## --------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max range skew
## close.party* 1 168 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## vote* 2 168 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## sat.econom 3 168 4.21 2.36 4 4.24 2.97 0 9 9 -0.04
## sat.dem 4 168 4.48 2.58 4 4.53 2.97 0 10 10 -0.06
## cluster 5 168 4.00 0.00 4 4.00 0.00 4 4 0 NaN
## kurtosis se
## close.party* NaN 0.00
## vote* NaN 0.00
## sat.econom -0.67 0.18
## sat.dem -0.89 0.20
## cluster NaN 0.00
## --------------------------------------------------------
## group: 5
## vars n mean sd median trimmed mad min max range skew
## close.party* 1 68 1.59 0.50 2 1.61 0.00 1 2 1 -0.35
## vote* 2 68 3.00 0.00 3 3.00 0.00 3 3 0 NaN
## sat.econom 3 68 4.84 2.02 5 4.86 2.22 0 10 10 -0.06
## sat.dem 4 68 5.65 2.14 6 5.75 1.48 1 9 8 -0.44
## cluster 5 68 5.00 0.00 5 5.00 0.00 5 5 0 NaN
## kurtosis se
## close.party* -1.90 0.06
## vote* NaN 0.00
## sat.econom -0.48 0.24
## sat.dem -0.43 0.26
## cluster NaN 0.00
By hierarchical method were formed 5 clusters. Cluster 1-4 is the same with clusters resulting by PAM (even amount of objectives). Here I’m interesting in Cluster 5, let’s take a closer look at it:
CLUSTER 5 consists of 68 obj., so-called, private party :)
This cluster include respondents who are not eligible to vote - that’s great, because I want to find out more information about this group. Most of them don’t feel closer to a particular party than all other parties but there are some who does. Levels of their satisfaction with economy and democracy is higher than in other clusters (similar result only in Cluster 2) - average and above.
Let’s make our clustering more visual and colorful!
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.7, col = "blue")
hclust_dend <- as.dendrogram(hclust_avg)
plot(hclust_dend, xlab = "", sub="", ylab = "Euclidean distance",
main = "Dendrogram", nodePar = nodePar)
rect.dendrogram(hclust_dend , k=5, border="red")
avg_dend_obj <- as.dendrogram(hclust_avg)
hclust_dend2 <- color_branches(hclust_dend, h = 0.3)
plot(hclust_dend2)
Hierarchical vs Non-hierarchical
Having considered the clustered obtained using these methods, I can conclude that hierarchical clustering gives us more divided clusters. First of all, it is connected with groups of respondents who are not eligible to vote - in non-hierarchical were formed 4 clusters without a separate group of non-eligible to vote people. This could lead to “a mishmash” in clusters. Accordingly, I would suggest to choose Hierarchical method. Maybe, if we excluded its group from dataset, we would get 4 clusters by both methods and could use them equally well.