Data source: European Social Survey European Research Infrastructure Consortium (ESS ERIC)
ESS round 8 (2016) edition 2.3 (published 24.11.23). Welfare attitudes, Attitudes to climate change.
Research question:
How can persons aged 15+ living in Slovenia be classified using these data? Which insights about the population can we gain from them?
For such tasks we use clustering: it divides units into groups, which we can then compare to find differences and similarities, and explore separately to gain more insight into their properties.
mydataSource <- read.table("./ess8_23.csv", header = TRUE, sep = ",", dec = ".", quote = "\"")
library(dplyr)
mydata <- mydataSource[c("idno", "wrclmch", "imprich", "ppltrst", "netustm", "polintr", "happy", "gndr", "cntry", "agea")]
names(mydata) <- c("ID", "ClimateChanging", "ImpRich", "PplTrust", "InternetUse", "PoliticsInt", "Happiness", "Gender", "Country", "Age")
mydata$GenderF <- factor(mydata$Gender, levels=c(1,2), labels=c("M", "F"))
Sampling:
Clusters of Enumeration Areas (CEA) and the Central Register of Population (CRP).
Two-stage probability sampling design with stratification at the first stage.
Probability: Multistage
Units: persons aged 15 and over living in Slovenia.
Sample size: 1307 persons.
Variables:
ID
ClimateChanging: How worried are you about climate change? (1 - Not at all worried, 5 - Extremely worried)
ImpRich: It is important to her/him to be rich. She/he wants to have a lot of money and expensive things. (1 = It’s very much like me, 6 = Not like me at all)
PplTrust: Would you say that most people can be trusted, or that you can’t be too careful in dealing with people? (0 = You can’t be too careful, 10 = Most people can be trusted)
InternetUse: On a typical day, about how much time do you spend using the internet on a computer, tablet, smartphone or other device, whether for work or personal use? In minutes.
PoliticsInt: How interested would you say you are in politics - are you… (1 = Very interested, 4 = Not at all interested)
Happiness: Taking all things together, how happy would you say you are? (0 = Extremely unhappy, 10 = Extremely happy)
Gender: Respondent’s gender. (1 = Male, 2 = Female)
Country: Country code (SI = Slovenia)
Age: Respondent’s age in years.
# Filtering out units with unanswered questions (ESS missing-value codes)
mydata <- mydata[mydata$ClimateChanging < 6, ]
mydata <- mydata[mydata$ImpRich < 7, ]
mydata <- mydata[mydata$PplTrust < 11, ]
mydata <- mydata[mydata$InternetUse < 6666, ]
mydata <- mydata[mydata$PoliticsInt < 5, ]
mydata <- mydata[mydata$Happiness < 11, ]
mydata <- mydata[mydata$Gender < 3, ]
# Keeping only data for Slovenia
mydata <- filter(mydata, Country == "SI")
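For reference, all of the row filters above can be expressed in a single dplyr pipeline. This is a sketch that assumes the same ESS missing-value codes; it is idempotent after the steps above:
mydata <- mydata %>%
  filter(ClimateChanging < 6, ImpRich < 7, PplTrust < 11,
         InternetUse < 6666, PoliticsInt < 5, Happiness < 11,
         Gender < 3, Country == "SI")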
head(mydata)
## ID ClimateChanging ImpRich PplTrust InternetUse PoliticsInt Happiness Gender
## 1 1 3 6 5 30 1 10 2
## 2 2 3 6 7 300 3 9 2
## 3 3 2 3 3 120 3 9 1
## 4 4 3 3 4 60 3 7 1
## 5 7 2 6 2 120 3 9 1
## 6 9 4 6 5 240 3 8 2
## Country Age GenderF
## 1 SI 57 F
## 2 SI 41 F
## 3 SI 29 M
## 4 SI 44 M
## 5 SI 32 M
## 6 SI 29 F
The “Not applicable” code on InternetUse cuts off quite a lot of data, so we likely lose a group (people who do not use the internet) that could otherwise show up in the clustering. If internet use is essential to the analysis, an option would be the ordinal frequency variable instead (“How often do you use the internet on these or any other devices, whether for work or personal use?”, Never … Every day).
I decided to accept this loss in order to keep a numeric ratio-scale variable among the clustering variables.
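A sketch of that alternative (not used further below): I assume the ESS8 column for the frequency question is named netusoft (1 = Never … 5 = Every day; codes 7/8/9 denote refusal / don't know / no answer). Keeping it in a separate data frame avoids touching mydata:
alt <- mydataSource[c("idno", "netusoft")] # assumed column name for the frequency item
alt <- alt[alt$netusoft < 6, ]             # drop missing-value codes 7/8/9
table(alt$netusoft)                        # distribution of internet-use frequency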
library(pastecs)
round(stat.desc(mydata), 2)
## ID ClimateChanging ImpRich PplTrust InternetUse PoliticsInt
## nbr.val 747.00 747.00 747.00 747.00 747.00 747.00
## nbr.null 0.00 0.00 0.00 46.00 1.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00
## min 1.00 1.00 1.00 0.00 0.00 1.00
## max 1307.00 5.00 6.00 10.00 1440.00 4.00
## range 1306.00 4.00 5.00 10.00 1440.00 3.00
## sum 490770.00 2373.00 3182.00 3486.00 129885.00 2065.00
## median 656.00 3.00 5.00 5.00 120.00 3.00
## mean 656.99 3.18 4.26 4.67 173.88 2.76
## SE.mean 13.60 0.03 0.05 0.09 5.91 0.03
## CI.mean.0.95 26.70 0.06 0.09 0.17 11.60 0.06
## var 138143.50 0.62 1.54 5.71 26092.67 0.73
## std.dev 371.68 0.79 1.24 2.39 161.53 0.85
## coef.var 0.57 0.25 0.29 0.51 0.93 0.31
## Happiness Gender Country Age GenderF
## nbr.val 747.00 747.00 NA 747.00 NA
## nbr.null 2.00 0.00 NA 0.00 NA
## nbr.na 0.00 0.00 NA 0.00 NA
## min 0.00 1.00 NA 15.00 NA
## max 10.00 2.00 NA 94.00 NA
## range 10.00 1.00 NA 79.00 NA
## sum 5859.00 1136.00 NA 29618.00 NA
## median 8.00 2.00 NA 40.00 NA
## mean 7.84 1.52 NA 39.65 NA
## SE.mean 0.06 0.02 NA 0.56 NA
## CI.mean.0.95 0.12 0.04 NA 1.10 NA
## var 2.69 0.25 NA 236.54 NA
## std.dev 1.64 0.50 NA 15.38 NA
## coef.var 0.21 0.33 NA 0.39 NA
# Standardizing
mydata$ClimateChanging_z <- scale(mydata$ClimateChanging)
mydata$ImpRich_z <- scale(mydata$ImpRich)
mydata$InternetUse_z <- scale(mydata$InternetUse)
mydata$PplTrust_z <- scale(mydata$PplTrust)
mydata$PoliticsInt_z <- scale(mydata$PoliticsInt)
mydata$Happiness_z <- scale(mydata$Happiness)
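The six scale() calls can also be written compactly with dplyr::across(); wrapping each result in as.numeric() additionally avoids the one-column-matrix columns that scale() creates (these can trigger a dplyr deprecation warning in later filter() calls). A behavior-identical sketch:
mydata <- mydata %>%
  mutate(across(c(ClimateChanging, ImpRich, InternetUse, PplTrust, PoliticsInt, Happiness),
                ~ as.numeric(scale(.x)),
                .names = "{.col}_z"))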
library(Hmisc)
rcorr(as.matrix(mydata[, c("ClimateChanging_z", "ImpRich_z", "InternetUse_z", "PplTrust_z", "PoliticsInt_z", "Happiness_z")]), type="pearson")
## ClimateChanging_z ImpRich_z InternetUse_z PplTrust_z
## ClimateChanging_z 1.00 -0.02 0.03 -0.01
## ImpRich_z -0.02 1.00 -0.11 0.01
## InternetUse_z 0.03 -0.11 1.00 0.05
## PplTrust_z -0.01 0.01 0.05 1.00
## PoliticsInt_z -0.02 -0.01 -0.01 -0.14
## Happiness_z -0.08 0.04 0.04 0.13
## PoliticsInt_z Happiness_z
## ClimateChanging_z -0.02 -0.08
## ImpRich_z -0.01 0.04
## InternetUse_z -0.01 0.04
## PplTrust_z -0.14 0.13
## PoliticsInt_z 1.00 0.04
## Happiness_z 0.04 1.00
##
## n= 747
##
##
## P
## ClimateChanging_z ImpRich_z InternetUse_z PplTrust_z
## ClimateChanging_z 0.5671 0.4617 0.6974
## ImpRich_z 0.5671 0.0029 0.6868
## InternetUse_z 0.4617 0.0029 0.1366
## PplTrust_z 0.6974 0.6868 0.1366
## PoliticsInt_z 0.5898 0.8548 0.8482 0.0000
## Happiness_z 0.0351 0.2544 0.2425 0.0002
## PoliticsInt_z Happiness_z
## ClimateChanging_z 0.5898 0.0351
## ImpRich_z 0.8548 0.2544
## InternetUse_z 0.8482 0.2425
## PplTrust_z 0.0000 0.0002
## PoliticsInt_z 0.2246
## Happiness_z 0.2246
The clustering variables are at most weakly correlated (|r| of at most 0.14), which is good for clustering: each variable carries distinct information.
After trying various combinations of variables for clustering I settled on three of them:
importance of being rich, internet use, and interest in politics. I will use these variables for the classification.
We assume there is a theoretical basis for classifying along these dimensions.
mydata$Dissimilarity <- as.numeric(sqrt(mydata$ImpRich_z^2 + mydata$InternetUse_z^2 + mydata$PoliticsInt_z^2)) # Euclidean distance from the overall centroid; as.numeric() drops the one-column matrix that scale() produces
head(mydata[order(-mydata$Dissimilarity), c("Dissimilarity")], 20)
## [,1]
## [1,] 7.993320
## [2,] 4.926979
## [3,] 4.694895
## [4,] 4.138332
## [5,] 4.091771
## [6,] 3.995694
## [7,] 3.850469
## [8,] 3.548261
## [9,] 3.405282
## [10,] 3.405282
## [11,] 3.398627
## [12,] 3.345353
## [13,] 3.328939
## [14,] 3.320717
## [15,] 3.251921
## [16,] 3.188211
## [17,] 3.176468
## [18,] 3.176468
## [19,] 3.176468
## [20,] 3.119500
mydata <- filter(mydata, Dissimilarity < 3.8) # removing the 7 most dissimilar units (Dissimilarity > 3.8), leaving 740 of 747
library(ggplot2)
ggplot(mydata, aes(x = Dissimilarity)) +
geom_histogram(binwidth = 0.1, colour = "Black")
library(factoextra)
distance2 <- get_dist(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], method = "euclidean") ^ 2 # squared Euclidean distances
fviz_dist(distance2)
get_clust_tendency(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], n=nrow(mydata)-1, graph = FALSE)
## $hopkins_stat
## [1] 0.7283589
##
## $plot
## NULL
The Hopkins statistic (0.73) is above 0.5, so the data show a clustering tendency and are suitable for classification.
WARD <- mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 740
For the first stage we use Ward's method (ward.D2) on Euclidean distances; ward.D2 implements Ward's criterion by squaring the dissimilarities internally.
fviz_dend(WARD)
After trying several combinations of variables and numbers of groups, I cut the dendrogram at four groups.
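Before committing to k = 4, the choice can be cross-checked with a standard index; a sketch using factoextra's average-silhouette criterion with the same Ward linkage (hcut() defaults to ward.D2):
fviz_nbclust(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], hcut, method = "silhouette", k.max = 8)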
mydata$clusterWard <- cutree(WARD, k = 4)
init_leaders <- aggregate(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], by = list(mydata$clusterWard), FUN = mean)
init_leaders
## Group.1 ImpRich_z InternetUse_z PoliticsInt_z
## 1 1 0.06416173 -0.4012664 -1.0952644
## 2 2 0.79410172 -0.3685200 0.6839276
## 3 3 -0.97510248 -0.2063242 0.6889884
## 4 4 -0.19388327 1.7531596 -0.3776804
KMeans <- hkmeans(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], k=4, hc.metric = "euclidean", hc.method = "ward.D2")
KMeans
## Hierarchical K-means clustering with 4 clusters of sizes 223, 286, 143, 88
##
## Cluster means:
## ImpRich_z InternetUse_z PoliticsInt_z
## 1 0.1955457 -0.3421615 -1.1014149
## 2 0.6081403 -0.3412554 0.6944857
## 3 -1.3933948 -0.1872885 0.4812825
## 4 -0.1544164 1.9355800 -0.2700538
##
## Clustering vector:
## [1] 1 2 3 3 2 2 1 2 2 1 2 2 4 2 1 3 3 2 1 4 2 3 2 2 1 1 1 1 3 1 1 1 3 2 1 1 3
## [38] 1 1 2 3 1 2 3 2 2 1 2 2 3 2 2 2 3 2 2 3 1 2 3 2 1 2 2 4 2 4 1 1 4 3 2 3 1
## [75] 1 1 3 2 2 1 2 2 1 1 3 3 2 3 1 2 4 1 4 1 2 1 1 1 4 1 1 1 3 2 4 3 4 3 3 3 2
## [112] 1 4 4 3 1 2 4 1 1 4 2 1 4 1 3 2 1 2 1 4 1 2 1 2 2 2 2 1 2 1 2 1 4 1 1 1 2
## [149] 2 2 2 2 1 3 1 2 2 3 3 4 3 4 1 2 2 2 1 2 1 1 3 3 2 2 1 1 3 3 4 4 2 1 3 3 3
## [186] 4 2 1 1 3 2 2 2 3 2 2 3 3 2 1 2 2 3 2 2 1 3 3 1 1 1 2 1 1 1 2 1 2 2 2 3 1
## [223] 2 2 2 1 4 1 2 2 2 2 2 2 4 1 1 2 1 3 1 1 2 3 4 3 2 1 4 2 4 3 1 2 1 3 2 4 2
## [260] 2 2 1 1 1 2 2 1 2 2 1 1 4 2 2 2 2 2 2 1 2 3 1 4 2 1 1 1 2 2 2 4 2 2 2 1 1
## [297] 2 4 3 1 3 1 1 1 1 1 2 3 2 2 3 2 3 2 2 2 2 2 3 2 2 1 2 3 2 2 2 1 1 1 2 2 4
## [334] 4 2 1 1 4 1 2 1 3 3 1 3 1 2 4 2 1 2 4 4 1 2 3 1 2 4 2 1 4 1 1 4 3 2 2 4 2
## [371] 1 2 3 4 4 2 4 4 1 3 1 4 3 2 4 4 1 3 1 3 2 3 1 3 2 1 1 3 2 3 2 2 1 2 2 2 3
## [408] 2 1 3 1 1 2 2 4 4 3 2 1 1 3 3 4 2 3 1 1 2 2 3 3 3 3 3 2 2 2 3 3 3 2 2 2 1
## [445] 1 2 3 4 3 3 4 3 2 1 3 2 2 2 1 1 1 1 2 2 3 4 2 3 1 4 3 2 3 1 1 2 2 2 3 2 2
## [482] 1 1 1 1 4 2 1 3 2 2 4 1 2 1 1 4 1 1 4 2 2 2 3 2 2 1 2 2 1 2 3 3 4 2 3 1 2
## [519] 2 2 2 1 2 2 2 1 3 2 2 2 4 1 2 4 1 2 2 3 2 1 2 1 1 1 2 2 3 2 2 1 3 2 1 2 2
## [556] 1 3 4 2 3 1 3 2 2 1 4 2 2 2 3 1 2 2 3 2 2 2 1 4 1 1 1 2 3 1 3 3 2 1 2 1 3
## [593] 3 2 1 2 4 4 1 3 3 4 3 3 2 4 3 2 2 2 2 2 1 3 1 1 3 1 2 2 2 1 2 1 2 1 2 4 2
## [630] 4 4 2 1 2 1 1 3 1 3 1 2 1 1 4 1 4 1 1 2 1 2 4 2 1 4 3 3 1 2 2 2 2 3 1 3 2
## [667] 2 1 2 4 4 1 2 2 3 1 2 2 4 2 3 1 3 2 1 1 2 1 3 3 1 2 2 3 1 3 3 2 1 4 4 2 2
## [704] 2 1 3 4 2 2 1 2 2 2 1 4 2 4 3 2 4 4 1 2 1 3 1 2 2 2 2 1 4 3 1 2 3 1 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 256.1372 232.3530 156.7989 181.6751
## (between_SS / total_SS = 59.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
fviz_cluster(KMeans, palette = "jama", repel = FALSE, ggtheme = theme_classic(), geom = c("point")) # points only
fviz_cluster(KMeans, palette = "jama", repel = FALSE, ggtheme = theme_classic()) # with row labels, to identify outliers
mydata[c(65, 644, 380, 16), ] # inspecting the outliers identified in the labelled plot
## ID ClimateChanging ImpRich PplTrust InternetUse PoliticsInt Happiness
## 65 135 3 2 5 480 1 7
## 644 1134 3 2 7 600 2 7
## 380 668 5 1 5 480 3 10
## 16 36 3 2 9 480 4 10
## Gender Country Age GenderF ClimateChanging_z ImpRich_z InternetUse_z
## 65 1 SI 27 M -0.2243804 -1.821892 1.895129
## 644 2 SI 28 F -0.2243804 -1.821892 2.638015
## 380 1 SI 60 M 2.3151975 -2.628145 1.895129
## 16 1 SI 46 M -0.2243804 -1.821892 1.895129
## PplTrust_z PoliticsInt_z Happiness_z Dissimilarity clusterWard
## 65 0.1395228 -2.0689559 -0.5139662 3.345353 4
## 644 0.9766599 -0.8963382 -0.5139662 3.328939 4
## 380 0.1395228 0.2762794 1.3142850 3.251921 4
## 16 1.8137970 1.4488970 1.3142850 3.001685 4
mydata <- filter(mydata, ID !=135 & ID != 1134 & ID != 668 & ID != 36) # removing outliers and repeating the classification
mydata$ClimateChanging_z <- scale(mydata$ClimateChanging) # re-standardizing after outlier removal
mydata$ImpRich_z <- scale(mydata$ImpRich)
mydata$InternetUse_z <- scale(mydata$InternetUse)
mydata$PplTrust_z <- scale(mydata$PplTrust)
mydata$PoliticsInt_z <- scale(mydata$PoliticsInt)
mydata$Happiness_z <- scale(mydata$Happiness)
WARD <- mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 736
fviz_dend(WARD)
mydata$clusterWard <- cutree(WARD, k = 4)
init_leaders <- aggregate(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], by = list(mydata$clusterWard), FUN = mean)
KMeans <- hkmeans(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], k=4, hc.metric = "euclidean", hc.method = "ward.D2")
KMeans
## Hierarchical K-means clustering with 4 clusters of sizes 180, 286, 191, 79
##
## Cluster means:
## ImpRich_z InternetUse_z PoliticsInt_z
## 1 0.49243017 -0.3172048 -1.12940929
## 2 0.59928801 -0.3256894 0.69914557
## 3 -1.34336849 -0.1826724 0.09353864
## 4 -0.04367619 2.3434741 -0.18389672
##
## Clustering vector:
## [1] 1 2 3 3 2 2 3 2 2 1 2 2 4 2 1 3 2 1 4 2 3 2 2 3 1 1 1 3 1 1 1 3 2 1 1 3 1
## [38] 1 2 3 1 2 3 2 2 1 2 2 3 2 2 2 3 2 2 3 3 2 3 2 1 2 2 2 4 3 1 4 3 2 3 1 1 1
## [75] 3 2 2 1 2 2 1 1 3 3 2 3 1 2 4 1 4 1 2 1 1 1 4 1 1 3 3 2 4 3 4 3 3 3 2 1 4
## [112] 3 3 1 2 4 1 1 4 2 1 4 1 3 2 1 2 1 4 1 2 1 2 2 2 2 1 2 1 2 1 4 1 1 1 2 2 2
## [149] 2 2 3 3 1 2 2 3 3 4 3 4 1 2 2 2 1 2 1 3 3 3 2 2 1 3 3 3 4 3 2 3 3 3 3 4 2
## [186] 1 1 3 2 2 2 3 2 2 3 3 2 1 2 2 3 2 2 1 3 3 1 1 1 2 3 1 1 2 1 2 2 2 3 1 2 2
## [223] 2 1 4 1 2 2 2 2 2 2 4 1 3 2 1 3 3 3 2 3 4 3 2 1 4 2 4 3 1 2 3 3 2 4 2 2 2
## [260] 3 1 1 2 2 1 2 2 1 1 4 2 2 2 2 2 2 1 2 3 1 4 2 1 1 3 2 2 2 4 2 2 2 1 1 2 4
## [297] 3 1 3 3 1 1 3 1 2 3 2 2 3 2 3 2 2 2 2 2 3 2 2 1 2 3 2 2 2 1 1 1 2 2 4 4 2
## [334] 1 1 4 1 2 1 3 3 1 3 1 2 4 2 1 2 4 4 3 2 3 3 2 4 2 1 4 3 1 4 3 2 2 3 2 1 2
## [371] 3 3 4 2 3 4 1 3 4 3 2 4 4 3 3 1 3 2 3 1 3 2 1 3 3 2 3 2 2 1 2 2 2 3 2 1 3
## [408] 1 1 2 2 4 4 3 2 3 3 3 3 4 2 3 1 1 2 2 3 3 3 3 3 2 2 2 3 3 3 2 2 2 1 3 2 3
## [445] 4 3 3 4 3 2 1 3 2 2 2 1 1 1 1 2 2 3 4 2 3 1 4 3 2 3 1 1 2 2 2 3 2 2 1 1 1
## [482] 1 4 2 3 3 2 2 4 1 2 1 1 4 1 1 4 2 2 2 3 2 2 1 2 2 1 2 3 3 4 2 3 3 2 2 2 2
## [519] 1 2 2 2 3 3 2 2 2 4 1 2 4 1 2 2 3 2 1 2 3 1 1 2 2 3 2 2 3 3 2 1 2 2 1 3 4
## [556] 2 3 1 3 2 2 1 4 2 2 2 3 3 2 2 3 2 2 2 1 4 1 1 3 2 3 1 3 3 2 1 2 1 3 3 2 1
## [593] 2 4 4 1 3 3 4 3 3 2 4 3 2 2 2 2 2 3 3 1 1 3 1 2 2 2 1 2 3 2 1 2 4 2 4 4 2
## [630] 1 2 3 1 3 1 3 1 2 1 1 1 3 3 1 2 1 2 4 2 3 4 3 3 1 2 2 2 2 3 1 3 2 2 1 2 4
## [667] 4 1 2 2 3 1 2 2 4 2 3 1 3 2 1 1 2 1 3 3 1 2 2 3 1 3 3 2 3 4 4 2 2 2 3 3 4
## [704] 2 2 1 2 2 2 1 4 2 4 3 2 4 4 3 2 1 3 1 2 2 2 2 1 3 3 3 2 3 1 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 167.9850 254.6558 267.6018 161.1347
## (between_SS / total_SS = 61.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
between_SS / total_SS = 61.4% is the share of total variability explained by the grouping. The groups are reasonably well separated, so we can continue.
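Separation can also be judged per unit with silhouette widths; a sketch using the cluster package (units with negative widths lie closer to a neighbouring cluster than to their own):
library(cluster)
sil <- silhouette(KMeans$cluster, dist(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")]))
fviz_silhouette(sil)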
fviz_cluster(KMeans, palette = "jama", repel = FALSE, ggtheme = theme_classic(), geom = c("point"))
mydata$clusterKMeans <- KMeans$cluster
table(mydata$clusterWard, mydata$clusterKMeans)
##
## 1 2 3 4
## 1 173 0 68 5
## 2 0 233 0 0
## 3 0 53 123 2
## 4 7 0 0 72
The K-means step reassigned part of the units relative to the Ward solution; most notably, 68 units moved from Ward group 1 to K-means group 3.
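The cross-table can be condensed into a single agreement measure, the adjusted Rand index (1 = identical partitions, values near 0 = chance-level agreement); a sketch assuming the mclust package is installed:
library(mclust)
adjustedRandIndex(mydata$clusterWard, mydata$clusterKMeans)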
centroids <- KMeans$centers
centroids
## ImpRich_z InternetUse_z PoliticsInt_z
## 1 0.49243017 -0.3172048 -1.12940929
## 2 0.59928801 -0.3256894 0.69914557
## 3 -1.34336849 -0.1826724 0.09353864
## 4 -0.04367619 2.3434741 -0.18389672
library(tidyr)
figure <- as.data.frame(centroids)
figure$id <- 1:nrow(figure)
figure$Groups <- factor(figure$id, levels = c(1,2,3,4), labels=c("1", "2", "3", "4"))
ggplot(pivot_longer(figure, cols=c(ImpRich_z, InternetUse_z, PoliticsInt_z)), aes(x=name, y = value)) +
geom_hline(yintercept=0) +
geom_point(aes(shape=Groups, col=Groups), size=4) +
geom_line(aes(group=id), linewidth = 1) +
ylim(-1.5,2.5)
library(psych)
describeBy(mydata, group = mydata$clusterKMeansF) # Note: clusterKMeansF is created only further below, so describeBy() finds no grouping variable and falls back to overall descriptives (hence the warning)
## Warning in describeBy(mydata, group = mydata$clusterKMeansF): no grouping
## variable requested
## vars n mean sd median trimmed mad min max
## ID 1 736 657.11 372.04 656.50 655.41 475.17 1.00 1307.00
## ClimateChanging 2 736 3.17 0.79 3.00 3.20 0.00 1.00 5.00
## ImpRich 3 736 4.28 1.22 5.00 4.35 1.48 1.00 6.00
## PplTrust 4 736 4.67 2.39 5.00 4.75 2.97 0.00 10.00
## InternetUse 5 736 165.39 143.19 120.00 139.81 88.96 0.00 720.00
## PoliticsInt 6 736 2.76 0.85 3.00 2.78 1.48 1.00 4.00
## Happiness 7 736 7.84 1.64 8.00 8.02 1.48 0.00 10.00
## Gender 8 736 1.52 0.50 2.00 1.53 0.00 1.00 2.00
## Country* 9 736 1.00 0.00 1.00 1.00 0.00 1.00 1.00
## Age 10 736 39.73 15.42 40.00 39.17 19.27 15.00 94.00
## GenderF* 11 736 1.52 0.50 2.00 1.53 0.00 1.00 2.00
## ClimateChanging_z 12 736 0.00 1.00 -0.22 0.03 0.00 -2.76 2.32
## ImpRich_z 13 736 0.00 1.00 0.59 0.06 1.21 -2.68 1.41
## InternetUse_z 14 736 0.00 1.00 -0.32 -0.18 0.62 -1.16 3.87
## PplTrust_z 15 736 0.00 1.00 0.14 0.03 1.24 -1.95 2.23
## PoliticsInt_z 16 736 0.00 1.00 0.28 0.02 1.75 -2.08 1.46
## Happiness_z 17 736 0.00 1.00 0.10 0.11 0.90 -4.78 1.32
## Dissimilarity 18 736 1.54 0.62 1.47 1.50 0.58 0.35 3.55
## clusterWard 19 736 2.12 1.00 2.00 2.03 1.48 1.00 4.00
## clusterKMeans 20 736 2.23 0.94 2.00 2.16 1.48 1.00 4.00
## range skew kurtosis se
## ID 1306.00 0.04 -1.18 13.71
## ClimateChanging 4.00 -0.14 0.64 0.03
## ImpRich 5.00 -0.49 -0.67 0.05
## PplTrust 10.00 -0.23 -0.75 0.09
## InternetUse 720.00 1.59 2.18 5.28
## PoliticsInt 3.00 -0.12 -0.72 0.03
## Happiness 10.00 -1.17 2.04 0.06
## Gender 1.00 -0.09 -1.99 0.02
## Country* 0.00 NaN NaN 0.00
## Age 79.00 0.27 -0.70 0.57
## GenderF* 1.00 -0.09 -1.99 0.02
## ClimateChanging_z 5.08 -0.14 0.64 0.04
## ImpRich_z 4.09 -0.49 -0.67 0.04
## InternetUse_z 5.03 1.59 2.18 0.04
## PplTrust_z 4.18 -0.23 -0.75 0.04
## PoliticsInt_z 3.54 -0.12 -0.72 0.04
## Happiness_z 6.10 -1.17 2.04 0.04
## Dissimilarity 3.20 0.60 0.06 0.02
## clusterWard 3.00 0.41 -0.96 0.04
## clusterKMeans 3.00 0.31 -0.81 0.03
library(rstatix)
mydata$clusterKMeansF <- factor(mydata$clusterKMeans, levels = c(1,2,3,4), labels=c("1", "2", "3", "4"))
# Normality tests for each group and variable
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(ClimateChanging)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 ClimateChanging 0.871 2.81e-11
## 2 2 ClimateChanging 0.848 4.52e-16
## 3 3 ClimateChanging 0.820 4.20e-14
## 4 4 ClimateChanging 0.841 9.54e- 8
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(ImpRich)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 ImpRich 0.843 1.23e-12
## 2 2 ImpRich 0.777 1.86e-19
## 3 3 ImpRich 0.635 5.13e-20
## 4 4 ImpRich 0.882 2.49e- 6
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(PoliticsInt)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 PoliticsInt 0.483 1.08e-22
## 2 2 PoliticsInt 0.606 4.49e-25
## 3 3 PoliticsInt 0.842 4.01e-13
## 4 4 PoliticsInt 0.859 3.72e- 7
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(Age)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Age 0.991 0.349
## 2 2 Age 0.963 0.00000102
## 3 3 Age 0.934 0.000000133
## 4 4 Age 0.976 0.147
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(PplTrust)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 PplTrust 0.959 0.0000370
## 2 2 PplTrust 0.957 0.000000170
## 3 3 PplTrust 0.957 0.0000139
## 4 4 PplTrust 0.929 0.000272
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(Happiness)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Happiness 0.918 1.59e- 8
## 2 2 Happiness 0.872 1.05e-14
## 3 3 Happiness 0.888 9.95e-11
## 4 4 Happiness 0.905 2.23e- 5
Normality holds only for Age in groups 1 and 4; in all other cases it is rejected, so we use non-parametric tests.
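The Shapiro-Wilk results can be checked visually with normal Q-Q plots per group, for example for Age:
ggplot(mydata, aes(sample = Age)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~clusterKMeansF, ncol = 2)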
# Just to see how the distributions of variables look within groups.
ggplot(mydata, aes(x=ImpRich)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=InternetUse)) +
geom_histogram(binwidth=60, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=PoliticsInt)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=Age)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=PplTrust)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=Happiness)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
kruskal.test(ImpRich ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: ImpRich by clusterKMeansF
## Kruskal-Wallis chi-squared = 450.64, df = 3, p-value < 2.2e-16
kruskal.test(InternetUse ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: InternetUse by clusterKMeansF
## Kruskal-Wallis chi-squared = 220.45, df = 3, p-value < 2.2e-16
kruskal.test(PoliticsInt ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: PoliticsInt by clusterKMeansF
## Kruskal-Wallis chi-squared = 394.95, df = 3, p-value < 2.2e-16
wilcox_test(InternetUse ~ clusterKMeansF, paired = FALSE, p.adjust.method = "bonferroni", data = mydata)
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 InternetUse 1 2 180 286 26066. 8.16e- 1 1 e+ 0 ns
## 2 InternetUse 1 3 180 191 14771 1.8 e- 2 1.08e- 1 ns
## 3 InternetUse 1 4 180 79 13 1.08e-37 6.48e-37 ****
## 4 InternetUse 2 3 286 191 23150 4 e- 3 2.6 e- 2 *
## 5 InternetUse 2 4 286 79 32 2.12e-42 1.27e-41 ****
## 6 InternetUse 3 4 191 79 25.5 2.90e-38 1.74e-37 ****
We use non-parametric tests here because normality is not met. For all three clustering variables we reject H0 that the distributions have the same location in every group, so the variables separate the groups well; for InternetUse the pairwise differences are driven mainly by group 4 (groups 1-2 and 1-3 do not differ significantly).
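Significance alone does not tell how strongly each variable separates the groups; rank-based eta-squared effect sizes complement the Kruskal-Wallis tests. A sketch using rstatix::kruskal_effsize():
mydata %>% kruskal_effsize(ImpRich ~ clusterKMeansF)
mydata %>% kruskal_effsize(InternetUse ~ clusterKMeansF)
mydata %>% kruskal_effsize(PoliticsInt ~ clusterKMeansF)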
Next, we validate the classification using variables that were not used to construct it.
chisq_results <- chisq.test(mydata$GenderF, mydata$clusterKMeansF)
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$GenderF and mydata$clusterKMeansF
## X-squared = 34.889, df = 3, p-value = 1.286e-07
addmargins(chisq_results$observed)
## mydata$clusterKMeansF
## mydata$GenderF 1 2 3 4 Sum
## M 88 104 122 37 351
## F 92 182 69 42 385
## Sum 180 286 191 79 736
round(addmargins(chisq_results$expected), 2)
## mydata$clusterKMeansF
## mydata$GenderF 1 2 3 4 Sum
## M 85.84 136.39 91.09 37.68 351
## F 94.16 149.61 99.91 41.32 385
## Sum 180.00 286.00 191.00 79.00 736
round(chisq_results$res, 2)
## mydata$clusterKMeansF
## mydata$GenderF 1 2 3 4
## M 0.23 -2.77 3.24 -0.11
## F -0.22 2.65 -3.09 0.11
kruskal.test(Age ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Age by clusterKMeansF
## Kruskal-Wallis chi-squared = 72.949, df = 3, p-value = 9.967e-16
wilcox_test(Age ~ clusterKMeansF, paired = FALSE, p.adjust.method = "bonferroni", data = mydata)
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Age 1 2 180 286 36116. 2.26e-13 1.36e-12 ****
## 2 Age 1 3 180 191 24776. 2 e-13 1.20e-12 ****
## 3 Age 1 4 180 79 9872. 6.51e- 7 3.91e- 6 ****
## 4 Age 2 3 286 191 29068 2.34e- 1 1 e+ 0 ns
## 5 Age 2 4 286 79 10330. 2.44e- 1 1 e+ 0 ns
## 6 Age 3 4 191 79 6318 3.6 e- 2 2.14e- 1 ns
aggregate(mydata$Age, by = list(mydata$clusterKMeansF), FUN = "median")
## Group.1 x
## 1 1 49
## 2 2 36
## 3 3 33
## 4 4 39
kruskal.test(PplTrust ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: PplTrust by clusterKMeansF
## Kruskal-Wallis chi-squared = 9.4093, df = 3, p-value = 0.02432
wilcox_test(PplTrust ~ clusterKMeansF, paired = FALSE, p.adjust.method = "bonferroni", data = mydata)
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 PplTrust 1 2 180 286 28239 0.075 0.448 ns
## 2 PplTrust 1 3 180 191 18726 0.134 0.804 ns
## 3 PplTrust 1 4 180 79 6430 0.217 1 ns
## 4 PplTrust 2 3 286 191 27120. 0.895 1 ns
## 5 PplTrust 2 4 286 79 9080 0.007 0.042 *
## 6 PplTrust 3 4 191 79 6162. 0.017 0.101 ns
aggregate(mydata$PplTrust, by = list(mydata$clusterKMeansF), FUN = "median")
## Group.1 x
## 1 1 5
## 2 2 5
## 3 3 5
## 4 4 5
kruskal.test(ClimateChanging ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: ClimateChanging by clusterKMeansF
## Kruskal-Wallis chi-squared = 0.041572, df = 3, p-value = 0.9978
aggregate(mydata$ClimateChanging, by = list(mydata$clusterKMeansF), FUN = "median")
## Group.1 x
## 1 1 3
## 2 2 3
## 3 3 3
## 4 4 3
kruskal.test(Happiness ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Happiness by clusterKMeansF
## Kruskal-Wallis chi-squared = 5.3491, df = 3, p-value = 0.1479
aggregate(mydata$Happiness, by = list(mydata$clusterKMeansF), FUN = "median")
## Group.1 x
## 1 1 8
## 2 2 8
## 3 3 8
## 4 4 8
I classified 1307 respondents aged 15+ living in Slovenia (736 after removing non-response units and outliers) based on three standardized variables.
For the hierarchical stage Ward's algorithm was used and, based on the analysis of the dendrogram, I decided to cut the solution into four groups.
The classification was then refined with K-means clustering.
Group 2 (38.9%, 286/736 of respondents) is the largest group: its members are less interested in politics than average and do not identify with a person for whom being rich is important (high ImpRich scores). It contains more women than expected (p < 0.05).
Group 3 (26.0%, 191/736 of respondents) is the second largest group. Given the ImpRich coding (1 = very much like me, 6 = not like me at all), its strongly negative standardized mean (-1.34) indicates respondents who do identify with a person for whom it is important to be rich and have a lot of money. The group contains more men and fewer women than expected (p < 0.05).
Group 1 (24.4%, 180/736 of respondents) is the third largest group. Its members are more interested in politics than average, and its median age (49) is the highest of all groups (p < 0.05).
Group 4 (10.7%, 79/736 of respondents) spends significantly more time on the internet per day than any other group (p < 0.001).
Across groups there are no statistically significant differences in worry about climate change (p = 0.998) or in happiness (p = 0.148); trust in people differs only between groups 2 and 4 (adjusted p = 0.042), with all group medians equal to 5.