Data source: European Social Survey European Research Infrastructure Consortium (ESS ERIC)
ESS round 8 (2016) edition 2.3 (published 24.11.23). Welfare attitudes, Attitudes to climate change.
Research question:
How can persons aged 15+ living in Slovenia be classified using these data? Which insights about the population can we gain from them?
For such tasks we use clustering: it divides units into groups, which we can then compare to find differences and similarities, and explore separately to gain more insight into their properties.
mydataSource <- read.table("./ess8_23.csv", header = TRUE, sep = ",", dec = ".", quote = "\"")
library(dplyr)
mydata <- mydataSource[c("idno", "wrclmch", "imprich", "ppltrst", "netustm", "polintr", "happy", "gndr", "cntry", "agea")]
names(mydata) <- c("ID", "ClimateChanging", "ImpRich", "PplTrust", "InternetUse", "PoliticsInt", "Happiness", "Gender", "Country", "Age")
mydata$GenderF <- factor(mydata$Gender, levels=c(1,2), labels=c("M", "F"))
Sampling:
Clusters of Enumeration Areas (CEA) and the Central Register of Population (CRP).
Two-stage probability sampling design with stratification at the first stage.
Probability: Multistage
Units: persons aged 15 and over living in Slovenia.
Sample size: 1307 persons.
Variables:
ID
ClimateChanging: How worried are you about climate change? (1 - Not at all worried, 5 - Extremely worried)
ImpRich: It is important to her/him to be rich. She/he wants to have a lot of money and expensive things. (1 = It’s very much like me, 6 = Not like me at all)
PplTrust: Would you say that most people can be trusted, or that you can’t be too careful in dealing with people? (0 = You can’t be too careful, 10 = Most people can be trusted)
InternetUse: On a typical day, about how much time do you spend using the internet on a computer, tablet, smartphone or other device, whether for work or personal use? In minutes.
PoliticsInt: How interested would you say you are in politics - are you… (1 = Very interested, 4 = Not at all interested)
Happiness: Taking all things together, how happy would you say you are? (0 = Extremely unhappy, 10 = Extremely happy)
Gender: Respondent’s gender. (1 = Male, 2 = Female)
Country: Country code (SI = Slovenia)
Age: Respondent’s age in years.
# Filtering out units with unanswered questions (ESS missing-value codes)
mydata <- mydata[mydata$ClimateChanging < 6, ]
mydata <- mydata[mydata$ImpRich < 7, ]
mydata <- mydata[mydata$PplTrust < 11, ]
mydata <- mydata[mydata$InternetUse < 6666, ]
mydata <- mydata[mydata$PoliticsInt < 5, ]
mydata <- mydata[mydata$Happiness < 11, ]
mydata <- mydata[mydata$Gender < 3, ]
# Keeping only data for Slovenia
mydata <- filter(mydata, Country == "SI")
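For reference, all of the row filters above can be expressed in a single dplyr pipeline. This is a sketch that assumes the same ESS missing-value codes; it is idempotent after the steps above:
mydata <- mydata %>%
  filter(ClimateChanging < 6, ImpRich < 7, PplTrust < 11,
         InternetUse < 6666, PoliticsInt < 5, Happiness < 11,
         Gender < 3, Country == "SI")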
head(mydata)
## ID ClimateChanging ImpRich PplTrust InternetUse PoliticsInt Happiness Gender
## 1 1 3 6 5 30 1 10 2
## 2 2 3 6 7 300 3 9 2
## 3 3 2 3 3 120 3 9 1
## 4 4 3 3 4 60 3 7 1
## 5 7 2 6 2 120 3 9 1
## 6 9 4 6 5 240 3 8 2
## Country Age GenderF
## 1 SI 57 F
## 2 SI 41 F
## 3 SI 29 M
## 4 SI 44 M
## 5 SI 32 M
## 6 SI 29 F
The “Not applicable” code on InternetUse cuts off quite a lot of data, so we likely lose a group (people who do not use the internet) that could otherwise show up in the clustering. If internet use is essential to the analysis, an option would be the ordinal frequency variable instead (“How often do you use the internet on these or any other devices, whether for work or personal use?”, Never … Every day).
I decided to accept this loss in order to keep a numeric ratio-scale variable among the clustering variables.
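A sketch of that alternative (not used further below): I assume the ESS8 column for the frequency question is named netusoft (1 = Never … 5 = Every day; codes 7/8/9 denote refusal / don't know / no answer). Keeping it in a separate data frame avoids touching mydata:
alt <- mydataSource[c("idno", "netusoft")] # assumed column name for the frequency item
alt <- alt[alt$netusoft < 6, ]             # drop missing-value codes 7/8/9
table(alt$netusoft)                        # distribution of internet-use frequency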
library(pastecs)
round(stat.desc(mydata), 2)
## ID ClimateChanging ImpRich PplTrust InternetUse PoliticsInt
## nbr.val 747.00 747.00 747.00 747.00 747.00 747.00
## nbr.null 0.00 0.00 0.00 46.00 1.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00
## min 1.00 1.00 1.00 0.00 0.00 1.00
## max 1307.00 5.00 6.00 10.00 1440.00 4.00
## range 1306.00 4.00 5.00 10.00 1440.00 3.00
## sum 490770.00 2373.00 3182.00 3486.00 129885.00 2065.00
## median 656.00 3.00 5.00 5.00 120.00 3.00
## mean 656.99 3.18 4.26 4.67 173.88 2.76
## SE.mean 13.60 0.03 0.05 0.09 5.91 0.03
## CI.mean.0.95 26.70 0.06 0.09 0.17 11.60 0.06
## var 138143.50 0.62 1.54 5.71 26092.67 0.73
## std.dev 371.68 0.79 1.24 2.39 161.53 0.85
## coef.var 0.57 0.25 0.29 0.51 0.93 0.31
## Happiness Gender Country Age GenderF
## nbr.val 747.00 747.00 NA 747.00 NA
## nbr.null 2.00 0.00 NA 0.00 NA
## nbr.na 0.00 0.00 NA 0.00 NA
## min 0.00 1.00 NA 15.00 NA
## max 10.00 2.00 NA 94.00 NA
## range 10.00 1.00 NA 79.00 NA
## sum 5859.00 1136.00 NA 29618.00 NA
## median 8.00 2.00 NA 40.00 NA
## mean 7.84 1.52 NA 39.65 NA
## SE.mean 0.06 0.02 NA 0.56 NA
## CI.mean.0.95 0.12 0.04 NA 1.10 NA
## var 2.69 0.25 NA 236.54 NA
## std.dev 1.64 0.50 NA 15.38 NA
## coef.var 0.21 0.33 NA 0.39 NA
# Standardizing
mydata$ClimateChanging_z <- scale(mydata$ClimateChanging)
mydata$ImpRich_z <- scale(mydata$ImpRich)
mydata$InternetUse_z <- scale(mydata$InternetUse)
mydata$PplTrust_z <- scale(mydata$PplTrust)
mydata$PoliticsInt_z <- scale(mydata$PoliticsInt)
mydata$Happiness_z <- scale(mydata$Happiness)
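The six scale() calls can also be written compactly with dplyr::across(); wrapping each result in as.numeric() additionally avoids the one-column-matrix columns that scale() creates (these can trigger a dplyr deprecation warning in later filter() calls). A behavior-identical sketch:
mydata <- mydata %>%
  mutate(across(c(ClimateChanging, ImpRich, InternetUse, PplTrust, PoliticsInt, Happiness),
                ~ as.numeric(scale(.x)),
                .names = "{.col}_z"))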
library(Hmisc)
rcorr(as.matrix(mydata[, c("ClimateChanging_z", "ImpRich_z", "InternetUse_z", "PplTrust_z", "PoliticsInt_z", "Happiness_z")]), type="pearson")
## ClimateChanging_z ImpRich_z InternetUse_z PplTrust_z
## ClimateChanging_z 1.00 -0.02 0.03 -0.01
## ImpRich_z -0.02 1.00 -0.11 0.01
## InternetUse_z 0.03 -0.11 1.00 0.05
## PplTrust_z -0.01 0.01 0.05 1.00
## PoliticsInt_z -0.02 -0.01 -0.01 -0.14
## Happiness_z -0.08 0.04 0.04 0.13
## PoliticsInt_z Happiness_z
## ClimateChanging_z -0.02 -0.08
## ImpRich_z -0.01 0.04
## InternetUse_z -0.01 0.04
## PplTrust_z -0.14 0.13
## PoliticsInt_z 1.00 0.04
## Happiness_z 0.04 1.00
##
## n= 747
##
##
## P
## ClimateChanging_z ImpRich_z InternetUse_z PplTrust_z
## ClimateChanging_z 0.5671 0.4617 0.6974
## ImpRich_z 0.5671 0.0029 0.6868
## InternetUse_z 0.4617 0.0029 0.1366
## PplTrust_z 0.6974 0.6868 0.1366
## PoliticsInt_z 0.5898 0.8548 0.8482 0.0000
## Happiness_z 0.0351 0.2544 0.2425 0.0002
## PoliticsInt_z Happiness_z
## ClimateChanging_z 0.5898 0.0351
## ImpRich_z 0.8548 0.2544
## InternetUse_z 0.8482 0.2425
## PplTrust_z 0.0000 0.0002
## PoliticsInt_z 0.2246
## Happiness_z 0.2246
The clustering variables are at most weakly correlated (|r| of at most 0.14), which is good for clustering: each variable carries distinct information.
After trying various combinations of variables for clustering I settled on three of them:
importance of being rich, internet use, and interest in politics. I will use these variables for the classification.
We assume there is a theoretical basis for classifying along these dimensions.
mydata$Dissimilarity <- as.numeric(sqrt(mydata$ImpRich_z^2 + mydata$InternetUse_z^2 + mydata$PoliticsInt_z^2)) # Euclidean distance from the overall centroid; as.numeric() drops the one-column matrix that scale() produces
head(mydata[order(-mydata$Dissimilarity), c("Dissimilarity")], 20)
## [,1]
## [1,] 7.993320
## [2,] 4.926979
## [3,] 4.694895
## [4,] 4.138332
## [5,] 4.091771
## [6,] 3.995694
## [7,] 3.850469
## [8,] 3.548261
## [9,] 3.405282
## [10,] 3.405282
## [11,] 3.398627
## [12,] 3.345353
## [13,] 3.328939
## [14,] 3.320717
## [15,] 3.251921
## [16,] 3.188211
## [17,] 3.176468
## [18,] 3.176468
## [19,] 3.176468
## [20,] 3.119500
mydata <- filter(mydata, Dissimilarity < 3.8) # removing the 7 most dissimilar units (Dissimilarity > 3.8), leaving 740 of 747
library(ggplot2)
ggplot(mydata, aes(x = Dissimilarity)) +
geom_histogram(binwidth = 0.1, colour = "Black")
library(factoextra)
distance2 <- get_dist(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], method = "euclidean") ^ 2 # squared Euclidean distances
fviz_dist(distance2)
get_clust_tendency(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], n=nrow(mydata)-1, graph = FALSE)
## $hopkins_stat
## [1] 0.7283589
##
## $plot
## NULL
The Hopkins statistic (0.73) is above 0.5, so the data show a clustering tendency and are suitable for classification.
WARD <- mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 740
For the first stage we use Ward's method (ward.D2) on Euclidean distances; ward.D2 implements Ward's criterion by squaring the dissimilarities internally.
fviz_dend(WARD)
After trying several combinations of variables and numbers of groups, I cut the dendrogram at four groups.
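Before committing to k = 4, the choice can be cross-checked with a standard index; a sketch using factoextra's average-silhouette criterion with the same Ward linkage (hcut() defaults to ward.D2):
fviz_nbclust(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], hcut, method = "silhouette", k.max = 8)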
mydata$clusterWard <- cutree(WARD, k = 4)
init_leaders <- aggregate(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], by = list(mydata$clusterWard), FUN = mean)
init_leaders
## Group.1 ImpRich_z InternetUse_z PoliticsInt_z
## 1 1 0.06416173 -0.4012664 -1.0952644
## 2 2 0.79410172 -0.3685200 0.6839276
## 3 3 -0.97510248 -0.2063242 0.6889884
## 4 4 -0.19388327 1.7531596 -0.3776804
KMeans <- hkmeans(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], k=4, hc.metric = "euclidean", hc.method = "ward.D2")
KMeans
## Hierarchical K-means clustering with 4 clusters of sizes 223, 286, 143, 88
##
## Cluster means:
## ImpRich_z InternetUse_z PoliticsInt_z
## 1 0.1955457 -0.3421615 -1.1014149
## 2 0.6081403 -0.3412554 0.6944857
## 3 -1.3933948 -0.1872885 0.4812825
## 4 -0.1544164 1.9355800 -0.2700538
##
## Clustering vector:
## [1] 1 2 3 3 2 2 1 2 2 1 2 2 4 2 1 3 3 2 1 4 2 3 2 2 1 1 1 1 3 1 1 1 3 2 1 1 3
## [38] 1 1 2 3 1 2 3 2 2 1 2 2 3 2 2 2 3 2 2 3 1 2 3 2 1 2 2 4 2 4 1 1 4 3 2 3 1
## [75] 1 1 3 2 2 1 2 2 1 1 3 3 2 3 1 2 4 1 4 1 2 1 1 1 4 1 1 1 3 2 4 3 4 3 3 3 2
## [112] 1 4 4 3 1 2 4 1 1 4 2 1 4 1 3 2 1 2 1 4 1 2 1 2 2 2 2 1 2 1 2 1 4 1 1 1 2
## [149] 2 2 2 2 1 3 1 2 2 3 3 4 3 4 1 2 2 2 1 2 1 1 3 3 2 2 1 1 3 3 4 4 2 1 3 3 3
## [186] 4 2 1 1 3 2 2 2 3 2 2 3 3 2 1 2 2 3 2 2 1 3 3 1 1 1 2 1 1 1 2 1 2 2 2 3 1
## [223] 2 2 2 1 4 1 2 2 2 2 2 2 4 1 1 2 1 3 1 1 2 3 4 3 2 1 4 2 4 3 1 2 1 3 2 4 2
## [260] 2 2 1 1 1 2 2 1 2 2 1 1 4 2 2 2 2 2 2 1 2 3 1 4 2 1 1 1 2 2 2 4 2 2 2 1 1
## [297] 2 4 3 1 3 1 1 1 1 1 2 3 2 2 3 2 3 2 2 2 2 2 3 2 2 1 2 3 2 2 2 1 1 1 2 2 4
## [334] 4 2 1 1 4 1 2 1 3 3 1 3 1 2 4 2 1 2 4 4 1 2 3 1 2 4 2 1 4 1 1 4 3 2 2 4 2
## [371] 1 2 3 4 4 2 4 4 1 3 1 4 3 2 4 4 1 3 1 3 2 3 1 3 2 1 1 3 2 3 2 2 1 2 2 2 3
## [408] 2 1 3 1 1 2 2 4 4 3 2 1 1 3 3 4 2 3 1 1 2 2 3 3 3 3 3 2 2 2 3 3 3 2 2 2 1
## [445] 1 2 3 4 3 3 4 3 2 1 3 2 2 2 1 1 1 1 2 2 3 4 2 3 1 4 3 2 3 1 1 2 2 2 3 2 2
## [482] 1 1 1 1 4 2 1 3 2 2 4 1 2 1 1 4 1 1 4 2 2 2 3 2 2 1 2 2 1 2 3 3 4 2 3 1 2
## [519] 2 2 2 1 2 2 2 1 3 2 2 2 4 1 2 4 1 2 2 3 2 1 2 1 1 1 2 2 3 2 2 1 3 2 1 2 2
## [556] 1 3 4 2 3 1 3 2 2 1 4 2 2 2 3 1 2 2 3 2 2 2 1 4 1 1 1 2 3 1 3 3 2 1 2 1 3
## [593] 3 2 1 2 4 4 1 3 3 4 3 3 2 4 3 2 2 2 2 2 1 3 1 1 3 1 2 2 2 1 2 1 2 1 2 4 2
## [630] 4 4 2 1 2 1 1 3 1 3 1 2 1 1 4 1 4 1 1 2 1 2 4 2 1 4 3 3 1 2 2 2 2 3 1 3 2
## [667] 2 1 2 4 4 1 2 2 3 1 2 2 4 2 3 1 3 2 1 1 2 1 3 3 1 2 2 3 1 3 3 2 1 4 4 2 2
## [704] 2 1 3 4 2 2 1 2 2 2 1 4 2 4 3 2 4 4 1 2 1 3 1 2 2 2 2 1 4 3 1 2 3 1 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 256.1372 232.3530 156.7989 181.6751
## (between_SS / total_SS = 59.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
fviz_cluster(KMeans, palette = "jama", repel = FALSE, ggtheme = theme_classic(), geom = c("point")) # points only
fviz_cluster(KMeans, palette = "jama", repel = FALSE, ggtheme = theme_classic()) # with row labels, to identify outliers
mydata[c(65, 644, 380, 16), ] # inspecting the outliers identified in the labelled plot
## ID ClimateChanging ImpRich PplTrust InternetUse PoliticsInt Happiness
## 65 135 3 2 5 480 1 7
## 644 1134 3 2 7 600 2 7
## 380 668 5 1 5 480 3 10
## 16 36 3 2 9 480 4 10
## Gender Country Age GenderF ClimateChanging_z ImpRich_z InternetUse_z
## 65 1 SI 27 M -0.2243804 -1.821892 1.895129
## 644 2 SI 28 F -0.2243804 -1.821892 2.638015
## 380 1 SI 60 M 2.3151975 -2.628145 1.895129
## 16 1 SI 46 M -0.2243804 -1.821892 1.895129
## PplTrust_z PoliticsInt_z Happiness_z Dissimilarity clusterWard
## 65 0.1395228 -2.0689559 -0.5139662 3.345353 4
## 644 0.9766599 -0.8963382 -0.5139662 3.328939 4
## 380 0.1395228 0.2762794 1.3142850 3.251921 4
## 16 1.8137970 1.4488970 1.3142850 3.001685 4
mydata <- filter(mydata, ID !=135 & ID != 1134 & ID != 668 & ID != 36) # removing outliers and repeating the classification
mydata$ClimateChanging_z <- scale(mydata$ClimateChanging) # re-standardizing after outlier removal
mydata$ImpRich_z <- scale(mydata$ImpRich)
mydata$InternetUse_z <- scale(mydata$InternetUse)
mydata$PplTrust_z <- scale(mydata$PplTrust)
mydata$PoliticsInt_z <- scale(mydata$PoliticsInt)
mydata$Happiness_z <- scale(mydata$Happiness)
WARD <- mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 736
fviz_dend(WARD)
mydata$clusterWard <- cutree(WARD, k = 4)
init_leaders <- aggregate(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], by = list(mydata$clusterWard), FUN = mean)
KMeans <- hkmeans(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")], k=4, hc.metric = "euclidean", hc.method = "ward.D2")
KMeans
## Hierarchical K-means clustering with 4 clusters of sizes 180, 286, 191, 79
##
## Cluster means:
## ImpRich_z InternetUse_z PoliticsInt_z
## 1 0.49243017 -0.3172048 -1.12940929
## 2 0.59928801 -0.3256894 0.69914557
## 3 -1.34336849 -0.1826724 0.09353864
## 4 -0.04367619 2.3434741 -0.18389672
##
## Clustering vector:
## [1] 1 2 3 3 2 2 3 2 2 1 2 2 4 2 1 3 2 1 4 2 3 2 2 3 1 1 1 3 1 1 1 3 2 1 1 3 1
## [38] 1 2 3 1 2 3 2 2 1 2 2 3 2 2 2 3 2 2 3 3 2 3 2 1 2 2 2 4 3 1 4 3 2 3 1 1 1
## [75] 3 2 2 1 2 2 1 1 3 3 2 3 1 2 4 1 4 1 2 1 1 1 4 1 1 3 3 2 4 3 4 3 3 3 2 1 4
## [112] 3 3 1 2 4 1 1 4 2 1 4 1 3 2 1 2 1 4 1 2 1 2 2 2 2 1 2 1 2 1 4 1 1 1 2 2 2
## [149] 2 2 3 3 1 2 2 3 3 4 3 4 1 2 2 2 1 2 1 3 3 3 2 2 1 3 3 3 4 3 2 3 3 3 3 4 2
## [186] 1 1 3 2 2 2 3 2 2 3 3 2 1 2 2 3 2 2 1 3 3 1 1 1 2 3 1 1 2 1 2 2 2 3 1 2 2
## [223] 2 1 4 1 2 2 2 2 2 2 4 1 3 2 1 3 3 3 2 3 4 3 2 1 4 2 4 3 1 2 3 3 2 4 2 2 2
## [260] 3 1 1 2 2 1 2 2 1 1 4 2 2 2 2 2 2 1 2 3 1 4 2 1 1 3 2 2 2 4 2 2 2 1 1 2 4
## [297] 3 1 3 3 1 1 3 1 2 3 2 2 3 2 3 2 2 2 2 2 3 2 2 1 2 3 2 2 2 1 1 1 2 2 4 4 2
## [334] 1 1 4 1 2 1 3 3 1 3 1 2 4 2 1 2 4 4 3 2 3 3 2 4 2 1 4 3 1 4 3 2 2 3 2 1 2
## [371] 3 3 4 2 3 4 1 3 4 3 2 4 4 3 3 1 3 2 3 1 3 2 1 3 3 2 3 2 2 1 2 2 2 3 2 1 3
## [408] 1 1 2 2 4 4 3 2 3 3 3 3 4 2 3 1 1 2 2 3 3 3 3 3 2 2 2 3 3 3 2 2 2 1 3 2 3
## [445] 4 3 3 4 3 2 1 3 2 2 2 1 1 1 1 2 2 3 4 2 3 1 4 3 2 3 1 1 2 2 2 3 2 2 1 1 1
## [482] 1 4 2 3 3 2 2 4 1 2 1 1 4 1 1 4 2 2 2 3 2 2 1 2 2 1 2 3 3 4 2 3 3 2 2 2 2
## [519] 1 2 2 2 3 3 2 2 2 4 1 2 4 1 2 2 3 2 1 2 3 1 1 2 2 3 2 2 3 3 2 1 2 2 1 3 4
## [556] 2 3 1 3 2 2 1 4 2 2 2 3 3 2 2 3 2 2 2 1 4 1 1 3 2 3 1 3 3 2 1 2 1 3 3 2 1
## [593] 2 4 4 1 3 3 4 3 3 2 4 3 2 2 2 2 2 3 3 1 1 3 1 2 2 2 1 2 3 2 1 2 4 2 4 4 2
## [630] 1 2 3 1 3 1 3 1 2 1 1 1 3 3 1 2 1 2 4 2 3 4 3 3 1 2 2 2 2 3 1 3 2 2 1 2 4
## [667] 4 1 2 2 3 1 2 2 4 2 3 1 3 2 1 1 2 1 3 3 1 2 2 3 1 3 3 2 3 4 4 2 2 2 3 3 4
## [704] 2 2 1 2 2 2 1 4 2 4 3 2 4 4 3 2 1 3 1 2 2 2 2 1 3 3 3 2 3 1 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 167.9850 254.6558 267.6018 161.1347
## (between_SS / total_SS = 61.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
between_SS / total_SS = 61.4% is the share of total variability explained by the grouping. The groups are reasonably well separated, so we can continue.
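Separation can also be judged per unit with silhouette widths; a sketch using the cluster package (units with negative widths lie closer to a neighbouring cluster than to their own):
library(cluster)
sil <- silhouette(KMeans$cluster, dist(mydata[c("ImpRich_z", "InternetUse_z", "PoliticsInt_z")]))
fviz_silhouette(sil)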
fviz_cluster(KMeans, palette = "jama", repel = FALSE, ggtheme = theme_classic(), geom = c("point"))
mydata$clusterKMeans <- KMeans$cluster
table(mydata$clusterWard, mydata$clusterKMeans)
##
## 1 2 3 4
## 1 173 0 68 5
## 2 0 233 0 0
## 3 0 53 123 2
## 4 7 0 0 72
The K-means step reassigned part of the units relative to the Ward solution; most notably, 68 units moved from Ward group 1 to K-means group 3.
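The cross-table can be condensed into a single agreement measure, the adjusted Rand index (1 = identical partitions, values near 0 = chance-level agreement); a sketch assuming the mclust package is installed:
library(mclust)
adjustedRandIndex(mydata$clusterWard, mydata$clusterKMeans)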
centroids <- KMeans$centers
centroids
## ImpRich_z InternetUse_z PoliticsInt_z
## 1 0.49243017 -0.3172048 -1.12940929
## 2 0.59928801 -0.3256894 0.69914557
## 3 -1.34336849 -0.1826724 0.09353864
## 4 -0.04367619 2.3434741 -0.18389672
library(tidyr)
figure <- as.data.frame(centroids)
figure$id <- 1:nrow(figure)
figure$Groups <- factor(figure$id, levels = c(1,2,3,4), labels=c("1", "2", "3", "4"))
ggplot(pivot_longer(figure, cols=c(ImpRich_z, InternetUse_z, PoliticsInt_z)), aes(x=name, y = value)) +
geom_hline(yintercept=0) +
geom_point(aes(shape=Groups, col=Groups), size=4) +
geom_line(aes(group=id), linewidth = 1) +
ylim(-1.5,2.5)
library(psych)
describeBy(mydata, group = mydata$clusterKMeansF) # Note: clusterKMeansF is created only further below, so describeBy() finds no grouping variable and falls back to overall descriptives (hence the warning)
## Warning in describeBy(mydata, group = mydata$clusterKMeansF): no grouping
## variable requested
## vars n mean sd median trimmed mad min max
## ID 1 736 657.11 372.04 656.50 655.41 475.17 1.00 1307.00
## ClimateChanging 2 736 3.17 0.79 3.00 3.20 0.00 1.00 5.00
## ImpRich 3 736 4.28 1.22 5.00 4.35 1.48 1.00 6.00
## PplTrust 4 736 4.67 2.39 5.00 4.75 2.97 0.00 10.00
## InternetUse 5 736 165.39 143.19 120.00 139.81 88.96 0.00 720.00
## PoliticsInt 6 736 2.76 0.85 3.00 2.78 1.48 1.00 4.00
## Happiness 7 736 7.84 1.64 8.00 8.02 1.48 0.00 10.00
## Gender 8 736 1.52 0.50 2.00 1.53 0.00 1.00 2.00
## Country* 9 736 1.00 0.00 1.00 1.00 0.00 1.00 1.00
## Age 10 736 39.73 15.42 40.00 39.17 19.27 15.00 94.00
## GenderF* 11 736 1.52 0.50 2.00 1.53 0.00 1.00 2.00
## ClimateChanging_z 12 736 0.00 1.00 -0.22 0.03 0.00 -2.76 2.32
## ImpRich_z 13 736 0.00 1.00 0.59 0.06 1.21 -2.68 1.41
## InternetUse_z 14 736 0.00 1.00 -0.32 -0.18 0.62 -1.16 3.87
## PplTrust_z 15 736 0.00 1.00 0.14 0.03 1.24 -1.95 2.23
## PoliticsInt_z 16 736 0.00 1.00 0.28 0.02 1.75 -2.08 1.46
## Happiness_z 17 736 0.00 1.00 0.10 0.11 0.90 -4.78 1.32
## Dissimilarity 18 736 1.54 0.62 1.47 1.50 0.58 0.35 3.55
## clusterWard 19 736 2.12 1.00 2.00 2.03 1.48 1.00 4.00
## clusterKMeans 20 736 2.23 0.94 2.00 2.16 1.48 1.00 4.00
## range skew kurtosis se
## ID 1306.00 0.04 -1.18 13.71
## ClimateChanging 4.00 -0.14 0.64 0.03
## ImpRich 5.00 -0.49 -0.67 0.05
## PplTrust 10.00 -0.23 -0.75 0.09
## InternetUse 720.00 1.59 2.18 5.28
## PoliticsInt 3.00 -0.12 -0.72 0.03
## Happiness 10.00 -1.17 2.04 0.06
## Gender 1.00 -0.09 -1.99 0.02
## Country* 0.00 NaN NaN 0.00
## Age 79.00 0.27 -0.70 0.57
## GenderF* 1.00 -0.09 -1.99 0.02
## ClimateChanging_z 5.08 -0.14 0.64 0.04
## ImpRich_z 4.09 -0.49 -0.67 0.04
## InternetUse_z 5.03 1.59 2.18 0.04
## PplTrust_z 4.18 -0.23 -0.75 0.04
## PoliticsInt_z 3.54 -0.12 -0.72 0.04
## Happiness_z 6.10 -1.17 2.04 0.04
## Dissimilarity 3.20 0.60 0.06 0.02
## clusterWard 3.00 0.41 -0.96 0.04
## clusterKMeans 3.00 0.31 -0.81 0.03
library(rstatix)
mydata$clusterKMeansF <- factor(mydata$clusterKMeans, levels = c(1,2,3,4), labels=c("1", "2", "3", "4"))
# Normality tests for each group and variable
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(ClimateChanging)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 ClimateChanging 0.871 2.81e-11
## 2 2 ClimateChanging 0.848 4.52e-16
## 3 3 ClimateChanging 0.820 4.20e-14
## 4 4 ClimateChanging 0.841 9.54e- 8
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(ImpRich)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 ImpRich 0.843 1.23e-12
## 2 2 ImpRich 0.777 1.86e-19
## 3 3 ImpRich 0.635 5.13e-20
## 4 4 ImpRich 0.882 2.49e- 6
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(PoliticsInt)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 PoliticsInt 0.483 1.08e-22
## 2 2 PoliticsInt 0.606 4.49e-25
## 3 3 PoliticsInt 0.842 4.01e-13
## 4 4 PoliticsInt 0.859 3.72e- 7
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(Age)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Age 0.991 0.349
## 2 2 Age 0.963 0.00000102
## 3 3 Age 0.934 0.000000133
## 4 4 Age 0.976 0.147
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(PplTrust)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 PplTrust 0.959 0.0000370
## 2 2 PplTrust 0.957 0.000000170
## 3 3 PplTrust 0.957 0.0000139
## 4 4 PplTrust 0.929 0.000272
mydata %>% group_by(clusterKMeansF) %>% shapiro_test(Happiness)
## # A tibble: 4 × 4
## clusterKMeansF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Happiness 0.918 1.59e- 8
## 2 2 Happiness 0.872 1.05e-14
## 3 3 Happiness 0.888 9.95e-11
## 4 4 Happiness 0.905 2.23e- 5
Normality holds only for Age in groups 1 and 4; in all other cases it is rejected, so we use non-parametric tests.
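The Shapiro-Wilk results can be checked visually with normal Q-Q plots per group, for example for Age:
ggplot(mydata, aes(sample = Age)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~clusterKMeansF, ncol = 2)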
# Just to see how the distributions of variables look within groups.
ggplot(mydata, aes(x=ImpRich)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=InternetUse)) +
geom_histogram(binwidth=60, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=PoliticsInt)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=Age)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=PplTrust)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
ggplot(mydata, aes(x=Happiness)) +
geom_histogram(binwidth=1, colour="green") +
facet_wrap(~clusterKMeansF, ncol=2)
kruskal.test(ImpRich ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: ImpRich by clusterKMeansF
## Kruskal-Wallis chi-squared = 450.64, df = 3, p-value < 2.2e-16
kruskal.test(InternetUse ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: InternetUse by clusterKMeansF
## Kruskal-Wallis chi-squared = 220.45, df = 3, p-value < 2.2e-16
kruskal.test(PoliticsInt ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: PoliticsInt by clusterKMeansF
## Kruskal-Wallis chi-squared = 394.95, df = 3, p-value < 2.2e-16
wilcox_test(InternetUse ~ clusterKMeansF, paired = FALSE, p.adjust.method = "bonferroni", data = mydata)
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 InternetUse 1 2 180 286 26066. 8.16e- 1 1 e+ 0 ns
## 2 InternetUse 1 3 180 191 14771 1.8 e- 2 1.08e- 1 ns
## 3 InternetUse 1 4 180 79 13 1.08e-37 6.48e-37 ****
## 4 InternetUse 2 3 286 191 23150 4 e- 3 2.6 e- 2 *
## 5 InternetUse 2 4 286 79 32 2.12e-42 1.27e-41 ****
## 6 InternetUse 3 4 191 79 25.5 2.90e-38 1.74e-37 ****
We use non-parametric tests here because normality is not met. For all three clustering variables we reject H0 that the distributions have the same location in every group, so the variables separate the groups well; for InternetUse the pairwise differences are driven mainly by group 4 (groups 1-2 and 1-3 do not differ significantly).
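Significance alone does not tell how strongly each variable separates the groups; rank-based eta-squared effect sizes complement the Kruskal-Wallis tests. A sketch using rstatix::kruskal_effsize():
mydata %>% kruskal_effsize(ImpRich ~ clusterKMeansF)
mydata %>% kruskal_effsize(InternetUse ~ clusterKMeansF)
mydata %>% kruskal_effsize(PoliticsInt ~ clusterKMeansF)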
Next, we validate the classification using variables that were not used to construct it.
chisq_results <- chisq.test(mydata$GenderF, mydata$clusterKMeansF)
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$GenderF and mydata$clusterKMeansF
## X-squared = 34.889, df = 3, p-value = 1.286e-07
addmargins(chisq_results$observed)
## mydata$clusterKMeansF
## mydata$GenderF 1 2 3 4 Sum
## M 88 104 122 37 351
## F 92 182 69 42 385
## Sum 180 286 191 79 736
round(addmargins(chisq_results$expected), 2)
## mydata$clusterKMeansF
## mydata$GenderF 1 2 3 4 Sum
## M 85.84 136.39 91.09 37.68 351
## F 94.16 149.61 99.91 41.32 385
## Sum 180.00 286.00 191.00 79.00 736
round(chisq_results$res, 2)
## mydata$clusterKMeansF
## mydata$GenderF 1 2 3 4
## M 0.23 -2.77 3.24 -0.11
## F -0.22 2.65 -3.09 0.11
kruskal.test(Age ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Age by clusterKMeansF
## Kruskal-Wallis chi-squared = 72.949, df = 3, p-value = 9.967e-16
wilcox_test(Age ~ clusterKMeansF, paired = FALSE, p.adjust.method = "bonferroni", data = mydata)
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Age 1 2 180 286 36116. 2.26e-13 1.36e-12 ****
## 2 Age 1 3 180 191 24776. 2 e-13 1.20e-12 ****
## 3 Age 1 4 180 79 9872. 6.51e- 7 3.91e- 6 ****
## 4 Age 2 3 286 191 29068 2.34e- 1 1 e+ 0 ns
## 5 Age 2 4 286 79 10330. 2.44e- 1 1 e+ 0 ns
## 6 Age 3 4 191 79 6318 3.6 e- 2 2.14e- 1 ns
aggregate(mydata$Age, by = list(mydata$clusterKMeansF), FUN = "median")
## Group.1 x
## 1 1 49
## 2 2 36
## 3 3 33
## 4 4 39
kruskal.test(PplTrust ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: PplTrust by clusterKMeansF
## Kruskal-Wallis chi-squared = 9.4093, df = 3, p-value = 0.02432
wilcox_test(PplTrust ~ clusterKMeansF, paired = FALSE, p.adjust.method = "bonferroni", data = mydata)
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 PplTrust 1 2 180 286 28239 0.075 0.448 ns
## 2 PplTrust 1 3 180 191 18726 0.134 0.804 ns
## 3 PplTrust 1 4 180 79 6430 0.217 1 ns
## 4 PplTrust 2 3 286 191 27120. 0.895 1 ns
## 5 PplTrust 2 4 286 79 9080 0.007 0.042 *
## 6 PplTrust 3 4 191 79 6162. 0.017 0.101 ns
aggregate(mydata$PplTrust, by = list(mydata$clusterKMeansF), FUN = "median")
## Group.1 x
## 1 1 5
## 2 2 5
## 3 3 5
## 4 4 5
kruskal.test(ClimateChanging ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: ClimateChanging by clusterKMeansF
## Kruskal-Wallis chi-squared = 0.041572, df = 3, p-value = 0.9978
aggregate(mydata$ClimateChanging, by = list(mydata$clusterKMeansF), FUN = "median")
## Group.1 x
## 1 1 3
## 2 2 3
## 3 3 3
## 4 4 3
kruskal.test(Happiness ~ clusterKMeansF, data=mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Happiness by clusterKMeansF
## Kruskal-Wallis chi-squared = 5.3491, df = 3, p-value = 0.1479
aggregate(mydata$Happiness, by = list(mydata$clusterKMeansF), FUN = "median")
## Group.1 x
## 1 1 8
## 2 2 8
## 3 3 8
## 4 4 8
I classified 1307 respondents aged 15+ living in Slovenia (736 after removing non-response units and outliers) based on three standardized variables.
For the hierarchical stage Ward's algorithm was used and, based on the analysis of the dendrogram, I decided to cut the solution into four groups.
The classification was then refined with K-means clustering.
Group 2 (38.9%, 286/736 of respondents) is the largest group: its members are less interested in politics than average and do not identify with a person for whom being rich is important (high ImpRich scores). It contains more women than expected (p < 0.05).
Group 3 (26.0%, 191/736 of respondents) is the second largest group. Given the ImpRich coding (1 = very much like me, 6 = not like me at all), its strongly negative standardized mean (-1.34) indicates respondents who do identify with a person for whom it is important to be rich and have a lot of money. The group contains more men and fewer women than expected (p < 0.05).
Group 1 (24.4%, 180/736 of respondents) is the third largest group. Its members are more interested in politics than average, and its median age (49) is the highest of all groups (p < 0.05).
Group 4 (10.7%, 79/736 of respondents) spends significantly more time on the internet per day than any other group (p < 0.001).
Across groups there are no statistically significant differences in worry about climate change (p = 0.998) or in happiness (p = 0.148); trust in people differs only between groups 2 and 4 (adjusted p = 0.042), with all group medians equal to 5.