1. Variables used for clustering and the rationale behind their selection

To cluster the data set, I selected the following variables: Life.Satisfaction, school_closing_x, workplace_closing_x, cancel_events_x, stay_home_restrictions_x, and demo_age.

* Rationale: The primary goal of this analysis is to examine how COVID-19 policies relate to people’s sense of well-being. Life.Satisfaction therefore serves as the core well-being metric. The four policy variables (school closures, workplace closures, canceled events, and stay-at-home restrictions) represent the range of external, government-mandated disruptions to daily life. Finally, I included demo_age because age is a key demographic factor: it affected both a person’s vulnerability to the virus and how strongly these policy shifts changed their lifestyle and psychological state.
2. Process for Hierarchical and K-means Clustering

* Data Preparation: First, I subsetted the data to include only the relevant variables, coerced every column to numeric, removed any rows with missing values (na.omit) so that the distance calculations would run cleanly, and standardized the variables with the scale() function so that variables with larger numerical ranges (like age) would not dominate the distance calculations.
* Hierarchical Clustering: I calculated the Euclidean distances between data points and applied agglomerative hierarchical clustering using Ward’s method (ward.D). I plotted a dendrogram and cut the tree into \(k=3\) clusters. To evaluate the stability of these clusters, I used the clusterboot function with 100 bootstrap resamples.
* K-means Clustering: I first used the Mclust function from the mclust package to identify a suggested number of clusters; the selected Gaussian mixture model had 5 components. However, to compare the two methods on equal terms, I fit K-means with \(k=3\) clusters, using a maximum of 100 iterations and 100 random starts (nstart=100). Finally, I evaluated the stability of this model using clusterboot with 100 resamples.
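To illustrate why the scale() step matters, here is a minimal sketch on made-up values (not survey data). The `toy` data frame and its four rows are assumptions chosen only to make the effect visible; only base R is used.

```r
# Toy data (made-up values, not from the survey): age in years and a
# policy level coded 0-3.
toy <- data.frame(
  demo_age         = c(20, 22, 60, 62),
  school_closing_x = c(0, 3, 0, 3)
)

# Unscaled Euclidean distances: differences in age dominate.
d_raw <- as.matrix(dist(toy))

# After scale(), each column has mean 0 and standard deviation 1,
# so both variables contribute on a comparable footing.
d_scaled <- as.matrix(dist(scale(toy)))

# Row 1 vs row 2: nearly the same age, opposite policy levels.
# Row 1 vs row 3: same policy level, 40 years apart.
print(round(d_raw[1, 2:3], 2))
print(round(d_scaled[1, 2:3], 2))
```

In the unscaled matrix the 40-year age gap swamps the policy difference. After scaling, the two comparisons end up at similar distances, which is the behavior the clustering relies on.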
3. Comparison of Results

I did not get the same results from the two methods with respect to the reliability and stability of the clusters.

* Hierarchical clustering yielded highly stable results. The bootstrap evaluation returned average Jaccard values of 0.996, 0.978, and 0.978. By the commonly used rule of thumb (average Jaccard of 0.85 or above), all three clusters are considered highly stable.
* Conversely, K-means produced largely unstable results for the 3-cluster solution. Its bootstrap evaluation returned average Jaccard values of 0.458, 0.478, and 0.884. Two of the three clusters fell well below the 0.60 threshold under which a cluster should not be trusted; only the third cleared 0.85. This suggests that K-means struggled to find consistent groupings in this six-variable space.
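For reference, the stability figures above are averages of Jaccard similarities. The notation below is my own shorthand and reflects my understanding of how clusterboot arrives at its bootmean values, so it should be read as a sketch rather than a formal definition: \(C\) is a cluster from the original fit, \(X_b\) is the set of observations drawn in bootstrap resample \(b\), \(\mathcal{C}_b\) is the clustering found on that resample, and \(B=100\).

\[
J(C, D) = \frac{|C \cap D|}{|C \cup D|},
\qquad
\bar{J}(C) = \frac{1}{B} \sum_{b=1}^{B} \max_{D \in \mathcal{C}_b} J(C \cap X_b,\, D)
\]

Read this way, a value such as 0.458 means that, on average, even the best-matching cluster in a resample shared less than half of the combined membership with the original cluster, while 0.996 means the cluster was recovered almost exactly in nearly every resample.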
4. Selection of the “Better” Solution and Group Descriptions

Based on its much higher bootstrap stability scores, the hierarchical clustering solution is the “better” one for this dataset. Based on the partitioning of the hierarchical model, the respondents can be categorized into three groups:

* Group 1: The Heavily Disrupted. This group generally represents individuals who experienced the strictest combination of policy measures (simultaneous workplace closures, school closures, and strict stay-at-home orders). Their life satisfaction scores tend to reflect the strain of this high level of disruption.
* Group 2: The Moderately Restricted. This group consists of individuals who faced moderate policy interventions (e.g., event cancellations and workplace adjustments, but perhaps softer stay-at-home restrictions).
* Group 3: The Least Impacted. This group represents respondents who experienced the lowest levels of direct policy-based disruption to their daily routines. It also differs from Group 1 in demographic profile (often in age) and in baseline subjective well-being.
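Group descriptions like these are easiest to check against per-cluster sizes and means. The sketch below shows one way to build such a profile table in base R. It runs on simulated stand-in data (the `toy` data frame, its values, and the seed are all assumptions), so its numbers say nothing about the actual survey.

```r
set.seed(1)  # arbitrary seed, only to make the toy example repeatable

# Simulated stand-in for the cleaned survey data (hypothetical values).
toy <- data.frame(
  Life.Satisfaction        = sample(0:10, 60, replace = TRUE),
  stay_home_restrictions_x = sample(0:3, 60, replace = TRUE),
  demo_age                 = sample(18:80, 60, replace = TRUE)
)

# Same recipe as in the analysis: scale, Euclidean distances, ward.D, k = 3.
hc  <- hclust(dist(scale(toy)), method = "ward.D")
grp <- cutree(hc, k = 3)

# Cluster sizes and per-cluster means in the original (unscaled) units.
sizes    <- table(grp)
profiles <- aggregate(toy, by = list(cluster = grp), FUN = mean)

print(sizes)
print(round(profiles, 2))
```

With the objects from the script at the end of this report, the equivalent call would be aggregate(mydata, by = list(cluster = clusters_hc), FUN = mean). The means are taken on the unscaled data so that they can be read in the survey’s own units.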
5. Insights on the Survey Respondents

This exercise revealed that the population responding to the “wellbeing after COVID” survey is not monolithic. People’s subjective well-being appears to be closely tied to the specific combination of restrictions they faced. The stability of the hierarchical solution is consistent with the idea that the impacts of COVID-19 policies are nested and cumulative: the psychological impact compounds as layers of restrictions (school + work + home) are added. It also suggests that policy impact is highly segmented, likely dividing populations by the severity of restrictions where they live and by generational (age) vulnerability.
# Setup and loading packages
library(cluster)
library(fpc)
library(mclust)
library(dplyr)
# Load data
covid_data <- read.csv(file.choose(), header=TRUE)
# Select variables
mydata <- covid_data %>%
  select(Life.Satisfaction, school_closing_x, workplace_closing_x,
         cancel_events_x, stay_home_restrictions_x, demo_age)
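# Force numeric conversion and clean missing values
# (as.character() first, so factor columns are not turned into their integer codes;
# values that cannot be parsed become NA and are dropped by na.omit)
mydata <- as.data.frame(lapply(mydata, function(x) as.numeric(as.character(x))))
mydata <- na.omit(mydata)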
# Scale data
mydata_scaled <- scale(mydata)
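# --- Hierarchical Clustering ---
distances <- dist(mydata_scaled, method="euclidean")
# Note: per the hclust documentation, "ward.D" matches Ward's criterion only on
# squared distances; "ward.D2" applies it to unsquared distances like these.
# "ward.D" is kept here because the reported results were produced with it.
hc_fit <- hclust(distances, method="ward.D")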
# Visualize dendrogram
plot(hc_fit, main="Hierarchical Clustering Dendrogram", xlab="", sub="", cex=0.9)
clusters_hc <- cutree(hc_fit, k=3)
rect.hclust(hc_fit, k=3, border="red")
# Evaluate stability (100 bootstrap resamples)
hc_boot <- clusterboot(mydata_scaled, B=100, clustermethod=hclustCBI,
                       method="ward.D", k=3, count=FALSE)
print(hc_boot$bootmean)
# --- K-Means Clustering ---
# Guess optimal clusters
guess <- Mclust(mydata_scaled)
print(summary(guess))
# Fit K-Means (k=3 for comparison)
clusters_k <- 3
k_fit <- kmeans(mydata_scaled, centers=clusters_k, iter.max=100, nstart=100)
# Visualize K-Means
clusplot(mydata_scaled, k_fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, main="K-Means Cluster Plot")
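# Evaluate stability (100 bootstrap resamples)
# Note: as far as I can tell, kmeansCBI uses its own `runs` argument (default 1)
# for the number of random starts, so these bootstrap fits do not use the
# nstart=100 setting of k_fit above.
km_boot <- clusterboot(mydata_scaled, B=100, clustermethod=kmeansCBI,
                       k=clusters_k, count=FALSE)
print(km_boot$bootmean)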
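The stability scores compare each method with itself across resamples. They do not show how far the two partitions agree respondent by respondent. A cross-tabulation is one simple way to look at that. The sketch below uses simulated stand-in data (the `toy` matrix, its three groups, and the seed are assumptions), so it only illustrates the mechanics.

```r
set.seed(42)  # arbitrary seed for the toy example

# Simulated stand-in data: three loose groups in two dimensions (hypothetical).
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2),
  matrix(rnorm(40, mean = 8), ncol = 2)
)
toy_scaled <- scale(toy)

hc_labels <- cutree(hclust(dist(toy_scaled), method = "ward.D"), k = 3)
km_labels <- kmeans(toy_scaled, centers = 3, iter.max = 100, nstart = 100)$cluster

# Rows: hierarchical clusters; columns: K-means clusters.
# Cluster numbers are arbitrary, so agreement shows up as one dominant
# cell per row rather than as a diagonal.
agreement <- table(hierarchical = hc_labels, kmeans = km_labels)
print(agreement)

# Share of observations that fall in the dominant cell of their row.
matched_share <- sum(apply(agreement, 1, max)) / sum(agreement)
print(matched_share)
```

With the objects from the script above, the corresponding call would be table(clusters_hc, k_fit$cluster). A share close to 1 would mean the two methods recover nearly the same groups despite their different stability scores, while a low share would mean they partition the respondents differently.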