This study is based on the Student Academic Stress dataset and aims to identify distinct stress profiles among students using unsupervised learning techniques, specifically Partitioning Around Medoids (PAM) and Hierarchical Clustering on Principal Components (HCPC). By optimizing the model to a 4-cluster solution within a PCA framework that captures 76% of the total variance, we achieved an average Silhouette Width of 0.31, indicating a modest but interpretable cluster structure. A key finding is that while overall stress intensity is relatively uniform across the population (\(p=0.1239\)), the underlying drivers, particularly Peer Pressure (\(p=0.0126\)), serve as the primary differentiators between student groups.
The initial phase involves loading libraries for data analysis and clustering. The raw data requires meticulous cleaning, including the removal of non-predictive timestamps and the standardization of scales to ensure that higher-magnitude variables do not disproportionately influence the Euclidean distance calculations.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  tidyverse, cluster, factoextra, corrplot, psych,
  fastDummies, fpc, gridExtra, Rtsne, FactoMineR, mclust
)
set.seed(42)

# Loading the data set
df <- read_csv("academic Stress level - maintainance 1.csv") %>%
  select(-1) %>%
  setNames(c("academic_stage", "peer_pressure", "academic_pressure_from_home",
             "study_environment", "coping_strategy", "bad_habits",
             "rating_of_academic_competition", "stress_index")) %>%
  drop_na()
# Encoding categorical and ordinal variables
df_prepared <- df %>%
  mutate(academic_stage_num = case_when(
    academic_stage == "high school" ~ 1,
    academic_stage == "undergraduate" ~ 2,
    academic_stage == "post-graduate" ~ 3)) %>%
  select(-academic_stage) %>%
  dummy_cols(select_columns = c("study_environment", "coping_strategy", "bad_habits"),
             remove_selected_columns = TRUE) %>%
  rename_with(~ str_remove_all(., "coping_strategy_"))

# Final scaling for algorithm performance
df_scaled <- as.data.frame(scale(df_prepared))

Before clustering, we examine how variables relate to one another. Using a hierarchical clustering order for the correlation matrix allows us to visualize “blocks” of correlated variables, which typically hint at the underlying dimensions that PCA will later extract.
# Visualizing correlations with appropriate margins
M <- cor(df_scaled)
corrplot(M, method = "color", type = "full",
         tl.col = "black", tl.cex = 0.7, tl.srt = 45,
         order = "hclust", mar = c(0, 0, 1, 0),
         title = "Variable Correlation Matrix")

The correlation matrix indicates that peer pressure and perceived academic competition are the variables most strongly associated with the stress index. Notably, students who report analyzing situations intellectually are markedly less likely to report emotional breakdowns (crying), suggesting these coping styles sit at opposite ends of the same dimension.
The Hopkins statistic is calculated to determine if the data possesses a non-random clustering tendency. We then utilize the Elbow, Silhouette, and Gap Statistic methods to identify the optimal number of clusters (\(k\)).
# Hopkins statistic: ~0.5 suggests random data; values near 1 indicate
# a strong clustering tendency (> 0.75 is often read as "high")
hopkins <- get_clust_tendency(df_scaled, n = 50, graph = FALSE)
print(paste("Hopkins Statistic:", round(hopkins$hopkins_stat, 4)))

## [1] "Hopkins Statistic: 0.7152"
# Optimal k search
elbow <- fviz_nbclust(df_scaled, pam, method = "wss") + labs(title = "Elbow Method")
sil <- fviz_nbclust(df_scaled, pam, method = "silhouette") + labs(title = "Silhouette Method")
gap_stat_values <- clusGap(df_scaled, FUN = pam, K.max = 10, B = 50)
gapstat <- fviz_gap_stat(gap_stat_values) + labs(title = "Gap Statistic")
grid.arrange(elbow, sil, gapstat, ncol = 3)

A Hopkins score of 0.7152 indicates a clear, non-random tendency for the data to form clusters. While the diagnostics suggested that more clusters were defensible, we selected \(k=4\) because our small sample would otherwise be split into groups too small to be statistically reliable.
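As a complementary check on that choice, the average silhouette width can be tabulated directly for a range of \(k\) values (a minimal sketch that mirrors the fviz_nbclust computation with pam on the scaled data):

# Average silhouette width for k = 2..6, computed directly with pam()
sil_by_k <- sapply(2:6, function(k) {
  pam(df_scaled, k = k, metric = "euclidean")$silinfo$avg.width
})
names(sil_by_k) <- paste0("k=", 2:6)
round(sil_by_k, 3)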
Survey data often contains redundant information. We apply Principal Component Analysis (PCA) with Varimax rotation to extract 6 components that explain the majority of the variance.
# Scree plot to verify the Kaiser Criterion (eigenvalues > 1)
pca_fit <- principal(df_scaled, nfactors = 6, rotate = "varimax")
plot(pca_fit$values, type = "b", main = "Scree Plot",
     xlab = "Component Number", ylab = "Eigenvalue", pch = 19, col = "blue")
abline(h = 1, col = "red", lty = 2)
# Print the rotated loadings; cut = 0.4 hides small loadings
# (assumed here to reproduce the output below)
print(pca_fit, cut = 0.4)

## Principal Components Analysis
## Call: principal(r = df_scaled, nfactors = 6, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item RC1 RC2 RC3 RC5
## stress_index 4 0.86
## academic_pressure_from_home 2 0.66
## peer_pressure 1 0.64
## rating_of_academic_competition 3 0.62 0.42
## bad_habits_No 12 -0.99
## bad_habits_Yes 14 0.69
## bad_habits_prefer not to say 13 0.67
## Emotional breakdown (crying a lot) 10 -0.89
## Analyze the situation and handle it with intellect 9 0.78
## study_environment_disrupted 6 0.91
## academic_stage_num 5 0.43
## study_environment_Noisy 7
## study_environment_Peaceful 8 -0.65
## Social support (friends, family) 11
## RC4 RC6 h2 u2 com
## stress_index 0.77 0.230 1.1
## academic_pressure_from_home 0.51 0.488 1.3
## peer_pressure 0.59 0.406 1.9
## rating_of_academic_competition 0.61 0.392 2.0
## bad_habits_No 0.99 0.013 1.0
## bad_habits_Yes 0.66 0.339 1.9
## bad_habits_prefer not to say 0.59 0.411 1.6
## Emotional breakdown (crying a lot) 0.85 0.153 1.2
## Analyze the situation and handle it with intellect -0.56 0.94 0.059 1.9
## study_environment_disrupted 0.87 0.126 1.1
## academic_stage_num 0.48 0.515 3.7
## study_environment_Noisy 0.94 0.92 0.081 1.1
## study_environment_Peaceful -0.71 0.97 0.027 2.1
## Social support (friends, family) 0.92 0.87 0.133 1.0
##
## RC1 RC2 RC3 RC5 RC4 RC6
## SS loadings 2.07 1.98 1.90 1.66 1.55 1.47
## Proportion Var 0.15 0.14 0.14 0.12 0.11 0.11
## Cumulative Var 0.15 0.29 0.42 0.54 0.65 0.76
## Proportion Explained 0.19 0.19 0.18 0.16 0.15 0.14
## Cumulative Proportion 0.19 0.38 0.56 0.72 0.86 1.00
##
## Mean item complexity = 1.7
## Test of the hypothesis that 6 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.08
## with the empirical chi square 174.05 with prob < 1.2e-25
##
## Fit based upon off diagonal values = 0.86
Our PCA model captures 76% of the total variance in the dataset, which means these six components retain most of the information about student stress. Following the Kaiser Criterion, we kept all six components because their eigenvalues all exceed 1.0. RC1 clearly represents “Academic Pressure”, as it is driven by the stress index, home pressure, and peer pressure. RC3 identifies “Coping Mechanisms”, showing a direct trade-off between intellectual analysis and emotional breakdowns. Together, these components provide a statistically defensible and readily interpretable way to distinguish between different types of students.
In the social sciences, where information is often less precise, it is not uncommon to consider a solution that accounts for 60 percent of the total variance as satisfactory, and in some instances even less. [1]
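Both criteria can be verified programmatically (a minimal sketch reusing the pca_fit object; for PCA on standardized data the eigenvalues sum to the number of variables):

# Components passing the Kaiser Criterion (eigenvalue > 1)
sum(pca_fit$values > 1)

# Cumulative proportion of variance explained by the first six components
round(cumsum(pca_fit$values) / ncol(df_scaled), 2)[1:6]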
We compare the PAM model before and after PCA to ensure that the noise reduction improved cluster separation. Furthermore, we conduct a Bootstrap stability test using the Jaccard index to ensure the identified clusters are robust.
# PCA component scores form the denoised feature space used below
# (assumed construction; the original listing omitted this step)
df_pca_scores <- as.data.frame(pca_fit$scores)

# PAM on the raw scaled data vs. on the PCA scores
pam_raw <- pam(df_scaled, k = 4, metric = "euclidean")
pam_pca <- pam(df_pca_scores, k = 4, metric = "euclidean")
print(paste("Avg Silhouette (Before PCA):", round(pam_raw$silinfo$avg.width, 4)))
print(paste("Avg Silhouette (After PCA):", round(pam_pca$silinfo$avg.width, 4)))

## [1] "Avg Silhouette (Before PCA): 0.2005"
## [1] "Avg Silhouette (After PCA): 0.26"
p1 <- fviz_cluster(pam_raw, data = df_scaled, geom = "point",
                   main = "Clustering before PCA") + theme_minimal()
p2 <- fviz_cluster(pam_pca, data = df_pca_scores, geom = "point",
                   main = "Clustering after PCA") + theme_minimal()
grid.arrange(p1, p2, ncol = 2)

# Bootstrap Jaccard means (stability scores)
cluster_stability <- clusterboot(df_pca_scores, clustermethod = pamkCBI, k = 4, count = FALSE)
print("Bootstrap Jaccard Means per cluster:")
print(cluster_stability$bootmean)

## [1] "Bootstrap Jaccard Means per cluster:"
## [1] 0.6581565 0.8564189 0.7530132 0.7445793

The results show that PCA successfully “cleaned” the data, raising the average Silhouette from 0.20 to 0.26 and producing better-defined clusters. All bootstrap Jaccard means exceed 0.60, indicating at least moderately stable clusters; cluster 2 (0.86) is highly stable, while cluster 1 (0.66) should be interpreted with somewhat more caution.
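To see how individual observations contribute to that average, the per-observation silhouette profile of the post-PCA solution can be plotted (a minimal sketch using factoextra):

# Per-cluster silhouette widths for PAM on the PCA scores;
# negative bars flag observations sitting between clusters
fviz_silhouette(pam_pca) + theme_minimal()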
Next, we apply HCPC (Hierarchical Clustering on Principal Components) and compare it with the PAM-after-PCA solution.
res.pca_fm <- PCA(df_scaled, ncp = 6, graph = FALSE)
res.hcpc <- HCPC(res.pca_fm, nb.clust = 4, graph = FALSE)

# Dendrogram analysis
fviz_dend(res.hcpc, cex = 0.5, main = "HCPC Dendrogram (K=4)")

We evaluate the consistency between PAM and HCPC using the Adjusted Rand Index (ARI). The superior Silhouette score of HCPC (approx. 0.31) justifies its selection as the primary model for final student profiling.
pam_labels <- pam_pca$clustering
hcpc_labels <- as.numeric(res.hcpc$data.clust$clust)

# Consistency check and cross-tabulation of the two label sets
ari_value <- adjustedRandIndex(pam_labels, hcpc_labels)
print(paste("Adjusted Rand Index (PAM vs HCPC):", round(ari_value, 4)))
table(PAM = pam_labels, HCPC = hcpc_labels)

## [1] "Adjusted Rand Index (PAM vs HCPC): 0.4445"
##    HCPC
## PAM  1  2  3  4
##   1 31  0  0  5
##   2 46  0  0  2
##   3  0 19  7  1
##   4  1  2 18  7
hcpc_coordinates <- res.pca_fm$ind$coord
dist_hcpc <- dist(hcpc_coordinates)
sil_hcpc <- silhouette(hcpc_labels, dist_hcpc)
print(paste("PAM (PCA) Avg Silhouette:", round(pam_pca$silinfo$avg.width, 4)))
print(paste("HCPC Avg Silhouette:", round(mean(sil_hcpc[, "sil_width"]), 4)))

## [1] "PAM (PCA) Avg Silhouette: 0.26"
## [1] "HCPC Avg Silhouette: 0.3139"
The biplot shows how specific variables (vectors) drive students (points) into clusters. t-SNE is a powerful visualization tool that maps complex, high-dimensional data into a simple 2D or 3D plot while keeping similar points close together. By using a Student-t distribution, it effectively reveals natural clusters and prevents points from crowding together in the center of the map. [2] We use t-SNE to confirm that the clusters form distinct groups in a lower-dimensional space.
# Attach final cluster labels to the original data for profiling
# (assumed construction: the profile sizes below match the HCPC solution,
# so the HCPC labels are used here)
df_final <- df %>% mutate(cluster = factor(hcpc_labels))

# Variable impact biplot
fviz_pca_biplot(res.pca_fm, geom.ind = "point", fill.ind = df_final$cluster,
                pointshape = 21, pointsize = 3, addEllipses = TRUE,
                ellipse.type = "convex", col.var = "black", repel = TRUE,
                title = "Distinct Student Stress Profiles") + theme_minimal()

# t-SNE non-linear validation
df_jittered <- jitter(as.matrix(df_scaled), factor = 0.00001)
tsne_results <- Rtsne(df_jittered, perplexity = 20, check_duplicates = FALSE)
tsne_data <- data.frame(X = tsne_results$Y[, 1], Y = tsne_results$Y[, 2], Cluster = df_final$cluster)
ggplot(tsne_data, aes(x = X, y = Y, color = Cluster)) +
  geom_point(size = 3, alpha = 0.7) +
  stat_ellipse(level = 0.95, linetype = 2) +
  labs(title = "t-SNE: Cluster Separation") + theme_minimal()

The biplot shows that peer pressure, home pressure, and academic competition are all strongly connected to high stress levels, as their arrows point in the same direction. In contrast, using intellectual strategies is positioned as the direct opposite of emotional breakdowns and study disruptions, indicating that these behaviors and outcomes are negatively correlated.
The t-SNE projection supports the four-cluster model, showing visibly separated groups of students in the two-dimensional embedding.
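If a numeric summary of that visual separation is wanted, a silhouette can also be computed in the embedding space (an illustrative sketch; this check was not part of the original analysis):

# Average silhouette width measured in the 2D t-SNE embedding
sil_tsne <- silhouette(as.numeric(tsne_data$Cluster), dist(tsne_results$Y))
round(mean(sil_tsne[, "sil_width"]), 3)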
We finalize the analysis by characterizing each cluster based on its mean stress levels and primary coping strategies. Kruskal-Wallis tests are employed to assess whether the differences in Peer Pressure and Competition across clusters are statistically significant rather than due to chance.
segment_profile <- df_final %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    avg_stress = round(mean(stress_index), 2),
    avg_peer_press = round(mean(peer_pressure), 2),
    avg_competition = round(mean(rating_of_academic_competition), 2),
    typical_coping = names(which.max(table(coping_strategy)))
  ) %>%
  arrange(desc(avg_stress))
print(segment_profile)

## # A tibble: 4 × 6
##   cluster     n avg_stress avg_peer_press avg_competition typical_coping
##   <fct>   <int>      <dbl>          <dbl>           <dbl> <chr>
## 1 4          15       4.07           3.47            3.47 Analyze the situation…
## 2 3          25       3.92           3.52            3.24 Emotional breakdown (…
## 3 2          21       3.67           2.62            3.1  Social support (frien…
## 4 1          78       3.6            2.95            3.67 Analyze the situation…
# Significance testing across clusters
target_vars <- c("peer_pressure", "academic_pressure_from_home",
                 "rating_of_academic_competition", "stress_index")
results_list <- list()
for (var in target_vars) {
  test_res <- kruskal.test(as.formula(paste(var, "~ cluster")), data = df_final)
  results_list[[var]] <- data.frame(Chi_Squared = round(test_res$statistic, 3),
                                    p_value = round(test_res$p.value, 4))
}
significance_table <- do.call(rbind, results_list) %>% arrange(p_value)
print(significance_table)

##                                Chi_Squared p_value
## peer_pressure                       10.841  0.0126
## rating_of_academic_competition       7.604  0.0549
## stress_index                         5.760  0.1239
## academic_pressure_from_home          4.032  0.2580
Cluster 1: “The Analytical Majority” — This is the largest group (\(n=78\)), characterized by high academic competition but a resilient, analytical approach to problem-solving.
Cluster 2: “The Socially Supported” — These students have the lowest peer pressure and are unique in relying on friends and family for support.
Cluster 3: “The Emotionally Overwhelmed” — These students face high peer pressure and are the most likely to experience emotional breakdowns (crying).
Cluster 4: “The High-Stress Solvers” — This group reports the highest average stress levels (\(4.07\)), but like Cluster 1, attempts to handle situations intellectually.
The Kruskal-Wallis test reveals that peer pressure is the most critical variable defining these groups (\(p = 0.0126\)), making it the primary driver of student differentiation. Academic competition is also a notable factor (\(p = 0.0549\)), while general stress and home pressure do not statistically distinguish the clusters as strongly.
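To pinpoint which cluster pairs differ on peer pressure, the omnibus test can be followed by post-hoc pairwise comparisons (a minimal sketch using base R's pairwise.wilcox.test with Benjamini-Hochberg correction; this step extends the reported analysis):

# Post-hoc pairwise comparisons of peer pressure across clusters,
# adjusting p-values for multiple testing (BH / FDR)
pairwise.wilcox.test(df_final$peer_pressure, df_final$cluster,
                     p.adjust.method = "BH", exact = FALSE)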
The segmentation identified four distinct student clusters.
The most critical finding is the role of Peer Pressure (\(p = 0.0126\)), which differentiates students far more effectively than their reported stress intensity.
University interventions should move beyond general stress reduction and focus on social resilience training and improving physical study environments to support the most vulnerable groups.
[1] Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2019). Multivariate Data Analysis (8th ed.). Cengage Learning.
[2] Van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.