Executive Summary

This study is based on the Student Academic Stress dataset and aims to identify distinct stress profiles among students using unsupervised learning techniques, specifically Partitioning Around Medoids (PAM) and Hierarchical Clustering on Principal Components (HCPC). With a 4-cluster solution built on a PCA that captures 76% of the total variance, HCPC achieved an average silhouette width of 0.31, which indicates a modest but interpretable cluster structure. A key finding is that overall stress intensity does not differ significantly between clusters (\(p=0.1239\)), whereas the underlying drivers, particularly Peer Pressure (\(p=0.0126\)), are the primary differentiators between student groups.

Phase 1: Environment Setup & Data Preprocessing

The initial phase involves loading libraries for data analysis and clustering. The raw data requires meticulous cleaning, including the removal of non-predictive timestamps and the standardization of scales to ensure that higher-magnitude variables do not disproportionately influence the Euclidean distance calculations.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  tidyverse, cluster, factoextra, corrplot, psych, 
  fastDummies, fpc, gridExtra, Rtsne, FactoMineR, mclust
)


set.seed(42)

# Loading the data set
df <- read_csv("academic Stress level - maintainance 1.csv") %>%
  select(-1) %>% 
  setNames(c("academic_stage", "peer_pressure", "academic_pressure_from_home", 
             "study_environment", "coping_strategy", "bad_habits", 
             "rating_of_academic_competition", "stress_index")) %>%
  drop_na()

# Encoding categorical and ordinal variables
df_prepared <- df %>%
  mutate(academic_stage_num = case_when(
    academic_stage == "high school" ~ 1,
    academic_stage == "undergraduate" ~ 2,
    academic_stage == "post-graduate" ~ 3)) %>%
  select(-academic_stage) %>%
  dummy_cols(select_columns = c("study_environment", "coping_strategy", "bad_habits"),
             remove_selected_columns = TRUE) %>% 
  rename_with(~ str_remove_all(., "coping_strategy_"))

# Final scaling for algorithm performance
df_scaled <- as.data.frame(scale(df_prepared))
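
To illustrate why the scaling step matters for Euclidean distance, the following minimal sketch uses a hypothetical toy data frame (the variables `rating` and `score` and all values are invented for illustration and are not part of the study data). It shows that the nearest neighbour of a point can change once both variables are standardized.

```r
# Hypothetical toy data: one variable on a 1-5 scale, one on a 0-100 scale
toy <- data.frame(
  rating = c(1, 5, 1, 3, 3),
  score  = c(50, 52, 60, 90, 10),
  row.names = c("A", "B", "C", "D", "E")
)

# Pairwise Euclidean distances before and after standardization
d_raw    <- as.matrix(dist(toy))
d_scaled <- as.matrix(dist(scale(toy)))

# Raw data: the 0-100 variable dominates, so A looks closer to B than to C
print(round(d_raw["A", c("B", "C")], 2))

# Scaled data: the 4-point rating gap now counts, so A is closer to C
print(round(d_scaled["A", c("B", "C")], 2))
```

On the raw scale, student A appears more similar to B even though their ratings sit at opposite ends of the scale; after `scale()`, each variable contributes in units of its own standard deviation.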

Correlation Structure

Before clustering, we examine how variables relate to one another. Using a hierarchical clustering order for the correlation matrix allows us to visualize “blocks” of correlated variables, which typically hint at the underlying dimensions that PCA will later extract.

# Visualizing correlations with appropriate margins
M <- cor(df_scaled)
corrplot(M, method = "color", type = "full", 
         tl.col = "black", tl.cex = 0.7, tl.srt = 45, 
         order = "hclust", mar = c(0,0,1,0),
         title = "Variable Correlation Matrix")

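The correlation matrix shows that peer pressure and academic competition are the variables most closely associated with the stress index; correlation alone does not establish that they cause stress. Students who handle problems analytically are much less likely to report emotional breakdowns, although part of this negative correlation is built in: both are indicator columns of the same coping-strategy question, so each student contributes a 1 to only one of them.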
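
When reading the correlation blocks, it helps to remember that indicator columns derived from one categorical variable are negatively correlated by construction. The sketch below demonstrates this on hypothetical answers (the shortened labels and the values are invented; `model.matrix()` is used here as a base-R stand-in for `dummy_cols()`).

```r
# Hypothetical coping-strategy answers (labels shortened for illustration)
coping <- factor(c("intellect", "breakdown", "support", "intellect",
                   "breakdown", "intellect", "support", "intellect"))

# One 0/1 indicator column per level (base-R stand-in for dummy_cols())
dummies <- model.matrix(~ coping - 1)
colnames(dummies) <- levels(coping)

# Every off-diagonal correlation is negative: choosing one answer
# rules out the others, regardless of any behavioural relationship
cor_dummies <- cor(dummies)
print(round(cor_dummies, 2))
```

Negative correlations between such columns therefore reflect the encoding as well as any behavioural trade-off, and should be interpreted with that in mind.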

Phase 2: Clusterability & Baseline Modeling

The Hopkins statistic is calculated to determine if the data possesses a non-random clustering tendency. We then utilize the Elbow, Silhouette, and Gap Statistic methods to identify the optimal number of clusters (\(k\)).

# Hopkins statistic > 0.75 indicates high clusterability
hopkins <- get_clust_tendency(df_scaled, n = 50, graph = FALSE)
print(paste("Hopkins Statistic:", round(hopkins$hopkins_stat, 4)))
## [1] "Hopkins Statistic: 0.7152"
# Optimal k search
elbow <- fviz_nbclust(df_scaled, pam, method = "wss") + labs(title = "Elbow Method")
sil   <- fviz_nbclust(df_scaled, pam, method = "silhouette") + labs(title = "Silhouette Method")
gap_stat_values <- clusGap(df_scaled, FUNcluster = pam, K.max = 10, B = 50)  # FUNcluster is the full argument name
gapstat <- fviz_gap_stat(gap_stat_values) + labs(title = "Gap Statistic")

grid.arrange(elbow, sil, gapstat, ncol = 3)

# Initial PAM model for comparison
pam_raw <- pam(df_scaled, k = 4, metric = "euclidean")

A Hopkins statistic of 0.7152 is close to, but below, the 0.75 threshold for high clusterability, so the data show a moderate rather than strong tendency to form clusters. While the selection criteria suggested more clusters, we selected \(k=4\) because a larger \(k\) would split our small sample into groups too small to be statistically reliable.
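
The trade-off between the gap statistic and minimum cluster size can be checked directly. The following is a minimal sketch on hypothetical toy data (three simulated blobs, not the study data); the wrapper `pam_cluster` is an illustrative helper that returns the `cluster` element `clusGap()` expects.

```r
library(cluster)

set.seed(42)
# Hypothetical toy data: three well-separated 2-D blobs of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.3), ncol = 2)
)

# Illustrative wrapper: returns a list with a `cluster` element
pam_cluster <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))

# Gap statistic for k = 1..6
gap <- clusGap(toy, FUNcluster = pam_cluster, K.max = 6, B = 30, verbose = FALSE)
print(round(gap$Tab, 3))

# Smallest cluster size for each candidate k: large k yields tiny groups
sizes_by_k <- lapply(2:6, function(k) table(pam(toy, k, cluster.only = TRUE)))
names(sizes_by_k) <- paste0("k=", 2:6)
min_size <- sapply(sizes_by_k, min)
print(min_size)
```

Reporting the smallest cluster size next to the gap values makes the reasoning behind a conservative choice of \(k\) explicit.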

Phase 3: Dimensionality Reduction (PCA)

Survey data often contains redundant information. We apply Principal Component Analysis (PCA) with Varimax rotation to extract 6 components that explain the majority of the variance.

# Scree plot to verify Kaiser Criterion (Eigenvalues > 1)
pca_fit <- principal(df_scaled, nfactors = 6, rotate = "varimax")

plot(pca_fit$values, type = "b", main = "Scree Plot", 
     xlab = "Component Number", ylab = "Eigenvalue", pch = 19, col = "blue")
abline(h = 1, col = "red", lty = 2)

# Viewing rotated loadings (> 0.4 cut-off for clarity)
print.psych(pca_fit, cut = 0.4, sort = TRUE)
## Principal Components Analysis
## Call: principal(r = df_scaled, nfactors = 6, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                                                    item   RC1   RC2   RC3   RC5
## stress_index                                          4  0.86                  
## academic_pressure_from_home                           2  0.66                  
## peer_pressure                                         1  0.64                  
## rating_of_academic_competition                        3  0.62        0.42      
## bad_habits_No                                        12       -0.99            
## bad_habits_Yes                                       14        0.69            
## bad_habits_prefer not to say                         13        0.67            
## Emotional breakdown (crying a lot)                   10             -0.89      
## Analyze the situation and handle it with intellect    9              0.78      
## study_environment_disrupted                           6                    0.91
## academic_stage_num                                    5                    0.43
## study_environment_Noisy                               7                        
## study_environment_Peaceful                            8                   -0.65
## Social support (friends, family)                     11                        
##                                                      RC4   RC6   h2    u2 com
## stress_index                                                   0.77 0.230 1.1
## academic_pressure_from_home                                    0.51 0.488 1.3
## peer_pressure                                                  0.59 0.406 1.9
## rating_of_academic_competition                                 0.61 0.392 2.0
## bad_habits_No                                                  0.99 0.013 1.0
## bad_habits_Yes                                                 0.66 0.339 1.9
## bad_habits_prefer not to say                                   0.59 0.411 1.6
## Emotional breakdown (crying a lot)                             0.85 0.153 1.2
## Analyze the situation and handle it with intellect       -0.56 0.94 0.059 1.9
## study_environment_disrupted                                    0.87 0.126 1.1
## academic_stage_num                                             0.48 0.515 3.7
## study_environment_Noisy                             0.94       0.92 0.081 1.1
## study_environment_Peaceful                         -0.71       0.97 0.027 2.1
## Social support (friends, family)                          0.92 0.87 0.133 1.0
## 
##                        RC1  RC2  RC3  RC5  RC4  RC6
## SS loadings           2.07 1.98 1.90 1.66 1.55 1.47
## Proportion Var        0.15 0.14 0.14 0.12 0.11 0.11
## Cumulative Var        0.15 0.29 0.42 0.54 0.65 0.76
## Proportion Explained  0.19 0.19 0.18 0.16 0.15 0.14
## Cumulative Proportion 0.19 0.38 0.56 0.72 0.86 1.00
## 
## Mean item complexity =  1.7
## Test of the hypothesis that 6 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.08 
##  with the empirical chi square  174.05  with prob <  1.2e-25 
## 
## Fit based upon off diagonal values = 0.86
df_pca_scores <- data.frame(pca_fit$scores)

Our PCA model captures 76% of the total variance in the dataset, so the six components retain most of the information about student stress. Following the Kaiser criterion, we retained the six components with eigenvalues greater than 1.0; after rotation, their sums of squared loadings range from 1.47 to 2.07. RC1 represents “Academic Pressure”: it loads on the stress index (0.86), home pressure (0.66), peer pressure (0.64), and academic competition (0.62). RC3 captures “Coping Mechanisms”, contrasting the intellectual strategy (0.78) with emotional breakdowns (-0.89). The fit is adequate rather than perfect (RMSR = 0.08, off-diagonal fit = 0.86), but the components provide a clear, interpretable basis for distinguishing between different types of students.

In the social sciences, where information is often less precise, it is not uncommon to consider a solution that accounts for 60 percent of the total variance as satisfactory, and in some instances even less. [1]
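
The link between eigenvalues, the Kaiser criterion, and the share of variance explained can be sketched in base R. The example uses hypothetical simulated data (two latent factors, six observed variables), not the survey data. The same arithmetic applies to the output above: the six sums of squared loadings add up to about 10.63, and 10.63 / 14 variables gives the reported 0.76.

```r
set.seed(42)
# Hypothetical toy data: two latent factors, each driving three variables
n  <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.5), a2 = f1 + rnorm(n, sd = 0.5),
  a3 = f1 + rnorm(n, sd = 0.5), b1 = f2 + rnorm(n, sd = 0.5),
  b2 = f2 + rnorm(n, sd = 0.5), b3 = f2 + rnorm(n, sd = 0.5)
)

# Eigenvalues of the correlation matrix; they sum to the number of variables
eig <- eigen(cor(toy))$values

# Kaiser criterion: keep components with eigenvalue > 1
kaiser_k <- sum(eig > 1)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(eig) / length(eig)

print(round(eig, 2))
print(kaiser_k)
print(round(cum_var, 2))
```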

Stability & PCA Impact Validation

We compare the PAM model before and after PCA to ensure that the noise reduction improved cluster separation. Furthermore, we conduct a Bootstrap stability test using the Jaccard index to ensure the identified clusters are robust.

pam_pca <- pam(df_pca_scores, k = 4, metric = "euclidean")

print(paste("Avg Silhouette (Before PCA):", round(pam_raw$silinfo$avg.width, 4)))
## [1] "Avg Silhouette (Before PCA): 0.2005"
print(paste("Avg Silhouette (After PCA):", round(pam_pca$silinfo$avg.width, 4)))
## [1] "Avg Silhouette (After PCA): 0.26"
p1 <- fviz_cluster(pam_raw, data = df_scaled, geom = "point", 
                   main = "Clustering before PCA") + theme_minimal()

p2 <- fviz_cluster(pam_pca, data = df_pca_scores, geom = "point", 
                   main = "Clustering after PCA") + theme_minimal()

grid.arrange(p1, p2, ncol = 2)

# Bootstrap Jaccard Means (Stability scores)
cluster_stability <- clusterboot(df_pca_scores, clustermethod = pamkCBI, k = 4, count = FALSE)
print("Bootstrap Jaccard Means per cluster:")
## [1] "Bootstrap Jaccard Means per cluster:"
print(cluster_stability$bootmean)
## [1] 0.6581565 0.8564189 0.7530132 0.7445793

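PCA improved cluster separation: the average silhouette rose from 0.20 to 0.26, although both values still indicate weakly separated clusters. All bootstrap Jaccard means exceed 0.60, so no PAM cluster dissolves under resampling. Clusters 2 (0.86) and 3 (0.75) reach the 0.75 level usually taken to indicate a stable cluster, while clusters 1 (0.66) and 4 (0.74) mark patterns whose exact membership is less certain.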
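
As a reading aid, the bootstrap means printed above can be mapped onto the rule-of-thumb bands given in the `clusterboot` documentation (below 0.6, 0.6 to 0.75, 0.75 to 0.85, and 0.85 or more). The band labels in this sketch are shorthand chosen for illustration.

```r
# Bootstrap Jaccard means as printed above (PAM clusters 1 to 4)
bootmean <- c(0.6581565, 0.8564189, 0.7530132, 0.7445793)

# Rule-of-thumb bands; the label wording is illustrative shorthand
stability <- cut(
  bootmean,
  breaks = c(0, 0.6, 0.75, 0.85, 1),
  labels = c("not trustworthy", "pattern, membership doubtful",
             "stable", "highly stable"),
  right = FALSE, include.lowest = TRUE
)

stability_table <- data.frame(
  cluster = seq_along(bootmean),
  jaccard = round(bootmean, 3),
  stability = stability
)
print(stability_table)
```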

Phase 4: Advanced Clustering (HCPC)

Next, we apply HCPC (Hierarchical Clustering on Principal Components) and compare its results with those of PAM on the PCA scores.

res.pca_fm <- PCA(df_scaled, ncp = 6, graph = FALSE)
res.hcpc   <- HCPC(res.pca_fm, nb.clust = 4, graph = FALSE)

# Dendrogram analysis
fviz_dend(res.hcpc, cex = 0.5, main = "HCPC Dendrogram (K=4)")

Algorithm Comparison & Final Labeling

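We evaluate the consistency between PAM and HCPC using the Adjusted Rand Index (ARI). The ARI of 0.44 indicates only moderate agreement: the cross-tabulation shows that HCPC merges PAM clusters 1 and 2 into a single large cluster and divides the remaining students differently. HCPC's higher average silhouette (approx. 0.31, versus 0.26 for PAM) motivates its selection as the primary model for final student profiling.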

pam_labels  <- pam_pca$clustering
hcpc_labels <- as.integer(as.character(res.hcpc$data.clust$clust))  # use the factor labels, not the internal level codes

# Consistency check
ari_value <- adjustedRandIndex(pam_labels, hcpc_labels)
print(paste("Adjusted Rand Index (PAM vs HCPC):", round(ari_value, 4)))
## [1] "Adjusted Rand Index (PAM vs HCPC): 0.4445"
# Cross-tabulation of cluster assignments
table(PAM = pam_labels, HCPC = hcpc_labels)
##    HCPC
## PAM  1  2  3  4
##   1 31  0  0  5
##   2 46  0  0  2
##   3  0 19  7  1
##   4  1  2 18  7
hcpc_coordinates <- res.pca_fm$ind$coord
dist_hcpc <- dist(hcpc_coordinates)

sil_hcpc <- silhouette(hcpc_labels, dist_hcpc)

print(paste("PAM (PCA) Avg Silhouette:", round(pam_pca$silinfo$avg.width, 4)))
## [1] "PAM (PCA) Avg Silhouette: 0.26"
print(paste("HCPC Avg Silhouette:", round(mean(sil_hcpc[, 3]), 4)))
## [1] "HCPC Avg Silhouette: 0.3139"
# Merging cluster labels into final dataframe
df_final <- df %>% mutate(cluster = as.factor(res.hcpc$data.clust$clust))
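
For readers who want to see where the ARI comes from, the sketch below recomputes it in base R from the cross-tabulation printed above, using the standard pair-counting formula. The helper `n_pairs` is an illustrative name, not part of any package.

```r
# Cross-tabulation of PAM (rows) vs HCPC (columns), as printed above
tab <- matrix(c(31,  0,  0, 5,
                46,  0,  0, 2,
                 0, 19,  7, 1,
                 1,  2, 18, 7),
              nrow = 4, byrow = TRUE,
              dimnames = list(PAM = 1:4, HCPC = 1:4))

# Illustrative helper: number of unordered pairs within each count
n_pairs <- function(x) sum(choose(x, 2))

n        <- sum(tab)
index    <- n_pairs(tab)                        # pairs grouped together by both
expected <- n_pairs(rowSums(tab)) * n_pairs(colSums(tab)) / choose(n, 2)
maximum  <- (n_pairs(rowSums(tab)) + n_pairs(colSums(tab))) / 2

ari <- (index - expected) / (maximum - expected)
print(round(ari, 4))

# Share of each PAM cluster that lands in each HCPC cluster
print(round(prop.table(tab, margin = 1), 2))
```

The row proportions make the pattern visible: most members of PAM clusters 1 and 2 fall into HCPC cluster 1, which is why agreement is only moderate despite both methods using four clusters.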

Phase 5: Visual Validation (Biplot & t-SNE)

The biplot shows how specific variables (vectors) relate to the positions of students (points) and their clusters. t-SNE is a visualization technique that maps high-dimensional data into a 2D or 3D plot while keeping similar points close together. By using a Student-t distribution in the low-dimensional space, it reduces the tendency of points to crowd together in the center of the map.[2] We use t-SNE to check whether the clusters also form distinct groups in a non-linear, lower-dimensional embedding.

# Variable impact biplot
fviz_pca_biplot(res.pca_fm, geom.ind = "point", fill.ind = df_final$cluster, 
                pointshape = 21, pointsize = 3, addEllipses = TRUE, 
                ellipse.type = "convex", col.var = "black", repel = TRUE,
                title = "Distinct Student Stress Profiles") + theme_minimal()

# t-SNE non-linear validation
df_jittered <- jitter(as.matrix(df_scaled), factor = 0.00001)
tsne_results <- Rtsne(df_jittered, perplexity = 20, check_duplicates = FALSE)
tsne_data <- data.frame(X = tsne_results$Y[,1], Y = tsne_results$Y[,2], Cluster = df_final$cluster)

ggplot(tsne_data, aes(x = X, y = Y, color = Cluster)) +
  geom_point(size = 3, alpha = 0.7) +
  stat_ellipse(level = 0.95, linetype = 2) +
  labs(title = "t-SNE: Cluster Separation") + theme_minimal()

The biplot shows that peer pressure, home pressure, and academic competition are all strongly connected to high stress levels, as their arrows point in the same direction. In contrast, using intellectual strategies is positioned as the direct opposite of emotional breakdowns and study disruptions, indicating that these behaviors and outcomes are negatively correlated.

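The t-SNE visualization is consistent with the four-cluster model: students assigned to the same cluster tend to appear close together in the embedding. Because t-SNE preserves local neighborhoods rather than global distances, this is supporting visual evidence and not a formal test of cluster validity.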

Phase 6: Profiling & Statistical Significance

We finalize the analysis by characterizing each cluster by its mean stress level and most common coping strategy. Kruskal-Wallis tests are then used to assess whether the differences between clusters in peer pressure, home pressure, academic competition, and the stress index are statistically significant.

segment_profile <- df_final %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    avg_stress = round(mean(stress_index), 2),
    avg_peer_press = round(mean(peer_pressure), 2),
    avg_competition = round(mean(rating_of_academic_competition), 2),
    typical_coping = names(which.max(table(coping_strategy)))
  ) %>% arrange(desc(avg_stress))

print(segment_profile)
## # A tibble: 4 × 6
##   cluster     n avg_stress avg_peer_press avg_competition typical_coping        
##   <fct>   <int>      <dbl>          <dbl>           <dbl> <chr>                 
## 1 4          15       4.07           3.47            3.47 Analyze the situation…
## 2 3          25       3.92           3.52            3.24 Emotional breakdown (…
## 3 2          21       3.67           2.62            3.1  Social support (frien…
## 4 1          78       3.6            2.95            3.67 Analyze the situation…
# Significance testing across clusters
target_vars <- c("peer_pressure", "academic_pressure_from_home", 
                 "rating_of_academic_competition", "stress_index")

results_list <- list()
for (var in target_vars) {
  test_res <- kruskal.test(as.formula(paste(var, "~ cluster")), data = df_final)
  results_list[[var]] <- data.frame(Chi_Squared = round(test_res$statistic, 3), 
                                    p_value = round(test_res$p.value, 4))
}

significance_table <- do.call(rbind, results_list) %>% arrange(p_value)
print(significance_table)
##                                Chi_Squared p_value
## peer_pressure                       10.841  0.0126
## rating_of_academic_competition       7.604  0.0549
## stress_index                         5.760  0.1239
## academic_pressure_from_home          4.032  0.2580
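
The p-values in this table follow from the chi-squared statistics with \(k - 1 = 3\) degrees of freedom for four clusters. The sketch below recomputes them from the printed statistics and, as an optional robustness check that is not part of the original analysis, applies a Holm adjustment for the four tests.

```r
# Kruskal-Wallis statistics as printed above
kw <- data.frame(
  variable = c("peer_pressure", "rating_of_academic_competition",
               "stress_index", "academic_pressure_from_home"),
  chi_squared = c(10.841, 7.604, 5.760, 4.032)
)

# Four clusters give k - 1 = 3 degrees of freedom
n_clusters <- 4
kw$p_value <- pchisq(kw$chi_squared, df = n_clusters - 1, lower.tail = FALSE)

# Assumption: Holm adjustment added here only as a sensitivity check
kw$p_holm <- p.adjust(kw$p_value, method = "holm")

print(transform(kw, p_value = round(p_value, 4), p_holm = round(p_holm, 4)))
```

Holm multiplies the smallest of the four p-values by four, so the adjusted value for peer pressure is roughly 0.05. Under this stricter reading the peer-pressure difference is borderline, which is worth keeping in mind when interpreting the profiles.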

Summary of Cluster Profiles

  • Cluster 1, “The Analytical Majority”: This is the largest group (\(n=78\)), with the highest rated academic competition (3.67), the lowest average stress (3.60), and a predominantly analytical approach to problem-solving.

  • Cluster 2, “The Socially Supported”: These students (\(n=21\)) report the lowest peer pressure (2.62) and form the only group whose most common coping strategy is support from friends and family.

  • Cluster 3, “The Emotionally Overwhelmed”: These students (\(n=25\)) report the highest peer pressure (3.52), and their most common coping response is an emotional breakdown (crying).

  • Cluster 4, “The High-Stress Solvers”: This group (\(n=15\)) reports the highest average stress level (\(4.07\)) but, like Cluster 1, most commonly handles situations intellectually.

Statistical Significance

The Kruskal-Wallis tests show that peer pressure is the only variable that differs significantly between clusters at the 5% level (\(p = 0.0126\)), making it the clearest differentiator of the student groups. Academic competition falls just short of significance (\(p = 0.0549\)), while the stress index (\(p = 0.1239\)) and home pressure (\(p = 0.2580\)) do not distinguish the clusters.

Final Conclusions

The segmentation identified four distinct student clusters.

The most critical finding is the role of Peer Pressure (\(p = 0.0126\)), which differentiates students far more effectively than their reported stress intensity.

University interventions should move beyond general stress reduction and focus on social resilience training and improving physical study environments to support the most vulnerable groups.

References

[1] Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2019). Multivariate Data Analysis (8th ed.). Cengage Learning.

[2] Van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.