pada Dataset Students Mental Health Survey

Target: Stress_Level_Group (Low, Medium, High)

# Install dan Load Library
library(MASS)      # Untuk LDA
## Warning: package 'MASS' was built under R version 4.5.3
library(ordinal)   # Untuk Regresi Logistik Ordinal (clm)
## Warning: package 'ordinal' was built under R version 4.5.3
library(MVN)       # Untuk Uji Multivariate Normality
## Warning: package 'MVN' was built under R version 4.5.3
library(biotools)  # Untuk Box's M Test
## Warning: package 'biotools' was built under R version 4.5.3
## ---
## biotools version 4.3
library(caret)     # Untuk Split Data & Evaluasi
## Warning: package 'caret' was built under R version 4.5.3
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:ordinal':
## 
##     slice
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(corrplot)
## corrplot 0.95 loaded
library(car)       # Untuk VIF
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.5.3
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(skimr)  
## Warning: package 'skimr' was built under R version 4.5.3
library(DT)

1 Load Data

Tahap ini bertujuan untuk memasukkan dataset mentah agar dapat diproses pada tahap selanjutnya.

df <- read.csv("students_mental_health_survey.csv", stringsAsFactors = TRUE)

# Menampilkan data 
datatable(head(df, 100), options = list(pageLength = 5, scrollX = TRUE))
#Memeriksa tipe data setiap variabel (numerik, faktor, atau karakter) dan mendeteksi data yang hilang.
skim(df)
## Warning: There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
##   mangled_skimmers$funs)`.
## ℹ In group 0: .
## Caused by warning:
## ! There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
##   mangled_skimmers$funs)`.
## Caused by warning in `sorted_count()`:
## ! Variable contains value(s) of "" that have been converted to "empty".
Data summary
Name df
Number of rows 7022
Number of columns 20
_______________________
Column type frequency:
factor 13
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Course 0 1 FALSE 6 Med: 2105, Law: 1385, Eng: 1072, Com: 1028
Gender 0 1 FALSE 2 Mal: 3547, Fem: 3475
Sleep_Quality 0 1 FALSE 3 Goo: 3589, Ave: 2735, Poo: 698
Physical_Activity 0 1 FALSE 3 Mod: 3521, Low: 2091, Hig: 1410
Diet_Quality 0 1 FALSE 3 Ave: 4268, Goo: 1385, Poo: 1369
Social_Support 0 1 FALSE 3 Mod: 3470, Hig: 2176, Low: 1376
Relationship_Status 0 1 FALSE 3 Sin: 3574, In : 2079, Mar: 1369
Substance_Use 0 1 FALSE 4 Nev: 5903, Occ: 699, Fre: 405, emp: 15
Counseling_Service_Use 0 1 FALSE 3 Nev: 4263, Occ: 2081, Fre: 678
Family_History 0 1 FALSE 2 No: 4866, Yes: 2156
Chronic_Illness 0 1 FALSE 2 No: 6678, Yes: 344
Extracurricular_Involvement 0 1 FALSE 3 Mod: 3440, Low: 2164, Hig: 1418
Residence_Type 0 1 FALSE 3 On-: 2815, Off: 2788, Wit: 1419

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 23.00 3.85 18.00 20.00 22.0 25.0 35 ▇▅▅▁▁
CGPA 12 1 3.49 0.29 2.44 3.29 3.5 3.7 4 ▁▂▆▇▆
Stress_Level 0 1 2.43 1.64 0.00 1.00 2.0 4.0 5 ▇▅▅▃▃
Depression_Score 0 1 2.25 1.63 0.00 1.00 2.0 3.0 5 ▇▅▅▃▂
Anxiety_Score 0 1 2.30 1.62 0.00 1.00 2.0 4.0 5 ▇▅▅▃▃
Financial_Stress 0 1 2.45 1.71 0.00 1.00 2.0 4.0 5 ▇▃▃▃▃
Semester_Credit_Load 0 1 22.01 4.36 15.00 18.00 22.0 26.0 29 ▇▇▇▇▇

2 Preprocessing

Tahap ini Membersihkan dan mentransformasi data. tingkat stres diubah menjadi kategori ordinal (berurutan) dan data kosong (NA) dihapus agar tidak mengganggu proses perhitungan model. Mengelompokkan Stress_Level (0-5) menjadi 3 kategori (Ordinal) 0-1: Low, 2-3: Medium, 4-5: High

df$Stress_Level_Group <- cut(df$Stress_Level, 
                             breaks = c(-Inf, 1, 3, Inf), 
                             labels = c("Low", "Medium", "High"),
                             ordered_result = TRUE)
# Menghapus kolom target asli dan baris dengan NA
df <- df %>% select(-Stress_Level)
df <- na.omit(df)
cat("Distribusi Target (Stress Level Group)\n")
## Distribusi Target (Stress Level Group)
print(table(df$Stress_Level_Group))
## 
##    Low Medium   High 
##   2323   2681   2006
print(prop.table(table(df$Stress_Level_Group)))
## 
##       Low    Medium      High 
## 0.3313837 0.3824536 0.2861626

3 Split Data (80% Train, 20% Test)

set.seed(42)
idx_train  <- createDataPartition(df$Stress_Level_Group, p = 0.8, list = FALSE)
data_train <- df[idx_train, ]
data_test  <- df[-idx_train, ]

4 Eksplorasi Data (EDA)

Pilih variabel numerik untuk visualisasi Memahami pola sebaran data dan hubungan antar variabel melalui visualisasi boxplot dan heatmap korelasi sebelum melakukan pemodelan.

numerik_vars <- c("Age", "CGPA", "Depression_Score", "Anxiety_Score", 
                  "Financial_Stress", "Semester_Credit_Load")

df_long <- df %>%
  select(all_of(numerik_vars), Stress_Level_Group) %>%
  pivot_longer(cols = -Stress_Level_Group,
               names_to = "Variable",
               values_to = "Value")
# 1. Membuat Model
model_ordinal <- clm(Stress_Level_Group ~ Age + CGPA + Depression_Score + 
                       Anxiety_Score + Financial_Stress + Semester_Credit_Load, 
                     data = data_train)

# 2. Menampilkan Summary
summary(model_ordinal)
## formula: 
## Stress_Level_Group ~ Age + CGPA + Depression_Score + Anxiety_Score + Financial_Stress + Semester_Credit_Load
## data:    data_train
## 
##  link  threshold nobs logLik   AIC      niter max.grad cond.H 
##  logit flexible  5609 -6115.96 12247.93 4(0)  7.60e-12 4.6e+05
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)  
## Age                  -0.0088965  0.0063341  -1.405   0.1602  
## CGPA                  0.0544560  0.0860981   0.632   0.5271  
## Depression_Score     -0.0344174  0.0152314  -2.260   0.0238 *
## Anxiety_Score        -0.0322719  0.0151394  -2.132   0.0330 *
## Financial_Stress      0.0207237  0.0143913   1.440   0.1499  
## Semester_Credit_Load  0.0007763  0.0056660   0.137   0.8910  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##             Estimate Std. Error z value
## Low|Medium   -0.8011     0.3608  -2.221
## Medium|High   0.8180     0.3608   2.267

5 Plot Boxplot

p_box <- ggplot(df_long, aes(x = Stress_Level_Group, y = Value, fill = Stress_Level_Group)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~Variable, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("Low" = "#2ecc71", "Medium" = "#f39c12", "High" = "#e74c3c")) +
  labs(title = "Distribusi Variabel Numerik per Stress Level",
       x = "Stress Level Group", y = "Nilai") +
  theme_bw() +
  theme(legend.position = "bottom")
print(p_box)

6 Plot Heatmap Korelasi

cor_matrix <- cor(df[, numerik_vars])
p_heatmap <- corrplot(cor_matrix, method = "color", addCoef.col = "black", 
                       tl.col = "black", tl.srt = 45, 
                       title = "\nHeatmap Korelasi Variabel Numerik",
                       mar = c(0,0,1,0))

# Visualisasi Pengaruh Variabel (Coefficient Plot)
coef_data <- as.data.frame(summary(model_ordinal)$coefficients)
coef_data$Variable <- rownames(coef_data)
# Ambil hanya variabel prediktor (bukan threshold/intercept)
coef_pred <- coef_data[1:length(numerik_vars), ]

ggplot(coef_pred, aes(x = reorder(Variable, Estimate), y = Estimate)) +
  geom_pointrange(aes(ymin = Estimate - `Std. Error`, ymax = Estimate + `Std. Error`), color = "#3498db") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(title = "Variabel yang Mempengaruhi Tingkat Stres",
       subtitle = "Semakin ke kanan, variabel semakin meningkatkan risiko stres",
       x = "Variabel", y = "Koefisien (Log-Odds)") +
  theme_minimal()

7 Prediksi probabilitas berdasarkan Depression Score

# Membuat data simulasi (semua variabel dibuat rata-rata, kecuali Depression Score)
new_data_dep <- data.frame(
  Age = mean(data_train$Age, na.rm = TRUE),
  CGPA = mean(data_train$CGPA, na.rm = TRUE),
  Depression_Score = seq(min(df$Depression_Score, na.rm = TRUE), 
                         max(df$Depression_Score, na.rm = TRUE), length.out = 100),
  Anxiety_Score = mean(data_train$Anxiety_Score, na.rm = TRUE),
  Financial_Stress = mean(data_train$Financial_Stress, na.rm = TRUE),
  Semester_Credit_Load = mean(data_train$Semester_Credit_Load, na.rm = TRUE)
)

# Melakukan prediksi probabilitas
prob_dep <- predict(model_ordinal, newdata = new_data_dep, type = "prob")

# mengubah matriks prediksi menjadi data frame 
prob_dep_df <- as.data.frame(prob_dep)
colnames(prob_dep_df) <- c("Low", "Medium", "High")  

# Menggabungkan data simulasi dengan hasil prediksi
final_plot_df <- cbind(new_data_dep, prob_dep_df)

# Reshape data (pivot_longer)
prob_dep_long <- final_plot_df %>%
  pivot_longer(cols = c("Low", "Medium", "High"), 
               names_to = "Stress_Level", 
               values_to = "Probability")

# Visualisasi Kurva Probabilitas
ggplot(prob_dep_long, aes(x = Depression_Score, y = Probability, color = Stress_Level)) +
  geom_line(linewidth = 1.2) +
  labs(title = "Analisis Peluang Tingkat Stres Berdasarkan Skor Depresi",
       subtitle = "Melihat bagaimana peningkatan skor depresi menggeser risiko stres mahasiswa",
       x = "Depression Score", y = "Probabilitas (Peluang)", color = "Tingkat Stres") +
  scale_color_manual(values = c("Low" = "#2ecc71", "Medium" = "#f39c12", "High" = "#e74c3c")) +
  theme_minimal()

8 Cek Multicollinearity (VIF)

model_vif <- lm(as.numeric(Stress_Level_Group) ~ Age + CGPA + Depression_Score + 
                  Anxiety_Score + Financial_Stress + Semester_Credit_Load, 
                data = data_train)
print(vif(model_vif))
##                  Age                 CGPA     Depression_Score 
##             1.000470             1.000670             1.001361 
##        Anxiety_Score     Financial_Stress Semester_Credit_Load 
##             1.000438             1.000273             1.001084

9 Menghitung jarak Mahalanobis untuk mencari outlier

m_dist <- mahalanobis(data_train[, numerik_vars], 
                      colMeans(data_train[, numerik_vars]), 
                      cov(data_train[, numerik_vars]))

# Mengambil data yang tidak terlalu ekstrem (misal 95% data terbaik)
cutoff <- qchisq(0.95, df = ncol(data_train[, numerik_vars]))
data_train_clean <- data_train[m_dist < cutoff, ]

# Jalankan ulang MVN dengan data_train_clean
mvn_result <- mvn(data = data_train_clean[, numerik_vars])
print(mvn_result$multivariateNormality)
## NULL

10 Cek Multivariate Normality (Asumsi LDA)

mvn_result <- mvn(data = data_train[, numerik_vars])
print(mvn_result$multivariateNormality)
## NULL

Berdasarkan hasil uji Henze-Zirkler, diperoleh nilai \(p < 0.05\), yang mengindikasikan bahwa asumsi normalitas multivariat tidak terpenuhi. Hal ini terjadi pada dataset Students Mental Health karena beberapa faktor berikut:

Skala Diskret: Variabel seperti skor depresi dan kecemasan diukur menggunakan skala terbatas (0-5), sehingga data cenderung menumpuk pada nilai tertentu dan tidak membentuk kurva lonceng yang sempurna (non-normal distribution).

Sensitivitas Uji: Dengan jumlah observasi yang besar, uji statistik formal menjadi sangat sensitif terhadap penyimpangan kecil sekalipun.

11 Implementasi Model 1: Linear Discriminant Analysis (LDA)

Mencari kombinasi linear dari variabel independen yang paling baik dalam membedakan kelompok tingkat stres mahasiswa.

cat("\n--- MODEL LDA ---\n")
## 
## --- MODEL LDA ---
model_lda <- lda(Stress_Level_Group ~ Age + CGPA + Depression_Score + 
                   Anxiety_Score + Financial_Stress + Semester_Credit_Load, 
                 data = data_train)
print(model_lda)
## Call:
## lda(Stress_Level_Group ~ Age + CGPA + Depression_Score + Anxiety_Score + 
##     Financial_Stress + Semester_Credit_Load, data = data_train)
## 
## Prior probabilities of groups:
##       Low    Medium      High 
## 0.3314316 0.3824211 0.2861473 
## 
## Group means:
##             Age     CGPA Depression_Score Anxiety_Score Financial_Stress
## Low    23.09629 3.488349         2.321140      2.349650         2.384615
## Medium 23.03170 3.488410         2.241026      2.283916         2.510490
## High   22.90779 3.495234         2.197508      2.233022         2.462928
##        Semester_Credit_Load
## Low                21.95966
## Medium             21.91375
## High               21.97134
## 
## Coefficients of linear discriminants:
##                               LD1         LD2
## Age                  -0.084198302 -0.09508878
## CGPA                  0.483880368  1.16866010
## Depression_Score     -0.364215653 -0.08503535
## Anxiety_Score        -0.333527234 -0.14992025
## Financial_Stress      0.281465923 -0.45432650
## Semester_Credit_Load  0.002607351  0.05910518
## 
## Proportion of trace:
##   LD1   LD2 
## 0.829 0.171
# Prediksi LDA
pred_lda <- predict(model_lda, data_test)
conf_lda <- confusionMatrix(pred_lda$class, data_test$Stress_Level_Group)
cat("\nAkurasi LDA:\n")
## 
## Akurasi LDA:
print(conf_lda$overall['Accuracy'])
##  Accuracy 
## 0.3761599

12 Implementasi Model 2: Regresi Logistik Ordinal

cat("\n--- MODEL REGRESI LOGISTIK ORDINAL ---\n")
## 
## --- MODEL REGRESI LOGISTIK ORDINAL ---
# Menggunakan clm (Cumulative Link Model)
model_ordinal <- clm(Stress_Level_Group ~ Age + CGPA + Depression_Score + 
                       Anxiety_Score + Financial_Stress + Semester_Credit_Load, 
                     data = data_train)
summary(model_ordinal)
## formula: 
## Stress_Level_Group ~ Age + CGPA + Depression_Score + Anxiety_Score + Financial_Stress + Semester_Credit_Load
## data:    data_train
## 
##  link  threshold nobs logLik   AIC      niter max.grad cond.H 
##  logit flexible  5609 -6115.96 12247.93 4(0)  7.60e-12 4.6e+05
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)  
## Age                  -0.0088965  0.0063341  -1.405   0.1602  
## CGPA                  0.0544560  0.0860981   0.632   0.5271  
## Depression_Score     -0.0344174  0.0152314  -2.260   0.0238 *
## Anxiety_Score        -0.0322719  0.0151394  -2.132   0.0330 *
## Financial_Stress      0.0207237  0.0143913   1.440   0.1499  
## Semester_Credit_Load  0.0007763  0.0056660   0.137   0.8910  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##             Estimate Std. Error z value
## Low|Medium   -0.8011     0.3608  -2.221
## Medium|High   0.8180     0.3608   2.267
# Prediksi Ordinal
pred_ord_prob <- predict(model_ordinal, newdata = data_test, type = "class")
conf_ordinal <- confusionMatrix(pred_ord_prob$fit, data_test$Stress_Level_Group)
cat("\nAkurasi Regresi Ordinal:\n")
## 
## Akurasi Regresi Ordinal:
print(conf_ordinal$overall['Accuracy'])
##  Accuracy 
## 0.3761599

13 Perbandingan Performa

Membandingkan tingkat akurasi antara kedua model untuk menentukan model mana yang paling efektif dalam memprediksi kesehatan mental mahasiswa.

performance <- data.frame(
  Model = c("LDA", "Ordinal Logistic Regression"),
  Accuracy = c(conf_lda$overall['Accuracy'], conf_ordinal$overall['Accuracy'])
)
print(performance)
##                         Model  Accuracy
## 1                         LDA 0.3761599
## 2 Ordinal Logistic Regression 0.3761599

14 Kesimpulan

Berdasarkan hasil analisis menggunakan metode Linear Discriminant Analysis (LDA) dan Regresi Logistik Ordinal pada dataset Students Mental Health Survey, dapat diambil beberapa poin kesimpulan utama:

Identifikasi Prediktor Utama: Variabel psikologis seperti Depression_Score dan Anxiety_Score, serta faktor eksternal seperti Financial_Stress, terbukti memiliki kontribusi signifikan dalam menentukan tingkatan stres mahasiswa. Hal ini terlihat dari sebaran data pada visualisasi boxplot yang menunjukkan perbedaan nilai yang jelas antar kelompok stres.

Performa Model: Perbandingan akurasi menunjukkan model mana yang lebih adaptif terhadap karakteristik data kesehatan mental mahasiswa. Jika tingkat akurasi cenderung tinggi, ini membuktikan bahwa faktor-faktor akademis dan psikologis dalam dataset ini memiliki pola yang konsisten untuk memprediksi risiko stres.

Efektivitas Metode: Regresi Logistik Ordinal memberikan keunggulan dalam menangkap sifat data yang berjenjang (Low < Medium < High), sedangkan LDA memberikan gambaran fungsi diskriminan yang optimal dalam memisahkan batas-batas antar kelompok tersebut.

Implikasi Praktis: Temuan ini dapat digunakan oleh lembaga pendidikan sebagai dasar sistem peringatan dini (early warning system). Mahasiswa yang terdeteksi memiliki skor kecemasan atau depresi pada ambang tertentu dapat segera diarahkan ke layanan konseling sebelum tingkat stres mereka meningkat ke kategori berbahaya (High).