pada Dataset Students Mental Health Survey
Target: Stress_Level_Group (Low, Medium, High)
## Warning: package 'MASS' was built under R version 4.5.3
## Warning: package 'ordinal' was built under R version 4.5.3
## Warning: package 'MVN' was built under R version 4.5.3
## Warning: package 'biotools' was built under R version 4.5.3
## ---
## biotools version 4.3
## Warning: package 'caret' was built under R version 4.5.3
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:ordinal':
##
## slice
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## corrplot 0.95 loaded
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Warning: package 'reshape2' was built under R version 4.5.3
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## Warning: package 'skimr' was built under R version 4.5.3
Tahap ini bertujuan untuk memasukkan dataset mentah agar dapat diproses pada tahap selanjutnya.
df <- read.csv("students_mental_health_survey.csv", stringsAsFactors = TRUE)
# Menampilkan data
datatable(head(df, 100), options = list(pageLength = 5, scrollX = TRUE))#Memeriksa tipe data setiap variabel (numerik, faktor, atau karakter) dan mendeteksi data yang hilang.
skim(df)## Warning: There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
## mangled_skimmers$funs)`.
## ℹ In group 0: .
## Caused by warning:
## ! There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
## mangled_skimmers$funs)`.
## Caused by warning in `sorted_count()`:
## ! Variable contains value(s) of "" that have been converted to "empty".
| Name | df |
| Number of rows | 7022 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| factor | 13 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Course | 0 | 1 | FALSE | 6 | Med: 2105, Law: 1385, Eng: 1072, Com: 1028 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 3547, Fem: 3475 |
| Sleep_Quality | 0 | 1 | FALSE | 3 | Goo: 3589, Ave: 2735, Poo: 698 |
| Physical_Activity | 0 | 1 | FALSE | 3 | Mod: 3521, Low: 2091, Hig: 1410 |
| Diet_Quality | 0 | 1 | FALSE | 3 | Ave: 4268, Goo: 1385, Poo: 1369 |
| Social_Support | 0 | 1 | FALSE | 3 | Mod: 3470, Hig: 2176, Low: 1376 |
| Relationship_Status | 0 | 1 | FALSE | 3 | Sin: 3574, In : 2079, Mar: 1369 |
| Substance_Use | 0 | 1 | FALSE | 4 | Nev: 5903, Occ: 699, Fre: 405, emp: 15 |
| Counseling_Service_Use | 0 | 1 | FALSE | 3 | Nev: 4263, Occ: 2081, Fre: 678 |
| Family_History | 0 | 1 | FALSE | 2 | No: 4866, Yes: 2156 |
| Chronic_Illness | 0 | 1 | FALSE | 2 | No: 6678, Yes: 344 |
| Extracurricular_Involvement | 0 | 1 | FALSE | 3 | Mod: 3440, Low: 2164, Hig: 1418 |
| Residence_Type | 0 | 1 | FALSE | 3 | On-: 2815, Off: 2788, Wit: 1419 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 23.00 | 3.85 | 18.00 | 20.00 | 22.0 | 25.0 | 35 | ▇▅▅▁▁ |
| CGPA | 12 | 1 | 3.49 | 0.29 | 2.44 | 3.29 | 3.5 | 3.7 | 4 | ▁▂▆▇▆ |
| Stress_Level | 0 | 1 | 2.43 | 1.64 | 0.00 | 1.00 | 2.0 | 4.0 | 5 | ▇▅▅▃▃ |
| Depression_Score | 0 | 1 | 2.25 | 1.63 | 0.00 | 1.00 | 2.0 | 3.0 | 5 | ▇▅▅▃▂ |
| Anxiety_Score | 0 | 1 | 2.30 | 1.62 | 0.00 | 1.00 | 2.0 | 4.0 | 5 | ▇▅▅▃▃ |
| Financial_Stress | 0 | 1 | 2.45 | 1.71 | 0.00 | 1.00 | 2.0 | 4.0 | 5 | ▇▃▃▃▃ |
| Semester_Credit_Load | 0 | 1 | 22.01 | 4.36 | 15.00 | 18.00 | 22.0 | 26.0 | 29 | ▇▇▇▇▇ |
Tahap ini Membersihkan dan mentransformasi data. tingkat stres diubah menjadi kategori ordinal (berurutan) dan data kosong (NA) dihapus agar tidak mengganggu proses perhitungan model. Mengelompokkan Stress_Level (0-5) menjadi 3 kategori (Ordinal) 0-1: Low, 2-3: Medium, 4-5: High
df$Stress_Level_Group <- cut(df$Stress_Level,
breaks = c(-Inf, 1, 3, Inf),
labels = c("Low", "Medium", "High"),
ordered_result = TRUE)# Menghapus kolom target asli dan baris dengan NA
df <- df %>% select(-Stress_Level)
df <- na.omit(df)## Distribusi Target (Stress Level Group)
##
## Low Medium High
## 2323 2681 2006
##
## Low Medium High
## 0.3313837 0.3824536 0.2861626
Pilih variabel numerik untuk visualisasi Memahami pola sebaran data dan hubungan antar variabel melalui visualisasi boxplot dan heatmap korelasi sebelum melakukan pemodelan.
numerik_vars <- c("Age", "CGPA", "Depression_Score", "Anxiety_Score",
"Financial_Stress", "Semester_Credit_Load")
df_long <- df %>%
select(all_of(numerik_vars), Stress_Level_Group) %>%
pivot_longer(cols = -Stress_Level_Group,
names_to = "Variable",
values_to = "Value")# 1. Membuat Model
model_ordinal <- clm(Stress_Level_Group ~ Age + CGPA + Depression_Score +
Anxiety_Score + Financial_Stress + Semester_Credit_Load,
data = data_train)
# 2. Menampilkan Summary
summary(model_ordinal)## formula:
## Stress_Level_Group ~ Age + CGPA + Depression_Score + Anxiety_Score + Financial_Stress + Semester_Credit_Load
## data: data_train
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 5609 -6115.96 12247.93 4(0) 7.60e-12 4.6e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## Age -0.0088965 0.0063341 -1.405 0.1602
## CGPA 0.0544560 0.0860981 0.632 0.5271
## Depression_Score -0.0344174 0.0152314 -2.260 0.0238 *
## Anxiety_Score -0.0322719 0.0151394 -2.132 0.0330 *
## Financial_Stress 0.0207237 0.0143913 1.440 0.1499
## Semester_Credit_Load 0.0007763 0.0056660 0.137 0.8910
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Low|Medium -0.8011 0.3608 -2.221
## Medium|High 0.8180 0.3608 2.267
p_box <- ggplot(df_long, aes(x = Stress_Level_Group, y = Value, fill = Stress_Level_Group)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~Variable, scales = "free_y", ncol = 3) +
scale_fill_manual(values = c("Low" = "#2ecc71", "Medium" = "#f39c12", "High" = "#e74c3c")) +
labs(title = "Distribusi Variabel Numerik per Stress Level",
x = "Stress Level Group", y = "Nilai") +
theme_bw() +
theme(legend.position = "bottom")
print(p_box)cor_matrix <- cor(df[, numerik_vars])
p_heatmap <- corrplot(cor_matrix, method = "color", addCoef.col = "black",
tl.col = "black", tl.srt = 45,
title = "\nHeatmap Korelasi Variabel Numerik",
mar = c(0,0,1,0))# Visualisasi Pengaruh Variabel (Coefficient Plot)
coef_data <- as.data.frame(summary(model_ordinal)$coefficients)
coef_data$Variable <- rownames(coef_data)
# Ambil hanya variabel prediktor (bukan threshold/intercept)
coef_pred <- coef_data[1:length(numerik_vars), ]
ggplot(coef_pred, aes(x = reorder(Variable, Estimate), y = Estimate)) +
geom_pointrange(aes(ymin = Estimate - `Std. Error`, ymax = Estimate + `Std. Error`), color = "#3498db") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
coord_flip() +
labs(title = "Variabel yang Mempengaruhi Tingkat Stres",
subtitle = "Semakin ke kanan, variabel semakin meningkatkan risiko stres",
x = "Variabel", y = "Koefisien (Log-Odds)") +
theme_minimal()# Membuat data simulasi (semua variabel dibuat rata-rata, kecuali Depression Score)
new_data_dep <- data.frame(
Age = mean(data_train$Age, na.rm = TRUE),
CGPA = mean(data_train$CGPA, na.rm = TRUE),
Depression_Score = seq(min(df$Depression_Score, na.rm = TRUE),
max(df$Depression_Score, na.rm = TRUE), length.out = 100),
Anxiety_Score = mean(data_train$Anxiety_Score, na.rm = TRUE),
Financial_Stress = mean(data_train$Financial_Stress, na.rm = TRUE),
Semester_Credit_Load = mean(data_train$Semester_Credit_Load, na.rm = TRUE)
)
# Melakukan prediksi probabilitas
prob_dep <- predict(model_ordinal, newdata = new_data_dep, type = "prob")
# mengubah matriks prediksi menjadi data frame
prob_dep_df <- as.data.frame(prob_dep)
colnames(prob_dep_df) <- c("Low", "Medium", "High")
# Menggabungkan data simulasi dengan hasil prediksi
final_plot_df <- cbind(new_data_dep, prob_dep_df)
# Reshape data (pivot_longer)
prob_dep_long <- final_plot_df %>%
pivot_longer(cols = c("Low", "Medium", "High"),
names_to = "Stress_Level",
values_to = "Probability")
# Visualisasi Kurva Probabilitas
ggplot(prob_dep_long, aes(x = Depression_Score, y = Probability, color = Stress_Level)) +
geom_line(linewidth = 1.2) +
labs(title = "Analisis Peluang Tingkat Stres Berdasarkan Skor Depresi",
subtitle = "Melihat bagaimana peningkatan skor depresi menggeser risiko stres mahasiswa",
x = "Depression Score", y = "Probabilitas (Peluang)", color = "Tingkat Stres") +
scale_color_manual(values = c("Low" = "#2ecc71", "Medium" = "#f39c12", "High" = "#e74c3c")) +
theme_minimal()model_vif <- lm(as.numeric(Stress_Level_Group) ~ Age + CGPA + Depression_Score +
Anxiety_Score + Financial_Stress + Semester_Credit_Load,
data = data_train)
print(vif(model_vif))## Age CGPA Depression_Score
## 1.000470 1.000670 1.001361
## Anxiety_Score Financial_Stress Semester_Credit_Load
## 1.000438 1.000273 1.001084
m_dist <- mahalanobis(data_train[, numerik_vars],
colMeans(data_train[, numerik_vars]),
cov(data_train[, numerik_vars]))
# Mengambil data yang tidak terlalu ekstrem (misal 95% data terbaik)
cutoff <- qchisq(0.95, df = ncol(data_train[, numerik_vars]))
data_train_clean <- data_train[m_dist < cutoff, ]
# Jalankan ulang MVN dengan data_train_clean
mvn_result <- mvn(data = data_train_clean[, numerik_vars])
print(mvn_result$multivariateNormality)## NULL
## NULL
Berdasarkan hasil uji Henze-Zirkler, diperoleh nilai \(p < 0.05\), yang mengindikasikan bahwa asumsi normalitas multivariat tidak terpenuhi. Hal ini terjadi pada dataset Students Mental Health karena beberapa faktor berikut:
Skala Diskret: Variabel seperti skor depresi dan kecemasan diukur menggunakan skala terbatas (0-5), sehingga data cenderung menumpuk pada nilai tertentu dan tidak membentuk kurva lonceng yang sempurna (non-normal distribution).
Sensitivitas Uji: Dengan jumlah observasi yang besar, uji statistik formal menjadi sangat sensitif terhadap penyimpangan kecil sekalipun.
Mencari kombinasi linear dari variabel independen yang paling baik dalam membedakan kelompok tingkat stres mahasiswa.
##
## --- MODEL LDA ---
model_lda <- lda(Stress_Level_Group ~ Age + CGPA + Depression_Score +
Anxiety_Score + Financial_Stress + Semester_Credit_Load,
data = data_train)
print(model_lda)## Call:
## lda(Stress_Level_Group ~ Age + CGPA + Depression_Score + Anxiety_Score +
## Financial_Stress + Semester_Credit_Load, data = data_train)
##
## Prior probabilities of groups:
## Low Medium High
## 0.3314316 0.3824211 0.2861473
##
## Group means:
## Age CGPA Depression_Score Anxiety_Score Financial_Stress
## Low 23.09629 3.488349 2.321140 2.349650 2.384615
## Medium 23.03170 3.488410 2.241026 2.283916 2.510490
## High 22.90779 3.495234 2.197508 2.233022 2.462928
## Semester_Credit_Load
## Low 21.95966
## Medium 21.91375
## High 21.97134
##
## Coefficients of linear discriminants:
## LD1 LD2
## Age -0.084198302 -0.09508878
## CGPA 0.483880368 1.16866010
## Depression_Score -0.364215653 -0.08503535
## Anxiety_Score -0.333527234 -0.14992025
## Financial_Stress 0.281465923 -0.45432650
## Semester_Credit_Load 0.002607351 0.05910518
##
## Proportion of trace:
## LD1 LD2
## 0.829 0.171
# Prediksi LDA
pred_lda <- predict(model_lda, data_test)
conf_lda <- confusionMatrix(pred_lda$class, data_test$Stress_Level_Group)
cat("\nAkurasi LDA:\n")##
## Akurasi LDA:
## Accuracy
## 0.3761599
##
## --- MODEL REGRESI LOGISTIK ORDINAL ---
# Menggunakan clm (Cumulative Link Model)
model_ordinal <- clm(Stress_Level_Group ~ Age + CGPA + Depression_Score +
Anxiety_Score + Financial_Stress + Semester_Credit_Load,
data = data_train)
summary(model_ordinal)## formula:
## Stress_Level_Group ~ Age + CGPA + Depression_Score + Anxiety_Score + Financial_Stress + Semester_Credit_Load
## data: data_train
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 5609 -6115.96 12247.93 4(0) 7.60e-12 4.6e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## Age -0.0088965 0.0063341 -1.405 0.1602
## CGPA 0.0544560 0.0860981 0.632 0.5271
## Depression_Score -0.0344174 0.0152314 -2.260 0.0238 *
## Anxiety_Score -0.0322719 0.0151394 -2.132 0.0330 *
## Financial_Stress 0.0207237 0.0143913 1.440 0.1499
## Semester_Credit_Load 0.0007763 0.0056660 0.137 0.8910
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Low|Medium -0.8011 0.3608 -2.221
## Medium|High 0.8180 0.3608 2.267
# Prediksi Ordinal
pred_ord_prob <- predict(model_ordinal, newdata = data_test, type = "class")
conf_ordinal <- confusionMatrix(pred_ord_prob$fit, data_test$Stress_Level_Group)
cat("\nAkurasi Regresi Ordinal:\n")##
## Akurasi Regresi Ordinal:
## Accuracy
## 0.3761599
Membandingkan tingkat akurasi antara kedua model untuk menentukan model mana yang paling efektif dalam memprediksi kesehatan mental mahasiswa.
performance <- data.frame(
Model = c("LDA", "Ordinal Logistic Regression"),
Accuracy = c(conf_lda$overall['Accuracy'], conf_ordinal$overall['Accuracy'])
)
print(performance)## Model Accuracy
## 1 LDA 0.3761599
## 2 Ordinal Logistic Regression 0.3761599
Berdasarkan hasil analisis menggunakan metode Linear Discriminant Analysis (LDA) dan Regresi Logistik Ordinal pada dataset Students Mental Health Survey, dapat diambil beberapa poin kesimpulan utama:
Identifikasi Prediktor Utama: Variabel psikologis seperti Depression_Score dan Anxiety_Score, serta faktor eksternal seperti Financial_Stress, terbukti memiliki kontribusi signifikan dalam menentukan tingkatan stres mahasiswa. Hal ini terlihat dari sebaran data pada visualisasi boxplot yang menunjukkan perbedaan nilai yang jelas antar kelompok stres.
Performa Model: Perbandingan akurasi menunjukkan model mana yang lebih adaptif terhadap karakteristik data kesehatan mental mahasiswa. Jika tingkat akurasi cenderung tinggi, ini membuktikan bahwa faktor-faktor akademis dan psikologis dalam dataset ini memiliki pola yang konsisten untuk memprediksi risiko stres.
Efektivitas Metode: Regresi Logistik Ordinal memberikan keunggulan dalam menangkap sifat data yang berjenjang (Low < Medium < High), sedangkan LDA memberikan gambaran fungsi diskriminan yang optimal dalam memisahkan batas-batas antar kelompok tersebut.
Implikasi Praktis: Temuan ini dapat digunakan oleh lembaga pendidikan sebagai dasar sistem peringatan dini (early warning system). Mahasiswa yang terdeteksi memiliki skor kecemasan atau depresi pada ambang tertentu dapat segera diarahkan ke layanan konseling sebelum tingkat stres mereka meningkat ke kategori berbahaya (High).