This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
# Load library yang dibutuhkan
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Baca dataset
data <- read.csv("C:\\Users\\DELL VOSTRO\\OneDrive\\Documents\\anmul\\students_mental_health_survey.csv")
# Lihat 5 baris pertama
head(data)
## Age Course Gender CGPA Stress_Level Depression_Score Anxiety_Score
## 1 25 Others Male 3.56 3 3 2
## 2 24 Engineering Female 2.44 0 3 0
## 3 19 Business Female 3.74 4 0 3
## 4 19 Computer Science Male 3.65 2 1 0
## 5 18 Business Male 3.40 3 3 4
## 6 21 Medical Female 3.35 2 4 3
## Sleep_Quality Physical_Activity Diet_Quality Social_Support
## 1 Good Moderate Good Moderate
## 2 Average Low Average Low
## 3 Good Low Average Moderate
## 4 Average Low Average Moderate
## 5 Good Low Average High
## 6 Good Moderate Good High
## Relationship_Status Substance_Use Counseling_Service_Use Family_History
## 1 Married Never Never No
## 2 Single Occasionally Occasionally No
## 3 In a Relationship Never Occasionally No
## 4 Single Never No
## 5 Married Never Never No
## 6 Single Never Never No
## Chronic_Illness Financial_Stress Extracurricular_Involvement
## 1 No 2 Moderate
## 2 No 3 Low
## 3 No 4 High
## 4 No 4 Moderate
## 5 Yes 0 High
## 6 No 5 Moderate
## Semester_Credit_Load Residence_Type
## 1 17 On-Campus
## 2 27 On-Campus
## 3 15 On-Campus
## 4 20 Off-Campus
## 5 23 On-Campus
## 6 19 Off-Campus
# Lihat struktur data
str(data)
## 'data.frame': 7022 obs. of 20 variables:
## $ Age : int 25 24 19 19 18 21 18 21 24 19 ...
## $ Course : chr "Others" "Engineering" "Business" "Computer Science" ...
## $ Gender : chr "Male" "Female" "Female" "Male" ...
## $ CGPA : num 3.56 2.44 3.74 3.65 3.4 3.35 3.65 3.4 3.8 3.05 ...
## $ Stress_Level : int 3 0 4 2 3 2 2 0 3 2 ...
## $ Depression_Score : int 3 3 0 1 3 4 2 3 2 5 ...
## $ Anxiety_Score : int 2 0 3 0 4 3 5 3 1 0 ...
## $ Sleep_Quality : chr "Good" "Average" "Good" "Average" ...
## $ Physical_Activity : chr "Moderate" "Low" "Low" "Low" ...
## $ Diet_Quality : chr "Good" "Average" "Average" "Average" ...
## $ Social_Support : chr "Moderate" "Low" "Moderate" "Moderate" ...
## $ Relationship_Status : chr "Married" "Single" "In a Relationship" "Single" ...
## $ Substance_Use : chr "Never" "Occasionally" "Never" "" ...
## $ Counseling_Service_Use : chr "Never" "Occasionally" "Occasionally" "Never" ...
## $ Family_History : chr "No" "No" "No" "No" ...
## $ Chronic_Illness : chr "No" "No" "No" "No" ...
## $ Financial_Stress : int 2 3 4 4 0 5 4 3 2 1 ...
## $ Extracurricular_Involvement: chr "Moderate" "Low" "High" "Moderate" ...
## $ Semester_Credit_Load : int 17 27 15 20 23 19 20 23 28 27 ...
## $ Residence_Type : chr "On-Campus" "On-Campus" "On-Campus" "Off-Campus" ...
# Lihat ringkasan statistik
summary(data)
## Age Course Gender CGPA
## Min. :18 Length:7022 Length:7022 Min. :2.440
## 1st Qu.:20 Class :character Class :character 1st Qu.:3.290
## Median :22 Mode :character Mode :character Median :3.500
## Mean :23 Mean :3.491
## 3rd Qu.:25 3rd Qu.:3.700
## Max. :35 Max. :4.000
## NA's :12
## Stress_Level Depression_Score Anxiety_Score Sleep_Quality
## Min. :0.000 Min. :0.000 Min. :0.0 Length:7022
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.0 Class :character
## Median :2.000 Median :2.000 Median :2.0 Mode :character
## Mean :2.428 Mean :2.254 Mean :2.3
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:4.0
## Max. :5.000 Max. :5.000 Max. :5.0
##
## Physical_Activity Diet_Quality Social_Support Relationship_Status
## Length:7022 Length:7022 Length:7022 Length:7022
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Substance_Use Counseling_Service_Use Family_History
## Length:7022 Length:7022 Length:7022
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Chronic_Illness Financial_Stress Extracurricular_Involvement
## Length:7022 Min. :0.000 Length:7022
## Class :character 1st Qu.:1.000 Class :character
## Mode :character Median :2.000 Mode :character
## Mean :2.453
## 3rd Qu.:4.000
## Max. :5.000
##
## Semester_Credit_Load Residence_Type
## Min. :15.00 Length:7022
## 1st Qu.:18.00 Class :character
## Median :22.00 Mode :character
## Mean :22.01
## 3rd Qu.:26.00
## Max. :29.00
##
# Cek apakah ada baris yang duplikat (seluruh kolom sama persis)
duplikat <- duplicated(data)
# Tampilkan jumlah baris duplikat
sum(duplikat)
## [1] 0
#mengecek missing value
colSums(is.na(data))
## Age Course
## 0 0
## Gender CGPA
## 0 12
## Stress_Level Depression_Score
## 0 0
## Anxiety_Score Sleep_Quality
## 0 0
## Physical_Activity Diet_Quality
## 0 0
## Social_Support Relationship_Status
## 0 0
## Substance_Use Counseling_Service_Use
## 0 0
## Family_History Chronic_Illness
## 0 0
## Financial_Stress Extracurricular_Involvement
## 0 0
## Semester_Credit_Load Residence_Type
## 0 0
# Cek jumlah string kosong ("") per kolom
sapply(data, function(x) sum(x == "", na.rm = TRUE))
## Age Course
## 0 0
## Gender CGPA
## 0 0
## Stress_Level Depression_Score
## 0 0
## Anxiety_Score Sleep_Quality
## 0 0
## Physical_Activity Diet_Quality
## 0 0
## Social_Support Relationship_Status
## 0 0
## Substance_Use Counseling_Service_Use
## 15 0
## Family_History Chronic_Illness
## 0 0
## Financial_Stress Extracurricular_Involvement
## 0 0
## Semester_Credit_Load Residence_Type
## 0 0
# Gabungkan keduanya
sapply(data, function(x) sum(is.na(x) | x == "", na.rm = TRUE))
## Age Course
## 0 0
## Gender CGPA
## 0 12
## Stress_Level Depression_Score
## 0 0
## Anxiety_Score Sleep_Quality
## 0 0
## Physical_Activity Diet_Quality
## 0 0
## Social_Support Relationship_Status
## 0 0
## Substance_Use Counseling_Service_Use
## 15 0
## Family_History Chronic_Illness
## 0 0
## Financial_Stress Extracurricular_Involvement
## 0 0
## Semester_Credit_Load Residence_Type
## 0 0
# Imputasi missing CGPA dengan median
median_cgpa <- median(data$CGPA, na.rm = TRUE)
data$CGPA[is.na(data$CGPA)] <- median_cgpa
#Penanganan string kosong untuk subtance_use
# Ubah jadi faktor dulu
data$Substance_Use <- as.factor(data$Substance_Use)
# Hitung modus
modus_substance <- names(sort(table(data$Substance_Use), decreasing = TRUE))[1]
modus_substance
## [1] "Never"
# Ganti string kosong dengan modus
data$Substance_Use[data$Substance_Use == ""] <- modus_substance
#mengecek ulang
sapply(data, function(x) sum(is.na(x) | x == "", na.rm = TRUE))
## Age Course
## 0 0
## Gender CGPA
## 0 0
## Stress_Level Depression_Score
## 0 0
## Anxiety_Score Sleep_Quality
## 0 0
## Physical_Activity Diet_Quality
## 0 0
## Social_Support Relationship_Status
## 0 0
## Substance_Use Counseling_Service_Use
## 0 0
## Family_History Chronic_Illness
## 0 0
## Financial_Stress Extracurricular_Involvement
## 0 0
## Semester_Credit_Load Residence_Type
## 0 0
#Mengubah Tipe Data
# Faktor nominal (tanpa urutan)
data$Course <- as.factor(data$Course)
data$Gender <- as.factor(data$Gender)
data$Relationship_Status <- as.factor(data$Relationship_Status)
data$Family_History <- as.factor(data$Family_History)
data$Chronic_Illness <- as.factor(data$Chronic_Illness)
data$Residence_Type <- as.factor(data$Residence_Type)
# Klasifikasi ulang Stress_Level numerik 0–5 menjadi 3 kelas
data$Stress_Level<- cut(data$Stress_Level,
breaks = c(-1, 1, 3, 5),
labels = c("Stres Ringan", "Stres Sedang", "Stres Berat"),
right = TRUE,
ordered_result = TRUE)
data$Depression_Score <- factor(data$Depression_Score, levels = 0:5, ordered = TRUE)
data$Anxiety_Score <- factor(data$Anxiety_Score, levels = 0:5, ordered = TRUE)
data$Sleep_Quality <- factor(data$Sleep_Quality,
levels = c("Poor", "Average", "Good"), ordered = TRUE)
data$Physical_Activity <- factor(data$Physical_Activity,
levels = c("Low", "Moderate", "High"), ordered = TRUE)
data$Diet_Quality <- factor(data$Diet_Quality,
levels = c("Poor", "Average", "Good"), ordered = TRUE)
data$Social_Support <- factor(data$Social_Support,
levels = c("Low", "Moderate", "High"), ordered = TRUE)
data$Substance_Use <- factor(data$Substance_Use,
levels = c("Never", "Occasionally", "Frequently"), ordered = TRUE)
data$Counseling_Service_Use <- factor(data$Counseling_Service_Use,
levels = c("Never", "Occasionally", "Frequently"), ordered = TRUE)
data$Financial_Stress <- factor(data$Financial_Stress, levels = 0:5, ordered = TRUE)
data$Extracurricular_Involvement <- factor(data$Extracurricular_Involvement,
levels = c("Low", "Moderate", "High"), ordered = TRUE)
Distribusi Target: Pie chart digunakan untuk
menunjukkan proporsi masing-masing kategori tingkat stres pada variabel
Stress_Level.
Visualisasi Variabel Numerik: Histogram dan boxplot digunakan untuk mengeksplorasi distribusi data numerik serta mengidentifikasi keberadaan outlier.
Korelasi Ordinal: Analisis korelasi Spearman diterapkan untuk mengukur hubungan antar variabel ordinal, disertai visualisasi dalam bentuk correlation matrix (peta korelasi).
Visualisasi Relasi: Stacked bar chart dan jitter
plot digunakan untuk menggambarkan hubungan antara variabel prediktor
dengan Stress_Level, sehingga pola dan kecenderungan antar
kelas dapat diamati secara visual.
#EDA
#Distribusi Target (Stress_Level)
library(ggplot2)
library(dplyr)
library(RColorBrewer)
# Hitung frekuensi dan proporsi
df_pie <- data %>%
count(Stress_Level) %>%
mutate(prop = n / sum(n),
label = scales::percent(prop))
# Buat pie chart dengan ggplot2
ggplot(df_pie, aes(x = "", y = prop, fill = Stress_Level)) +
geom_col(color = "white") +
coord_polar(theta = "y") +
geom_text(aes(label = label),
position = position_stack(vjust = 0.5)) +
labs(title = "Distribusi Tingkat Stres Mahasiswa") +
theme_void() +
theme(legend.title = element_blank()) +
scale_fill_brewer(palette = "Set2")
#Distribusi dan Outlier Variabel Numerik
# Konversi variabel ordinal jadi numerik untuk keperluan visualisasi
data_numeric <- data %>%
mutate(across(c(Depression_Score, Anxiety_Score, Financial_Stress), as.numeric))
# Buat list semua variabel numerik
num_vars <- names(select_if(data, is.numeric))
# Histogram dan boxplot untuk tiap variabel
library(dplyr)
for (var in num_vars) {
# Buat bins manual dengan cut
data_binned <- data_numeric %>%
mutate(bin = cut(.data[[var]], breaks = seq(floor(min(data_numeric[[var]], na.rm=TRUE)),
ceiling(max(data_numeric[[var]], na.rm=TRUE)),
by = 0.2), right = FALSE)) %>%
group_by(bin) %>%
summarise(count = n()) %>%
filter(!is.na(bin))
# Buat plot batang yang rapat
print(
ggplot(data_binned, aes(x = bin, y = count)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black", width = 1) +
labs(title = paste("Distribusi", var), x = var, y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
)
# Boxplot tetap sama
print(
ggplot(data_numeric, aes_string(y = var)) +
geom_boxplot(fill = "salmon") +
labs(title = paste("Boxplot", var)) +
theme_minimal()
)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Buat list semua variabel numerik
num_vars <- names(select_if(data, is.numeric))
# Fungsi deteksi outlier metode IQR
detect_outliers_iqr <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
outlier_idx <- which(x < lower_bound | x > upper_bound)
return(list(index = outlier_idx, values = x[outlier_idx]))
}
# Deteksi outlier untuk setiap variabel numerik
for (var in num_vars) {
outlier_result <- detect_outliers_iqr(data_numeric[[var]])
cat("\n==== Variabel:", var, "====\n")
cat("Jumlah outlier:", length(outlier_result$index), "\n")
cat("Nilai-nilai outlier:\n")
print(outlier_result$values)
}
##
## ==== Variabel: Age ====
## Jumlah outlier: 131
## Nilai-nilai outlier:
## [1] 35 35 35 34 33 33 35 35 33 33 33 35 35 33 35 33 34 35 33 33 33 35 35 34 35
## [26] 34 34 34 33 35 33 33 34 34 35 33 33 34 34 34 34 33 34 34 35 34 35 35 33 35
## [51] 34 33 33 34 33 33 34 33 34 34 35 35 35 35 33 33 33 33 34 33 33 35 33 35 33
## [76] 34 34 34 34 34 33 34 34 35 33 33 33 33 35 35 34 34 34 33 34 34 33 34 33 33
## [101] 33 35 33 34 34 35 35 34 34 33 33 34 33 33 33 34 34 35 33 34 33 34 33 34 34
## [126] 33 33 33 33 33 33
##
## ==== Variabel: CGPA ====
## Jumlah outlier: 16
## Nilai-nilai outlier:
## [1] 2.44 2.64 2.65 2.60 2.58 2.52 2.61 2.52 2.66 2.65 2.63 2.65 2.49 2.67 2.64
## [16] 2.60
##
## ==== Variabel: Semester_Credit_Load ====
## Jumlah outlier: 0
## Nilai-nilai outlier:
## integer(0)
#Penanganan outlier
# Hitung quantile batas bawah dan atas
qnt <- quantile(data$CGPA, probs = c(0.05, 0.95), na.rm = TRUE)
# Buat kolom winsorized: batas bawah pakai pmax, batas atas pakai pmin
data$CGPA_winsor <- pmin(pmax(data$CGPA, qnt[1]), qnt[2])
summary(data$CGPA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.440 3.290 3.500 3.491 3.700 4.000
summary(data$CGPA_winsor)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.010 3.290 3.500 3.497 3.700 3.990
boxplot(data$CGPA, main = "Sebelum Winsorize")
boxplot(data$CGPA_winsor, main = "Sesudah Winsorize")
# Hitung quantile batas bawah dan atas
qnt <- quantile(data$Age, probs = c(0.05, 0.95), na.rm = TRUE)
# Buat kolom winsorized: batas bawah pakai pmax, batas atas pakai pmin
data$Age_winsor <- pmin(pmax(data$Age, qnt[1]), qnt[2])
summary(data$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18 20 22 23 25 35
summary(data$Age_winsor)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 20.00 22.00 22.94 25.00 31.00
boxplot(data$Age, main = "Sebelum Winsorize")
boxplot(data$Age_winsor, main = "Sesudah Winsorize")
#menghapus kolom karenaa sudah digantikan
library(dplyr)
data <- dplyr::select(data, -CGPA, -Age)
library(tidyverse)
# Cek Korelasi antar Variabel Ordinal
# Pilih variabel ordinal
ordinal_vars <- dplyr::select(data,
Stress_Level, Depression_Score, Anxiety_Score, Sleep_Quality,
Physical_Activity, Diet_Quality, Social_Support, Substance_Use,
Counseling_Service_Use, Financial_Stress, Extracurricular_Involvement)
# Ubah ke numerik (tanpa mengubah urutan faktor)
ordinal_numeric <- ordinal_vars %>%
mutate(across(everything(), as.numeric))
# Hitung korelasi Spearman
cor_matrix <- cor(ordinal_numeric, method = "spearman")
# Visualisasi korelasi
library(corrplot)
## corrplot 0.95 loaded
corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.7)
library(ggcorrplot)
cor_matrix <- cor(ordinal_numeric, use = "complete.obs", method = "spearman")
ggcorrplot(cor_matrix, hc.order = TRUE, type = "lower",
lab = TRUE, lab_size = 3,
colors = c("red", "white", "blue"))
#Visualisasi hubungan target dengan prediktor
#variabel kategorik
kategorik_vars <- c("Gender", "Course", "Relationship_Status", "Family_History",
"Chronic_Illness", "Residence_Type")
# Loop visualisasi stacked bar
for (var in kategorik_vars) {
p <- ggplot(data, aes_string(x = "Stress_Level", fill = var)) +
geom_bar(position = "fill") +
labs(title = paste("Proporsi", var, "pada Tiap Tingkat Stres"),
y = "Proporsi", x = "Tingkat Stres") +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
}
library(ggplot2)
# Boxplot CGPA_winsor vs Stress_Level
ggplot(data, aes(x = Stress_Level, y = CGPA_winsor)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Distribusi CGPA berdasarkan Tingkat Stres",
x = "Tingkat Stres", y = "CGPA (Winsorized)") +
theme_minimal()
# Boxplot Age_winsor vs Stress_Level
ggplot(data, aes(x = Stress_Level, y = Age_winsor)) +
geom_boxplot(fill = "salmon") +
labs(title = "Distribusi Usia berdasarkan Tingkat Stres",
x = "Tingkat Stres", y = "Usia (Winsorized)") +
theme_minimal()
# Boxplot Age_winsor vs Stress_Level
ggplot(data, aes(x = Stress_Level, y = Semester_Credit_Load)) +
geom_boxplot(fill = "green") +
labs(title = "Distribusi Beban credit semester berdasarkan Tingkat Stres",
x = "Tingkat Stres", y = "Beban credit semester") +
theme_minimal()
# Daftar prediktor ordinal
ordinal_vars <- c("Depression_Score", "Anxiety_Score", "Sleep_Quality",
"Physical_Activity", "Diet_Quality", "Social_Support",
"Substance_Use", "Counseling_Service_Use",
"Financial_Stress", "Extracurricular_Involvement")
# Loop jitter plot
for (var in ordinal_vars) {
p <- ggplot(data, aes_string(x = "Stress_Level", y = var)) +
geom_jitter(width = 0.2, height = 0.2, alpha = 0.6, color = "darkblue") +
labs(title = paste("Hubungan", var, "dengan Tingkat Stres"),
x = "Tingkat Stres", y = var) +
theme_minimal()
print(p)
}
Uji pemilihan variabel dilakukan untuk menentukan prediktor yang
signifikan terhadap variabel target Stress_Level. Adapun
jenis uji yang digunakan disesuaikan dengan tipe data masing-masing
variabel:
Gender,
Course). Variabel dipilih jika memiliki p-value
< 0.05.Depression_Score,
Anxiety_Score). Variabel dipertahankan jika p-value
< 0.05.CGPA_winsor,
Age_winsor) untuk menguji perbedaan antar kelompok
Stress_Level. Variabel dipilih jika p-value <
0.05.Hasil dari ketiga jenis uji ini digunakan sebagai dasar untuk menyaring variabel-variabel yang akan dimasukkan ke dalam pemodelan lebih lanjut. a p-value < 0.05. Hasil dari ketiga uji ini digunakan untuk menyaring variabel yang masuk ke dalam model.
# Load library
library(dplyr)
library(ggpubr)
# 1. UJI CHI-SQUARE untuk variabel kategorik nominal
cat_vars <- c("Gender", "Course", "Relationship_Status", "Family_History",
"Chronic_Illness", "Residence_Type")
for (var in cat_vars) {
tab <- table(data[[var]], data$Stress_Level)
test <- chisq.test(tab)
cat("==== Variabel:", var, "====\n")
print(test)
cat("\n")
}
## ==== Variabel: Gender ====
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 0.56721, df = 2, p-value = 0.7531
##
##
## ==== Variabel: Course ====
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 736.63, df = 10, p-value < 2.2e-16
##
##
## ==== Variabel: Relationship_Status ====
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 0.84678, df = 4, p-value = 0.9321
##
##
## ==== Variabel: Family_History ====
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 3.0644, df = 2, p-value = 0.2161
##
##
## ==== Variabel: Chronic_Illness ====
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 3.0551, df = 2, p-value = 0.2171
##
##
## ==== Variabel: Residence_Type ====
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 3.4645, df = 4, p-value = 0.4833
# 2. KORELASI SPEARMAN untuk variabel ordinal
ordinal_vars <- c("Depression_Score", "Anxiety_Score", "Sleep_Quality",
"Physical_Activity", "Diet_Quality", "Social_Support",
"Substance_Use", "Counseling_Service_Use",
"Financial_Stress", "Extracurricular_Involvement")
# Konversi Stress_Level dan ordinal lain ke numerik
data_numeric <- data %>%
mutate(across(all_of(c("Stress_Level", ordinal_vars)), as.numeric))
# Hitung korelasi Spearman satu per satu
for (var in ordinal_vars) {
cor_test <- cor.test(data_numeric[[var]], data_numeric$Stress_Level, method = "spearman")
cat("==== Korelasi dengan Stress_Level:", var, "====\n")
print(cor_test)
cat("\n")
}
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Depression_Score ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.9883e+10, p-value = 0.001577
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.03770447
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Anxiety_Score ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.9514e+10, p-value = 0.008715
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.03129995
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Sleep_Quality ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7308e+10, p-value = 0.5617
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.006926724
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Physical_Activity ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7398e+10, p-value = 0.6535
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.005358033
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Diet_Quality ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.782e+10, p-value = 0.8701
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.001951299
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Social_Support ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7776e+10, p-value = 0.9202
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.001195238
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Substance_Use ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7501e+10, p-value = 0.765
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.003567904
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Counseling_Service_Use ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.6713e+10, p-value = 0.1487
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.01723695
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Financial_Stress ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7605e+10, p-value = 0.8823
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.001767522
## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties
## ==== Korelasi dengan Stress_Level: Extracurricular_Involvement ====
##
## Spearman's rank correlation rho
##
## data: data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7591e+10, p-value = 0.8659
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.002014835
# 3. UJI KRUSKAL-WALLIS untuk variabel numerik
kruskal_vars <- c("CGPA_winsor", "Age_winsor", "Semester_Credit_Load")
for (var in kruskal_vars) {
test <- kruskal.test(data[[var]] ~ data$Stress_Level)
cat("==== Kruskal-Wallis untuk:", var, "====\n")
print(test)
cat("\n")
}
## ==== Kruskal-Wallis untuk: CGPA_winsor ====
##
## Kruskal-Wallis rank sum test
##
## data: data[[var]] by data$Stress_Level
## Kruskal-Wallis chi-squared = 0.22562, df = 2, p-value = 0.8933
##
##
## ==== Kruskal-Wallis untuk: Age_winsor ====
##
## Kruskal-Wallis rank sum test
##
## data: data[[var]] by data$Stress_Level
## Kruskal-Wallis chi-squared = 5.221, df = 2, p-value = 0.0735
##
##
## ==== Kruskal-Wallis untuk: Semester_Credit_Load ====
##
## Kruskal-Wallis rank sum test
##
## data: data[[var]] by data$Stress_Level
## Kruskal-Wallis chi-squared = 0.56382, df = 2, p-value = 0.7543
selected_vars <- c("Course","Depression_Score", "Anxiety_Score", "Age_winsor",
"Counseling_Service_Use")
Sebelum membangun model Linear Discriminant Analysis (LDA), dilakukan pengujian terhadap beberapa asumsi penting:
Uji Normalitas Multivariat (Mardia Test): Digunakan untuk memastikan bahwa data multivariat mengikuti distribusi normal. Asumsi ini penting agar hasil estimasi LDA menjadi valid dan efisien.
Uji Homogenitas Kovarian (Box’s M Test):
Digunakan untuk menguji apakah matriks kovariansi antar kelompok
Stress_Level bersifat homogen. Homogenitas ini merupakan
syarat utama dalam penerapan LDA.
Kedua uji ini digunakan sebagai dasar untuk menilai kelayakan penerapan model LDA terhadap dataset.
#1. Uji Asumsi Analisis Diskriminan
#a. Uji Normalitas Multivariat (Mardia)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# Pastikan data kamu hanya numerik
lda_data <- data %>%
select(all_of(selected_vars)) %>%
mutate(across(everything(), as.numeric))
mardia_result <- mardia(lda_data)
print(mardia_result)
## Call: mardia(x = lda_data)
##
## Mardia tests of multivariate skew and kurtosis
## Use describe(x) the to get univariate tests
## n.obs = 7022 num.vars = 5
## b1p = 1.6 skew = 1875.39 with probability <= 0
## small sample skew = 1876.46 with probability <= 0
## b2p = 31.04 kurtosis = -19.85 with probability <= 0
#b. Uji Homogenitas Kovarian (Box’s M Test)
library(biotools)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## ---
## biotools version 4.3
# Gabungkan variabel prediktor numerik dan label
lda_data_grouped <- data.frame(Stress_Level = data$Stress_Level, lda_data)
# Jalankan uji Box's M
boxM_result <- boxM(lda_data_grouped[, -1], grouping = lda_data_grouped$Stress_Level)
print(boxM_result)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: lda_data_grouped[, -1]
## Chi-Sq (approx.) = 74.02, df = 30, p-value = 1.378e-05
Stress_Level.polr() dari paket
MASS, dengan Stress_Level sebagai variabel
target.Stress_Level.#MODEL LDA
#1. persiapan data
library(dplyr)
library(tidyverse)
# Pastikan selected_vars sudah sesuai
selected_vars <- c("Course","Depression_Score", "Anxiety_Score", "Age_winsor",
"Counseling_Service_Use")
lda_data <- dplyr::select(data, Stress_Level, dplyr::all_of(selected_vars)) %>%
dplyr::mutate(dplyr::across(-Stress_Level, as.numeric))
#2. model lda
library(MASS)
lda_model <- lda(Stress_Level ~ ., data = lda_data)
print(lda_model)
## Call:
## lda(Stress_Level ~ ., data = lda_data)
##
## Prior probabilities of groups:
## Stres Ringan Stres Sedang Stres Berat
## 0.3319567 0.3822273 0.2858160
##
## Group means:
## Course Depression_Score Anxiety_Score Age_winsor
## Stres Ringan 3.475761 3.325611 3.364221 23.02960
## Stres Sedang 3.682936 3.263040 3.292474 22.94560
## Stres Berat 4.163926 3.160438 3.237170 22.81764
## Counseling_Service_Use
## Stres Ringan 1.469755
## Stres Sedang 1.497765
## Stres Berat 1.501246
##
## Coefficients of linear discriminants:
## LD1 LD2
## Course 0.64529526 0.14695283
## Depression_Score -0.05388436 0.13899196
## Anxiety_Score -0.11922127 0.33482563
## Age_winsor -0.03631326 0.03142163
## Counseling_Service_Use 0.11926662 -1.16128465
##
## Proportion of trace:
## LD1 LD2
## 0.9924 0.0076
#3. Prediksi dan Evaluasi Akurasi
# Lakukan prediksi
lda_pred <- predict(lda_model)
# Confusion Matrix
table(Predicted = lda_pred$class, Actual = lda_data$Stress_Level)
## Actual
## Predicted Stres Ringan Stres Sedang Stres Berat
## Stres Ringan 529 518 237
## Stres Sedang 1424 1729 1372
## Stres Berat 378 437 398
# Hitung akurasi
# Konversi Stress_Level menjadi faktor biasa
lda_data$Stress_Level <- factor(as.character(lda_data$Stress_Level))
# Hitung akurasi
mean(lda_pred$class == lda_data$Stress_Level)
## [1] 0.3782398
#4. Visualisasi Plot fungsi diskriminan
library(ggplot2)
# Ekstrak skor LD dari prediksi
lda_values <- as.data.frame(lda_pred$x)
lda_values$Class <- lda_data$Stress_Level
# Plot LDA (LD1 vs LD2)
ggplot(lda_values, aes(x = LD1, y = LD2, color = Class)) +
geom_point(alpha = 0.5) +
stat_ellipse(level = 0.95, linetype = 2) + # menambahkan ellipse
labs(title = "Visualisasi Fungsi Diskriminan (LD1 vs LD2)",
x = "LD1 (Fungsi Diskriminan 1)",
y = "LD2 (Fungsi Diskriminan 2)") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
#MODEL Regresi Logistik Ordinal
#1. Pembentukan Model (Estimasi Parameter)
# Pastikan target adalah ordered factor
data$Stress_Level <- factor(data$Stress_Level,
levels = c("Stres Ringan", "Stres Sedang", "Stres Berat"),
ordered = TRUE)
# Konversi prediktor ke numerik jika belum
for (var in selected_vars) {
data[[var]] <- as.numeric(data[[var]])
}
# Load library
library(MASS)
# Bangun model regresi logistik ordinal
model_polr <- polr(Stress_Level ~ Course + Depression_Score + Anxiety_Score + Age_winsor +
Counseling_Service_Use,
data = data, Hess = TRUE)
# Tampilkan ringkasan model
summary(model_polr)
## Call:
## polr(formula = Stress_Level ~ Course + Depression_Score + Anxiety_Score +
## Age_winsor + Counseling_Service_Use, data = data, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## Course 0.21412 0.014937 14.334
## Depression_Score -0.01823 0.013759 -1.325
## Anxiety_Score -0.04127 0.013611 -3.032
## Age_winsor -0.01264 0.006023 -2.099
## Counseling_Service_Use 0.04839 0.033189 1.458
##
## Intercepts:
## Value Std. Error t value
## Stres Ringan|Stres Sedang -0.3344 0.1713 -1.9522
## Stres Sedang|Stres Berat 1.3247 0.1721 7.6989
##
## Residual Deviance: 15097.17
## AIC: 15111.17
# 2. Uji Signifikansi Variabel
#a. Uji Parsial (Z-value & P-value)
(ctable <- coef(summary(model_polr)))
## Value Std. Error t value
## Course 0.21411583 0.01493746 14.334157
## Depression_Score -0.01823004 0.01375867 -1.324985
## Anxiety_Score -0.04126707 0.01361146 -3.031789
## Age_winsor -0.01264262 0.00602262 -2.099190
## Counseling_Service_Use 0.04839441 0.03318941 1.458128
## Stres Ringan|Stres Sedang -0.33440519 0.17130022 -1.952159
## Stres Sedang|Stres Berat 1.32472610 0.17206770 7.698866
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
## Value Std. Error t value p value
## Course 0.21411583 0.01493746 14.334157 1.338620e-46
## Depression_Score -0.01823004 0.01375867 -1.324985 1.851760e-01
## Anxiety_Score -0.04126707 0.01361146 -3.031789 2.431094e-03
## Age_winsor -0.01264262 0.00602262 -2.099190 3.580019e-02
## Counseling_Service_Use 0.04839441 0.03318941 1.458128 1.448053e-01
## Stres Ringan|Stres Sedang -0.33440519 0.17130022 -1.952159 5.091938e-02
## Stres Sedang|Stres Berat 1.32472610 0.17206770 7.698866 1.372794e-14
# b. Uji Serentak (Likelihood Ratio Test)
null_model <- polr(Stress_Level ~ 1, data = data, Hess = TRUE)
anova(null_model, model_polr)
## Likelihood ratio tests of ordinal regression models
##
## Response: Stress_Level
## Model
## 1 1
## 2 Course + Depression_Score + Anxiety_Score + Age_winsor + Counseling_Service_Use
## Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 7020 15330.81
## 2 7015 15097.17 1 vs 2 5 233.6315 0
#3. Evaluasi Akurasi Model
# Prediksi kelas
pred <- predict(model_polr, data)
# Confusion matrix
table(Predicted = pred, Actual = data$Stress_Level)
## Actual
## Predicted Stres Ringan Stres Sedang Stres Berat
## Stres Ringan 757 751 341
## Stres Sedang 1383 1715 1550
## Stres Berat 191 218 116
# Akurasi
# Konversi target dari ordered factor → factor biasa
data$Stress_Level <- factor(as.character(data$Stress_Level))
# Lakukan evaluasi akurasi
mean(pred == data$Stress_Level)
## [1] 0.368556
#4. Interpretasi (Odd Ratio)
exp(coef(model_polr))
## Course Depression_Score Anxiety_Score
## 1.2387661 0.9819351 0.9595728
## Age_winsor Counseling_Service_Use
## 0.9874370 1.0495845
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.