uas anmul

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# Load library yang dibutuhkan
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Baca dataset
data <- read.csv("C:\\Users\\DELL VOSTRO\\OneDrive\\Documents\\anmul\\students_mental_health_survey.csv")

# Lihat 5 baris pertama
head(data)

##   Age           Course Gender CGPA Stress_Level Depression_Score Anxiety_Score
## 1  25           Others   Male 3.56            3                3             2
## 2  24      Engineering Female 2.44            0                3             0
## 3  19         Business Female 3.74            4                0             3
## 4  19 Computer Science   Male 3.65            2                1             0
## 5  18         Business   Male 3.40            3                3             4
## 6  21          Medical Female 3.35            2                4             3
##   Sleep_Quality Physical_Activity Diet_Quality Social_Support
## 1          Good          Moderate         Good       Moderate
## 2       Average               Low      Average            Low
## 3          Good               Low      Average       Moderate
## 4       Average               Low      Average       Moderate
## 5          Good               Low      Average           High
## 6          Good          Moderate         Good           High
##   Relationship_Status Substance_Use Counseling_Service_Use Family_History
## 1             Married         Never                  Never             No
## 2              Single  Occasionally           Occasionally             No
## 3   In a Relationship         Never           Occasionally             No
## 4              Single                                Never             No
## 5             Married         Never                  Never             No
## 6              Single         Never                  Never             No
##   Chronic_Illness Financial_Stress Extracurricular_Involvement
## 1              No                2                    Moderate
## 2              No                3                         Low
## 3              No                4                        High
## 4              No                4                    Moderate
## 5             Yes                0                        High
## 6              No                5                    Moderate
##   Semester_Credit_Load Residence_Type
## 1                   17      On-Campus
## 2                   27      On-Campus
## 3                   15      On-Campus
## 4                   20     Off-Campus
## 5                   23      On-Campus
## 6                   19     Off-Campus

Preprocessing Data

Import dataset dan cek struktur, ringkasan statistik, duplikat, dan missing value.
Imputasi data:
- Nilai numerik (seperti CGPA) diimputasi dengan median.
- Nilai kategorik kosong (seperti Substance_Use) diisi dengan modus.
Transformasi tipe data ke bentuk yang sesuai:
- Stress_Level dikategorikan ulang jadi tiga kelas ordinal.
- Variabel ordinal seperti Depression_Score, Diet_Quality, dll., diubah ke ordered factor.
Penanganan outlier pada variabel numerik (CGPA, Age) dengan metode winsorizing.

# Lihat struktur data
str(data)

## 'data.frame':    7022 obs. of  20 variables:
##  $ Age                        : int  25 24 19 19 18 21 18 21 24 19 ...
##  $ Course                     : chr  "Others" "Engineering" "Business" "Computer Science" ...
##  $ Gender                     : chr  "Male" "Female" "Female" "Male" ...
##  $ CGPA                       : num  3.56 2.44 3.74 3.65 3.4 3.35 3.65 3.4 3.8 3.05 ...
##  $ Stress_Level               : int  3 0 4 2 3 2 2 0 3 2 ...
##  $ Depression_Score           : int  3 3 0 1 3 4 2 3 2 5 ...
##  $ Anxiety_Score              : int  2 0 3 0 4 3 5 3 1 0 ...
##  $ Sleep_Quality              : chr  "Good" "Average" "Good" "Average" ...
##  $ Physical_Activity          : chr  "Moderate" "Low" "Low" "Low" ...
##  $ Diet_Quality               : chr  "Good" "Average" "Average" "Average" ...
##  $ Social_Support             : chr  "Moderate" "Low" "Moderate" "Moderate" ...
##  $ Relationship_Status        : chr  "Married" "Single" "In a Relationship" "Single" ...
##  $ Substance_Use              : chr  "Never" "Occasionally" "Never" "" ...
##  $ Counseling_Service_Use     : chr  "Never" "Occasionally" "Occasionally" "Never" ...
##  $ Family_History             : chr  "No" "No" "No" "No" ...
##  $ Chronic_Illness            : chr  "No" "No" "No" "No" ...
##  $ Financial_Stress           : int  2 3 4 4 0 5 4 3 2 1 ...
##  $ Extracurricular_Involvement: chr  "Moderate" "Low" "High" "Moderate" ...
##  $ Semester_Credit_Load       : int  17 27 15 20 23 19 20 23 28 27 ...
##  $ Residence_Type             : chr  "On-Campus" "On-Campus" "On-Campus" "Off-Campus" ...

# Lihat ringkasan statistik
summary(data)

##       Age        Course             Gender               CGPA      
##  Min.   :18   Length:7022        Length:7022        Min.   :2.440  
##  1st Qu.:20   Class :character   Class :character   1st Qu.:3.290  
##  Median :22   Mode  :character   Mode  :character   Median :3.500  
##  Mean   :23                                         Mean   :3.491  
##  3rd Qu.:25                                         3rd Qu.:3.700  
##  Max.   :35                                         Max.   :4.000  
##                                                     NA's   :12     
##   Stress_Level   Depression_Score Anxiety_Score Sleep_Quality     
##  Min.   :0.000   Min.   :0.000    Min.   :0.0   Length:7022       
##  1st Qu.:1.000   1st Qu.:1.000    1st Qu.:1.0   Class :character  
##  Median :2.000   Median :2.000    Median :2.0   Mode  :character  
##  Mean   :2.428   Mean   :2.254    Mean   :2.3                     
##  3rd Qu.:4.000   3rd Qu.:3.000    3rd Qu.:4.0                     
##  Max.   :5.000   Max.   :5.000    Max.   :5.0                     
##                                                                   
##  Physical_Activity  Diet_Quality       Social_Support     Relationship_Status
##  Length:7022        Length:7022        Length:7022        Length:7022        
##  Class :character   Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  Substance_Use      Counseling_Service_Use Family_History    
##  Length:7022        Length:7022            Length:7022       
##  Class :character   Class :character       Class :character  
##  Mode  :character   Mode  :character       Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##  Chronic_Illness    Financial_Stress Extracurricular_Involvement
##  Length:7022        Min.   :0.000    Length:7022                
##  Class :character   1st Qu.:1.000    Class :character           
##  Mode  :character   Median :2.000    Mode  :character           
##                     Mean   :2.453                               
##                     3rd Qu.:4.000                               
##                     Max.   :5.000                               
##                                                                 
##  Semester_Credit_Load Residence_Type    
##  Min.   :15.00        Length:7022       
##  1st Qu.:18.00        Class :character  
##  Median :22.00        Mode  :character  
##  Mean   :22.01                          
##  3rd Qu.:26.00                          
##  Max.   :29.00                          
##

# Cek apakah ada baris yang duplikat (seluruh kolom sama persis)
duplikat <- duplicated(data)

# Tampilkan jumlah baris duplikat
sum(duplikat)

## [1] 0

#mengecek missing value
colSums(is.na(data))

##                         Age                      Course 
##                           0                           0 
##                      Gender                        CGPA 
##                           0                          12 
##                Stress_Level            Depression_Score 
##                           0                           0 
##               Anxiety_Score               Sleep_Quality 
##                           0                           0 
##           Physical_Activity                Diet_Quality 
##                           0                           0 
##              Social_Support         Relationship_Status 
##                           0                           0 
##               Substance_Use      Counseling_Service_Use 
##                           0                           0 
##              Family_History             Chronic_Illness 
##                           0                           0 
##            Financial_Stress Extracurricular_Involvement 
##                           0                           0 
##        Semester_Credit_Load              Residence_Type 
##                           0                           0

# Cek jumlah string kosong ("") per kolom
sapply(data, function(x) sum(x == "", na.rm = TRUE))

##                         Age                      Course 
##                           0                           0 
##                      Gender                        CGPA 
##                           0                           0 
##                Stress_Level            Depression_Score 
##                           0                           0 
##               Anxiety_Score               Sleep_Quality 
##                           0                           0 
##           Physical_Activity                Diet_Quality 
##                           0                           0 
##              Social_Support         Relationship_Status 
##                           0                           0 
##               Substance_Use      Counseling_Service_Use 
##                          15                           0 
##              Family_History             Chronic_Illness 
##                           0                           0 
##            Financial_Stress Extracurricular_Involvement 
##                           0                           0 
##        Semester_Credit_Load              Residence_Type 
##                           0                           0

# Gabungkan keduanya
sapply(data, function(x) sum(is.na(x) | x == "", na.rm = TRUE))

##                         Age                      Course 
##                           0                           0 
##                      Gender                        CGPA 
##                           0                          12 
##                Stress_Level            Depression_Score 
##                           0                           0 
##               Anxiety_Score               Sleep_Quality 
##                           0                           0 
##           Physical_Activity                Diet_Quality 
##                           0                           0 
##              Social_Support         Relationship_Status 
##                           0                           0 
##               Substance_Use      Counseling_Service_Use 
##                          15                           0 
##              Family_History             Chronic_Illness 
##                           0                           0 
##            Financial_Stress Extracurricular_Involvement 
##                           0                           0 
##        Semester_Credit_Load              Residence_Type 
##                           0                           0

# Imputasi missing CGPA dengan median
median_cgpa <- median(data$CGPA, na.rm = TRUE)
data$CGPA[is.na(data$CGPA)] <- median_cgpa

#Penanganan string kosong untuk subtance_use
# Ubah jadi faktor dulu 
data$Substance_Use <- as.factor(data$Substance_Use)

# Hitung modus
modus_substance <- names(sort(table(data$Substance_Use), decreasing = TRUE))[1]
modus_substance

## [1] "Never"

# Ganti string kosong dengan modus
data$Substance_Use[data$Substance_Use == ""] <- modus_substance

#mengecek ulang 
sapply(data, function(x) sum(is.na(x) | x == "", na.rm = TRUE))

##                         Age                      Course 
##                           0                           0 
##                      Gender                        CGPA 
##                           0                           0 
##                Stress_Level            Depression_Score 
##                           0                           0 
##               Anxiety_Score               Sleep_Quality 
##                           0                           0 
##           Physical_Activity                Diet_Quality 
##                           0                           0 
##              Social_Support         Relationship_Status 
##                           0                           0 
##               Substance_Use      Counseling_Service_Use 
##                           0                           0 
##              Family_History             Chronic_Illness 
##                           0                           0 
##            Financial_Stress Extracurricular_Involvement 
##                           0                           0 
##        Semester_Credit_Load              Residence_Type 
##                           0                           0

#Mengubah Tipe Data
# Faktor nominal (tanpa urutan)
data$Course <- as.factor(data$Course)
data$Gender <- as.factor(data$Gender)
data$Relationship_Status <- as.factor(data$Relationship_Status)
data$Family_History <- as.factor(data$Family_History)
data$Chronic_Illness <- as.factor(data$Chronic_Illness)
data$Residence_Type <- as.factor(data$Residence_Type)


# Klasifikasi ulang Stress_Level numerik 0–5 menjadi 3 kelas
data$Stress_Level<- cut(data$Stress_Level,
                        breaks = c(-1, 1, 3, 5),
                        labels = c("Stres Ringan", "Stres Sedang", "Stres Berat"),
                        right = TRUE,
                        ordered_result = TRUE)



data$Depression_Score <- factor(data$Depression_Score, levels = 0:5, ordered = TRUE)
data$Anxiety_Score <- factor(data$Anxiety_Score, levels = 0:5, ordered = TRUE)

data$Sleep_Quality <- factor(data$Sleep_Quality,
                             levels = c("Poor", "Average", "Good"), ordered = TRUE)

data$Physical_Activity <- factor(data$Physical_Activity,
                                 levels = c("Low", "Moderate", "High"), ordered = TRUE)

data$Diet_Quality <- factor(data$Diet_Quality,
                            levels = c("Poor", "Average", "Good"), ordered = TRUE)

data$Social_Support <- factor(data$Social_Support,
                              levels = c("Low", "Moderate", "High"), ordered = TRUE)

data$Substance_Use <- factor(data$Substance_Use,
                             levels = c("Never", "Occasionally", "Frequently"), ordered = TRUE)

data$Counseling_Service_Use <- factor(data$Counseling_Service_Use,
                                      levels = c("Never", "Occasionally", "Frequently"), ordered = TRUE)

data$Financial_Stress <- factor(data$Financial_Stress, levels = 0:5, ordered = TRUE)

data$Extracurricular_Involvement <- factor(data$Extracurricular_Involvement,
                                           levels = c("Low", "Moderate", "High"), ordered = TRUE)

Eksplorasi Data (EDA)

Distribusi Target: Pie chart digunakan untuk menunjukkan proporsi masing-masing kategori tingkat stres pada variabel Stress_Level.
Visualisasi Variabel Numerik: Histogram dan boxplot digunakan untuk mengeksplorasi distribusi data numerik serta mengidentifikasi keberadaan outlier.
Korelasi Ordinal: Analisis korelasi Spearman diterapkan untuk mengukur hubungan antar variabel ordinal, disertai visualisasi dalam bentuk correlation matrix (peta korelasi).
Visualisasi Relasi: Stacked bar chart dan jitter plot digunakan untuk menggambarkan hubungan antara variabel prediktor dengan Stress_Level, sehingga pola dan kecenderungan antar kelas dapat diamati secara visual.

#EDA

#Distribusi Target (Stress_Level)
library(ggplot2)
library(dplyr)
library(RColorBrewer)

# Hitung frekuensi dan proporsi
df_pie <- data %>%
  count(Stress_Level) %>%
  mutate(prop = n / sum(n),
         label = scales::percent(prop))

# Buat pie chart dengan ggplot2
ggplot(df_pie, aes(x = "", y = prop, fill = Stress_Level)) +
  geom_col(color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Distribusi Tingkat Stres Mahasiswa") +
  theme_void() +
  theme(legend.title = element_blank()) +
  scale_fill_brewer(palette = "Set2")

#Distribusi dan Outlier Variabel Numerik
# Konversi variabel ordinal jadi numerik untuk keperluan visualisasi
data_numeric <- data %>%
  mutate(across(c(Depression_Score, Anxiety_Score, Financial_Stress), as.numeric))

# Buat list semua variabel numerik
num_vars <- names(select_if(data, is.numeric))

# Histogram dan boxplot untuk tiap variabel
library(dplyr)

for (var in num_vars) {
  # Buat bins manual dengan cut
  data_binned <- data_numeric %>%
    mutate(bin = cut(.data[[var]], breaks = seq(floor(min(data_numeric[[var]], na.rm=TRUE)),
                                                ceiling(max(data_numeric[[var]], na.rm=TRUE)),
                                                by = 0.2), right = FALSE)) %>%
    group_by(bin) %>%
    summarise(count = n()) %>%
    filter(!is.na(bin))
  
  # Buat plot batang yang rapat
  print(
    ggplot(data_binned, aes(x = bin, y = count)) +
      geom_bar(stat = "identity", fill = "skyblue", color = "black", width = 1) +
      labs(title = paste("Distribusi", var), x = var, y = "Count") +
      theme_minimal() +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
  )
  
  # Boxplot tetap sama
  print(
    ggplot(data_numeric, aes_string(y = var)) +
      geom_boxplot(fill = "salmon") +
      labs(title = paste("Boxplot", var)) +
      theme_minimal()
  )
}

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Buat list semua variabel numerik
num_vars <- names(select_if(data, is.numeric))

# Fungsi deteksi outlier metode IQR
detect_outliers_iqr <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val
  outlier_idx <- which(x < lower_bound | x > upper_bound)
  return(list(index = outlier_idx, values = x[outlier_idx]))
}

# Deteksi outlier untuk setiap variabel numerik
for (var in num_vars) {
  outlier_result <- detect_outliers_iqr(data_numeric[[var]])
  cat("\n==== Variabel:", var, "====\n")
  cat("Jumlah outlier:", length(outlier_result$index), "\n")
  cat("Nilai-nilai outlier:\n")
  print(outlier_result$values)
}

## 
## ==== Variabel: Age ====
## Jumlah outlier: 131 
## Nilai-nilai outlier:
##   [1] 35 35 35 34 33 33 35 35 33 33 33 35 35 33 35 33 34 35 33 33 33 35 35 34 35
##  [26] 34 34 34 33 35 33 33 34 34 35 33 33 34 34 34 34 33 34 34 35 34 35 35 33 35
##  [51] 34 33 33 34 33 33 34 33 34 34 35 35 35 35 33 33 33 33 34 33 33 35 33 35 33
##  [76] 34 34 34 34 34 33 34 34 35 33 33 33 33 35 35 34 34 34 33 34 34 33 34 33 33
## [101] 33 35 33 34 34 35 35 34 34 33 33 34 33 33 33 34 34 35 33 34 33 34 33 34 34
## [126] 33 33 33 33 33 33
## 
## ==== Variabel: CGPA ====
## Jumlah outlier: 16 
## Nilai-nilai outlier:
##  [1] 2.44 2.64 2.65 2.60 2.58 2.52 2.61 2.52 2.66 2.65 2.63 2.65 2.49 2.67 2.64
## [16] 2.60
## 
## ==== Variabel: Semester_Credit_Load ====
## Jumlah outlier: 0 
## Nilai-nilai outlier:
## integer(0)

#Penanganan outlier
# Hitung quantile batas bawah dan atas
qnt <- quantile(data$CGPA, probs = c(0.05, 0.95), na.rm = TRUE)

# Buat kolom winsorized: batas bawah pakai pmax, batas atas pakai pmin
data$CGPA_winsor <- pmin(pmax(data$CGPA, qnt[1]), qnt[2])

summary(data$CGPA)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.440   3.290   3.500   3.491   3.700   4.000

summary(data$CGPA_winsor)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.010   3.290   3.500   3.497   3.700   3.990

boxplot(data$CGPA, main = "Sebelum Winsorize")

boxplot(data$CGPA_winsor, main = "Sesudah Winsorize")

# Hitung quantile batas bawah dan atas
qnt <- quantile(data$Age, probs = c(0.05, 0.95), na.rm = TRUE)

# Buat kolom winsorized: batas bawah pakai pmax, batas atas pakai pmin
data$Age_winsor <- pmin(pmax(data$Age, qnt[1]), qnt[2])

summary(data$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      18      20      22      23      25      35

summary(data$Age_winsor)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   20.00   22.00   22.94   25.00   31.00

boxplot(data$Age, main = "Sebelum Winsorize")

boxplot(data$Age_winsor, main = "Sesudah Winsorize")

#menghapus kolom karenaa sudah digantikan
library(dplyr)
data <- dplyr::select(data, -CGPA, -Age)





library(tidyverse)

# Cek Korelasi antar Variabel Ordinal
# Pilih variabel ordinal
ordinal_vars <- dplyr::select(data,
                              Stress_Level, Depression_Score, Anxiety_Score, Sleep_Quality,
                              Physical_Activity, Diet_Quality, Social_Support, Substance_Use,
                              Counseling_Service_Use, Financial_Stress, Extracurricular_Involvement)


# Ubah ke numerik (tanpa mengubah urutan faktor)
ordinal_numeric <- ordinal_vars %>%
  mutate(across(everything(), as.numeric))

# Hitung korelasi Spearman
cor_matrix <- cor(ordinal_numeric, method = "spearman")

# Visualisasi korelasi
library(corrplot)

## corrplot 0.95 loaded

corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.7)

library(ggcorrplot)
cor_matrix <- cor(ordinal_numeric, use = "complete.obs", method = "spearman")
ggcorrplot(cor_matrix, hc.order = TRUE, type = "lower",
           lab = TRUE, lab_size = 3, 
           colors = c("red", "white", "blue"))

#Visualisasi hubungan target dengan prediktor 
#variabel kategorik
kategorik_vars <- c("Gender", "Course", "Relationship_Status", "Family_History",
                    "Chronic_Illness", "Residence_Type")

# Loop visualisasi stacked bar
for (var in kategorik_vars) {
  p <- ggplot(data, aes_string(x = "Stress_Level", fill = var)) +
    geom_bar(position = "fill") +
    labs(title = paste("Proporsi", var, "pada Tiap Tingkat Stres"),
         y = "Proporsi", x = "Tingkat Stres") +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  print(p)
}

library(ggplot2)

# Boxplot CGPA_winsor vs Stress_Level
ggplot(data, aes(x = Stress_Level, y = CGPA_winsor)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Distribusi CGPA berdasarkan Tingkat Stres",
       x = "Tingkat Stres", y = "CGPA (Winsorized)") +
  theme_minimal()

# Boxplot Age_winsor vs Stress_Level
ggplot(data, aes(x = Stress_Level, y = Age_winsor)) +
  geom_boxplot(fill = "salmon") +
  labs(title = "Distribusi Usia berdasarkan Tingkat Stres",
       x = "Tingkat Stres", y = "Usia (Winsorized)") +
  theme_minimal()

# Boxplot Age_winsor vs Stress_Level
ggplot(data, aes(x = Stress_Level, y = Semester_Credit_Load)) +
  geom_boxplot(fill = "green") +
  labs(title = "Distribusi Beban credit semester berdasarkan Tingkat Stres",
       x = "Tingkat Stres", y = "Beban credit semester") +
  theme_minimal()

# Daftar prediktor ordinal
ordinal_vars <- c("Depression_Score", "Anxiety_Score", "Sleep_Quality",
                  "Physical_Activity", "Diet_Quality", "Social_Support",
                  "Substance_Use", "Counseling_Service_Use",
                  "Financial_Stress", "Extracurricular_Involvement")

# Loop jitter plot
for (var in ordinal_vars) {
  p <- ggplot(data, aes_string(x = "Stress_Level", y = var)) +
    geom_jitter(width = 0.2, height = 0.2, alpha = 0.6, color = "darkblue") +
    labs(title = paste("Hubungan", var, "dengan Tingkat Stres"),
         x = "Tingkat Stres", y = var) +
    theme_minimal()
  
  print(p)
}

Uji Pemilihan Variabel

Uji pemilihan variabel dilakukan untuk menentukan prediktor yang signifikan terhadap variabel target Stress_Level. Adapun jenis uji yang digunakan disesuaikan dengan tipe data masing-masing variabel:

Uji Chi-Square: Digunakan untuk variabel nominal (misalnya: Gender, Course). Variabel dipilih jika memiliki p-value < 0.05.
Uji Korelasi Spearman: Digunakan untuk variabel ordinal (misalnya: Depression_Score, Anxiety_Score). Variabel dipertahankan jika p-value < 0.05.
Uji Kruskal-Wallis: Digunakan untuk variabel numerik (CGPA_winsor, Age_winsor) untuk menguji perbedaan antar kelompok Stress_Level. Variabel dipilih jika p-value < 0.05.

Hasil dari ketiga jenis uji ini digunakan sebagai dasar untuk menyaring variabel-variabel yang akan dimasukkan ke dalam pemodelan lebih lanjut. a p-value < 0.05. Hasil dari ketiga uji ini digunakan untuk menyaring variabel yang masuk ke dalam model.

# Load library
library(dplyr)
library(ggpubr)

# 1. UJI CHI-SQUARE untuk variabel kategorik nominal
cat_vars <- c("Gender", "Course", "Relationship_Status", "Family_History",
              "Chronic_Illness", "Residence_Type")

for (var in cat_vars) {
  tab <- table(data[[var]], data$Stress_Level)
  test <- chisq.test(tab)
  cat("==== Variabel:", var, "====\n")
  print(test)
  cat("\n")
}

## ==== Variabel: Gender ====
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 0.56721, df = 2, p-value = 0.7531
## 
## 
## ==== Variabel: Course ====
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 736.63, df = 10, p-value < 2.2e-16
## 
## 
## ==== Variabel: Relationship_Status ====
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 0.84678, df = 4, p-value = 0.9321
## 
## 
## ==== Variabel: Family_History ====
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 3.0644, df = 2, p-value = 0.2161
## 
## 
## ==== Variabel: Chronic_Illness ====
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 3.0551, df = 2, p-value = 0.2171
## 
## 
## ==== Variabel: Residence_Type ====
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 3.4645, df = 4, p-value = 0.4833

# 2. KORELASI SPEARMAN untuk variabel ordinal
ordinal_vars <- c("Depression_Score", "Anxiety_Score", "Sleep_Quality",
                  "Physical_Activity", "Diet_Quality", "Social_Support",
                  "Substance_Use", "Counseling_Service_Use",
                  "Financial_Stress", "Extracurricular_Involvement")

# Konversi Stress_Level dan ordinal lain ke numerik
data_numeric <- data %>%
  mutate(across(all_of(c("Stress_Level", ordinal_vars)), as.numeric))

# Hitung korelasi Spearman satu per satu
for (var in ordinal_vars) {
  cor_test <- cor.test(data_numeric[[var]], data_numeric$Stress_Level, method = "spearman")
  cat("==== Korelasi dengan Stress_Level:", var, "====\n")
  print(cor_test)
  cat("\n")
}

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Depression_Score ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.9883e+10, p-value = 0.001577
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.03770447

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Anxiety_Score ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.9514e+10, p-value = 0.008715
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.03129995

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Sleep_Quality ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7308e+10, p-value = 0.5617
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## 0.006926724

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Physical_Activity ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7398e+10, p-value = 0.6535
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## 0.005358033

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Diet_Quality ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.782e+10, p-value = 0.8701
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##          rho 
## -0.001951299

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Social_Support ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7776e+10, p-value = 0.9202
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##          rho 
## -0.001195238

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Substance_Use ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7501e+10, p-value = 0.765
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## 0.003567904

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Counseling_Service_Use ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.6713e+10, p-value = 0.1487
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.01723695

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Financial_Stress ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7605e+10, p-value = 0.8823
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## 0.001767522

## Warning in cor.test.default(data_numeric[[var]], data_numeric$Stress_Level, :
## Cannot compute exact p-value with ties

## ==== Korelasi dengan Stress_Level: Extracurricular_Involvement ====
## 
##  Spearman's rank correlation rho
## 
## data:  data_numeric[[var]] and data_numeric$Stress_Level
## S = 5.7591e+10, p-value = 0.8659
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## 0.002014835

# 3. UJI KRUSKAL-WALLIS untuk variabel numerik
kruskal_vars <- c("CGPA_winsor", "Age_winsor", "Semester_Credit_Load")

for (var in kruskal_vars) {
  test <- kruskal.test(data[[var]] ~ data$Stress_Level)
  cat("==== Kruskal-Wallis untuk:", var, "====\n")
  print(test)
  cat("\n")
}

## ==== Kruskal-Wallis untuk: CGPA_winsor ====
## 
##  Kruskal-Wallis rank sum test
## 
## data:  data[[var]] by data$Stress_Level
## Kruskal-Wallis chi-squared = 0.22562, df = 2, p-value = 0.8933
## 
## 
## ==== Kruskal-Wallis untuk: Age_winsor ====
## 
##  Kruskal-Wallis rank sum test
## 
## data:  data[[var]] by data$Stress_Level
## Kruskal-Wallis chi-squared = 5.221, df = 2, p-value = 0.0735
## 
## 
## ==== Kruskal-Wallis untuk: Semester_Credit_Load ====
## 
##  Kruskal-Wallis rank sum test
## 
## data:  data[[var]] by data$Stress_Level
## Kruskal-Wallis chi-squared = 0.56382, df = 2, p-value = 0.7543

selected_vars <- c("Course","Depression_Score", "Anxiety_Score", "Age_winsor", 
                   "Counseling_Service_Use")

Uji Asumsi Analisis Diskriminan

Sebelum membangun model Linear Discriminant Analysis (LDA), dilakukan pengujian terhadap beberapa asumsi penting:

Uji Normalitas Multivariat (Mardia Test): Digunakan untuk memastikan bahwa data multivariat mengikuti distribusi normal. Asumsi ini penting agar hasil estimasi LDA menjadi valid dan efisien.
Uji Homogenitas Kovarian (Box’s M Test): Digunakan untuk menguji apakah matriks kovariansi antar kelompok Stress_Level bersifat homogen. Homogenitas ini merupakan syarat utama dalam penerapan LDA.

Kedua uji ini digunakan sebagai dasar untuk menilai kelayakan penerapan model LDA terhadap dataset.

#1. Uji Asumsi Analisis Diskriminan
#a. Uji Normalitas Multivariat (Mardia)
library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

# Pastikan data kamu hanya numerik
lda_data <- data %>%
  select(all_of(selected_vars)) %>%
  mutate(across(everything(), as.numeric))
mardia_result <- mardia(lda_data)

print(mardia_result)

## Call: mardia(x = lda_data)
## 
## Mardia tests of multivariate skew and kurtosis
## Use describe(x) the to get univariate tests
## n.obs = 7022   num.vars =  5 
## b1p =  1.6   skew =  1875.39  with probability  <=  0
##  small sample skew =  1876.46  with probability <=  0
## b2p =  31.04   kurtosis =  -19.85  with probability <=  0

#b. Uji Homogenitas Kovarian (Box’s M Test)

library(biotools)

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## ---
## biotools version 4.3

# Gabungkan variabel prediktor numerik dan label
lda_data_grouped <- data.frame(Stress_Level = data$Stress_Level, lda_data)

# Jalankan uji Box's M
boxM_result <- boxM(lda_data_grouped[, -1], grouping = lda_data_grouped$Stress_Level)
print(boxM_result)

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  lda_data_grouped[, -1]
## Chi-Sq (approx.) = 74.02, df = 30, p-value = 1.378e-05

MODEL

a. Model LDA (Linear Discriminant Analysis)

Model dibangun menggunakan variabel-variabel yang telah diseleksi melalui uji signifikansi.
Hasil model divisualisasikan menggunakan plot antara komponen Linear Discriminant pertama (LD1) dan kedua (LD2), untuk melihat pemisahan antar kelas Stress_Level.

b. Model Regresi Logistik Ordinal (polr)

Model dibentuk menggunakan fungsi polr() dari paket MASS, dengan Stress_Level sebagai variabel target.
Evaluasi dilakukan melalui:
- Uji parsial: Menggunakan nilai z-value dan p-value untuk masing-masing variabel prediktor.
- Uji serentak: Menggunakan Likelihood Ratio Test untuk menguji signifikansi model secara keseluruhan.

Evaluasi Model

Confusion Matrix digunakan untuk melihat distribusi antara hasil prediksi dan nilai aktual dari kelas Stress_Level.
Akurasi model dihitung sebagai proporsi jumlah prediksi yang benar terhadap total observasi.
Untuk model regresi logistik ordinal, dilakukan juga interpretasi terhadap odds ratio guna memahami seberapa besar pengaruh masing-masing variabel prediktor terhadap kemungkinan peningkatan tingkat stres.

#MODEL LDA
#1. persiapan data
library(dplyr)
library(tidyverse)
# Pastikan selected_vars sudah sesuai
selected_vars <- c("Course","Depression_Score", "Anxiety_Score", "Age_winsor", 
                   "Counseling_Service_Use")


lda_data <- dplyr::select(data, Stress_Level, dplyr::all_of(selected_vars)) %>%
  dplyr::mutate(dplyr::across(-Stress_Level, as.numeric))


#2. model lda
library(MASS)
lda_model <- lda(Stress_Level ~ ., data = lda_data)
print(lda_model)

## Call:
## lda(Stress_Level ~ ., data = lda_data)
## 
## Prior probabilities of groups:
## Stres Ringan Stres Sedang  Stres Berat 
##    0.3319567    0.3822273    0.2858160 
## 
## Group means:
##                Course Depression_Score Anxiety_Score Age_winsor
## Stres Ringan 3.475761         3.325611      3.364221   23.02960
## Stres Sedang 3.682936         3.263040      3.292474   22.94560
## Stres Berat  4.163926         3.160438      3.237170   22.81764
##              Counseling_Service_Use
## Stres Ringan               1.469755
## Stres Sedang               1.497765
## Stres Berat                1.501246
## 
## Coefficients of linear discriminants:
##                                LD1         LD2
## Course                  0.64529526  0.14695283
## Depression_Score       -0.05388436  0.13899196
## Anxiety_Score          -0.11922127  0.33482563
## Age_winsor             -0.03631326  0.03142163
## Counseling_Service_Use  0.11926662 -1.16128465
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9924 0.0076

#3. Prediksi dan Evaluasi Akurasi
# Lakukan prediksi
lda_pred <- predict(lda_model)

# Confusion Matrix
table(Predicted = lda_pred$class, Actual = lda_data$Stress_Level)

##               Actual
## Predicted      Stres Ringan Stres Sedang Stres Berat
##   Stres Ringan          529          518         237
##   Stres Sedang         1424         1729        1372
##   Stres Berat           378          437         398

# Hitung akurasi
# Konversi Stress_Level menjadi faktor biasa
lda_data$Stress_Level <- factor(as.character(lda_data$Stress_Level))

# Hitung akurasi
mean(lda_pred$class == lda_data$Stress_Level)

## [1] 0.3782398

#4. Visualisasi Plot fungsi diskriminan
library(ggplot2)

# Ekstrak skor LD dari prediksi
lda_values <- as.data.frame(lda_pred$x)
lda_values$Class <- lda_data$Stress_Level

# Plot LDA (LD1 vs LD2)
ggplot(lda_values, aes(x = LD1, y = LD2, color = Class)) +
  geom_point(alpha = 0.5) +
  stat_ellipse(level = 0.95, linetype = 2) +  # menambahkan ellipse
  labs(title = "Visualisasi Fungsi Diskriminan (LD1 vs LD2)",
       x = "LD1 (Fungsi Diskriminan 1)",
       y = "LD2 (Fungsi Diskriminan 2)") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

#MODEL Regresi Logistik Ordinal

#1. Pembentukan Model (Estimasi Parameter)
# Pastikan target adalah ordered factor
data$Stress_Level <- factor(data$Stress_Level, 
                            levels = c("Stres Ringan", "Stres Sedang", "Stres Berat"), 
                            ordered = TRUE)

# Konversi prediktor ke numerik jika belum
for (var in selected_vars) {
  data[[var]] <- as.numeric(data[[var]])
}

# Load library
library(MASS)

# Bangun model regresi logistik ordinal
model_polr <- polr(Stress_Level ~ Course + Depression_Score + Anxiety_Score + Age_winsor + 
                     Counseling_Service_Use,
                   data = data, Hess = TRUE)


# Tampilkan ringkasan model
summary(model_polr)

## Call:
## polr(formula = Stress_Level ~ Course + Depression_Score + Anxiety_Score + 
##     Age_winsor + Counseling_Service_Use, data = data, Hess = TRUE)
## 
## Coefficients:
##                           Value Std. Error t value
## Course                  0.21412   0.014937  14.334
## Depression_Score       -0.01823   0.013759  -1.325
## Anxiety_Score          -0.04127   0.013611  -3.032
## Age_winsor             -0.01264   0.006023  -2.099
## Counseling_Service_Use  0.04839   0.033189   1.458
## 
## Intercepts:
##                           Value   Std. Error t value
## Stres Ringan|Stres Sedang -0.3344  0.1713    -1.9522
## Stres Sedang|Stres Berat   1.3247  0.1721     7.6989
## 
## Residual Deviance: 15097.17 
## AIC: 15111.17

# 2. Uji Signifikansi Variabel
#a. Uji Parsial (Z-value & P-value)
(ctable <- coef(summary(model_polr)))

##                                 Value Std. Error   t value
## Course                     0.21411583 0.01493746 14.334157
## Depression_Score          -0.01823004 0.01375867 -1.324985
## Anxiety_Score             -0.04126707 0.01361146 -3.031789
## Age_winsor                -0.01264262 0.00602262 -2.099190
## Counseling_Service_Use     0.04839441 0.03318941  1.458128
## Stres Ringan|Stres Sedang -0.33440519 0.17130022 -1.952159
## Stres Sedang|Stres Berat   1.32472610 0.17206770  7.698866

p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))

##                                 Value Std. Error   t value      p value
## Course                     0.21411583 0.01493746 14.334157 1.338620e-46
## Depression_Score          -0.01823004 0.01375867 -1.324985 1.851760e-01
## Anxiety_Score             -0.04126707 0.01361146 -3.031789 2.431094e-03
## Age_winsor                -0.01264262 0.00602262 -2.099190 3.580019e-02
## Counseling_Service_Use     0.04839441 0.03318941  1.458128 1.448053e-01
## Stres Ringan|Stres Sedang -0.33440519 0.17130022 -1.952159 5.091938e-02
## Stres Sedang|Stres Berat   1.32472610 0.17206770  7.698866 1.372794e-14

# b. Uji Serentak (Likelihood Ratio Test)
null_model <- polr(Stress_Level ~ 1, data = data, Hess = TRUE)
anova(null_model, model_polr)

## Likelihood ratio tests of ordinal regression models
## 
## Response: Stress_Level
##                                                                             Model
## 1                                                                               1
## 2 Course + Depression_Score + Anxiety_Score + Age_winsor + Counseling_Service_Use
##   Resid. df Resid. Dev   Test    Df LR stat. Pr(Chi)
## 1      7020   15330.81                              
## 2      7015   15097.17 1 vs 2     5 233.6315       0

#3. Evaluasi Akurasi Model
# Prediksi kelas
pred <- predict(model_polr, data)

# Confusion matrix
table(Predicted = pred, Actual = data$Stress_Level)

##               Actual
## Predicted      Stres Ringan Stres Sedang Stres Berat
##   Stres Ringan          757          751         341
##   Stres Sedang         1383         1715        1550
##   Stres Berat           191          218         116

# Akurasi
# Konversi target dari ordered factor → factor biasa
data$Stress_Level <- factor(as.character(data$Stress_Level))

# Lakukan evaluasi akurasi
mean(pred == data$Stress_Level)

## [1] 0.368556

#4. Interpretasi (Odd Ratio)
exp(coef(model_polr))

##                 Course       Depression_Score          Anxiety_Score 
##              1.2387661              0.9819351              0.9595728 
##             Age_winsor Counseling_Service_Use 
##              0.9874370              1.0495845

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.