A. Informasi Dataset

Dataset yang digunakan bersumber dari Kaggle dan berisi informasi gaya hidup dan kesehatan individu, seperti aktivitas fisik, tekanan darah, serta gangguan tidur. Tujuan analisis ini adalah mengklasifikasi jenis gangguan tidur berdasarkan faktor gaya hidup.

Load data

library(DT)
## Warning: package 'DT' was built under R version 4.4.3
df <- read.csv("Sleep_health_and_lifestyle_dataset.csv")
datatable(df)

Statistik deskriptif

library(psych)
## Warning: package 'psych' was built under R version 4.4.3
describe(df)
##                         vars   n    mean      sd median trimmed     mad    min
## Person.ID                  1 374  187.50  108.11  187.5  187.50  138.62    1.0
## Gender*                    2 374    1.51    0.50    2.0    1.51    0.00    1.0
## Age                        3 374   42.18    8.67   43.0   41.84   10.38   27.0
## Occupation*                4 374    4.77    3.06    4.0    4.47    2.97    1.0
## Sleep.Duration             5 374    7.13    0.80    7.2    7.12    1.04    5.8
## Quality.of.Sleep           6 374    7.31    1.20    7.0    7.32    1.48    4.0
## Physical.Activity.Level    7 374   59.17   20.83   60.0   58.97   22.24   30.0
## Stress.Level               8 374    5.39    1.77    5.0    5.36    2.97    3.0
## BMI.Category*              9 374    2.30    1.43    1.0    2.25    0.00    1.0
## Blood.Pressure*           10 374   14.11    7.10   16.0   14.48    8.90    1.0
## Heart.Rate                11 374   70.17    4.14   70.0   69.74    2.97   65.0
## Daily.Steps               12 374 6816.84 1617.92 7000.0 6732.67 1482.60 3000.0
## Sleep.Disorder*           13 374    2.00    0.64    2.0    2.00    0.00    1.0
##                             max  range  skew kurtosis    se
## Person.ID                 374.0  373.0  0.00    -1.21  5.59
## Gender*                     2.0    1.0 -0.02    -2.00  0.03
## Age                        59.0   32.0  0.26    -0.92  0.45
## Occupation*                11.0   10.0  0.74    -0.53  0.16
## Sleep.Duration              8.5    2.7  0.04    -1.29  0.04
## Quality.of.Sleep            9.0    5.0 -0.21    -0.77  0.06
## Physical.Activity.Level    90.0   60.0  0.07    -1.27  1.08
## Stress.Level                8.0    5.0  0.15    -1.33  0.09
## BMI.Category*               4.0    3.0  0.28    -1.85  0.07
## Blood.Pressure*            25.0   24.0 -0.18    -0.96  0.37
## Heart.Rate                 86.0   21.0  1.22     2.21  0.21
## Daily.Steps             10000.0 7000.0  0.18    -0.42 83.66
## Sleep.Disorder*             3.0    2.0  0.00    -0.60  0.03

Nilai unik pada kolom kategorikal

cat_cols <- sapply(df, function(x) is.factor(x) || is.character(x))

for (col_name in names(df)[cat_cols]) {
  cat("Kolom:", col_name, "\n")
  print(table(df[[col_name]]))
  cat("\n---------------------\n\n")
}
## Kolom: Gender 
## 
## Female   Male 
##    185    189 
## 
## ---------------------
## 
## Kolom: Occupation 
## 
##           Accountant               Doctor             Engineer 
##                   37                   71                   63 
##               Lawyer              Manager                Nurse 
##                   47                    1                   73 
## Sales Representative          Salesperson            Scientist 
##                    2                   32                    4 
##    Software Engineer              Teacher 
##                    4                   40 
## 
## ---------------------
## 
## Kolom: BMI.Category 
## 
##        Normal Normal Weight         Obese    Overweight 
##           195            21            10           148 
## 
## ---------------------
## 
## Kolom: Blood.Pressure 
## 
## 115/75 115/78 117/76 118/75 118/76 119/77 120/80 121/79 122/80 125/80 125/82 
##     32      2      2      2      1      2     45      1      1     65      4 
## 126/83 128/84 128/85 129/84 130/85 130/86 131/86 132/87 135/88 135/90 139/91 
##      2      2      3      2     99      2      2      3      2     27      2 
## 140/90 140/95 142/92 
##      4     65      2 
## 
## ---------------------
## 
## Kolom: Sleep.Disorder 
## 
##    Insomnia        None Sleep Apnea 
##          77         219          78 
## 
## ---------------------

Cek missing value

missing_counts <- colSums(is.na(df))
print(missing_counts)
##               Person.ID                  Gender                     Age 
##                       0                       0                       0 
##              Occupation          Sleep.Duration        Quality.of.Sleep 
##                       0                       0                       0 
## Physical.Activity.Level            Stress.Level            BMI.Category 
##                       0                       0                       0 
##          Blood.Pressure              Heart.Rate             Daily.Steps 
##                       0                       0                       0 
##          Sleep.Disorder 
##                       0

B. Preprocessing Dataset

1. Hapus kolom yang tidak diperlukan (Person ID)

df$Person.ID <- NULL

2. Menangani data inkonsisten

Pada kolom BMI Category terdapat nilai yang tidak konsisten, yaitu Normal dan Normal Weight. Nilai Normal Weight diubah menjadi Normal agar konsisten.

df$BMI.Category <- ifelse(df$BMI.Category == "Normal Weight", "Normal", df$BMI.Category)

3. Memisahkan kolom Blood Pressure menjadi BP_Systolic dan BP_Diastolic

Tipe data pada kolom Blood Pressure merupakan kategorikal yang dimana berupa angka untuk merepresentasikan tekanan darah dalam sistolik dan diastolik. Pada data baris pertama yaitu 126/83, dimana nilai 126 adalah nilai tekanan darah sistolik dan nilai 83 adalah nilai tekanan darah diastolik. Nilai ini kemudian akan dipecah menjadi kolom BP_Systolic dan BP_Diastolic agar menjadi tipe data numerik.

bp_split <- strsplit(as.character(df$Blood.Pressure), "/")
df$BP_Systolic <- as.numeric(sapply(bp_split, `[`, 1))
df$BP_Diastolic <- as.numeric(sapply(bp_split, `[`, 2))

df$Blood.Pressure <- NULL

sleep_col <- df$Sleep.Disorder          
df$Sleep.Disorder <- NULL                
df$Sleep.Disorder <- sleep_col

4. Cek dan Handling Outlier

numeric_cols <- c("Age", "Sleep.Duration", "Quality.of.Sleep", "Physical.Activity.Level", "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic", "BP_Diastolic")

par(mfrow = c(2, ceiling(length(numeric_cols) / 2)))

for (col in numeric_cols) {
  boxplot(df[[col]], main = col, col = "skyblue", horizontal = TRUE)
}

par(mfrow = c(1,1))

detect_outliers <- function(x) {
  z <- scale(x)
  which(abs(z) > 3)
}

outlier_index <- list()
for (var in numeric_cols) {
  outlier_index[[var]] <- detect_outliers(df[[var]])
  cat(var, ":", length(outlier_index[[var]]), "outlier\n")
}
## Age : 0 outlier
## Sleep.Duration : 0 outlier
## Quality.of.Sleep : 0 outlier
## Physical.Activity.Level : 0 outlier
## Stress.Level : 0 outlier
## Heart.Rate : 9 outlier
## Daily.Steps : 0 outlier
## BP_Systolic : 0 outlier
## BP_Diastolic : 0 outlier

Outlier diimputasi menggunakan nilai batas bawah atau batas atas

replace_outlier_with_bound <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  
  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val
  
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}

numeric_cols <- c("Age", "Sleep.Duration", "Quality.of.Sleep", "Physical.Activity.Level", "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic", "BP_Diastolic")
df[numeric_cols] <- lapply(df[numeric_cols], replace_outlier_with_bound)
detect_outliers <- function(x) {
  z <- scale(x)
  which(abs(z) > 3)
}

outlier_index <- list()
for (var in numeric_cols) {
  outlier_index[[var]] <- detect_outliers(df[[var]])
  cat(var, ":", length(outlier_index[[var]]), "outlier\n")
}
## Age : 0 outlier
## Sleep.Duration : 0 outlier
## Quality.of.Sleep : 0 outlier
## Physical.Activity.Level : 0 outlier
## Stress.Level : 0 outlier
## Heart.Rate : 0 outlier
## Daily.Steps : 0 outlier
## BP_Systolic : 0 outlier
## BP_Diastolic : 0 outlier

5. One Hot Encoding pada Kolom Kategorikal

Beberapa kolom kategorikal perlu diubah menjadi representasi numerik agar bisa digunakan dalam model. One Hot Encoding dilakukan pada kolom Gender, Occupation, dan BMI Category, dengan baseline yang dijadikan referensi sebagai berikut:

  • Gender:Female
  • Occupation: Software Engineer
  • BMI Category: Normal
df$Gender <- factor(df$Gender, levels = c("Female", "Male"))
df$Occupation <- factor(df$Occupation, levels = c("Software Engineer", sort(unique(df$Occupation))[sort(unique(df$Occupation)) != "Software Engineer"]))
df$BMI.Category <- factor(df$BMI.Category, levels = c("Normal", "Overweight", "Obese"))

gender_dummies <- model.matrix(~ Gender, data = df)[, -1, drop = FALSE]
occupation_dummies <- model.matrix(~ Occupation, data = df)[, -1, drop = FALSE]
bmi_dummies <- model.matrix(~ BMI.Category, data = df)[, -1, drop = FALSE]

colnames(gender_dummies) <- gsub("Gender", "Gender_", colnames(gender_dummies))
colnames(occupation_dummies) <- gsub("Occupation", "Occupation_", colnames(occupation_dummies))
colnames(bmi_dummies) <- gsub("BMI.Category", "BMI.Category_", colnames(bmi_dummies))

colnames(gender_dummies) <- gsub("[() ]", "", colnames(gender_dummies))
colnames(occupation_dummies) <- gsub("[() ]", "", colnames(occupation_dummies))
colnames(bmi_dummies) <- gsub("[() ]", "", colnames(bmi_dummies))

df$Gender <- NULL
df$Occupation <- NULL
df$BMI.Category <- NULL

df <- cbind(df, gender_dummies, occupation_dummies, bmi_dummies)
# atur urutan kolom
new_order <- c(
  "Age",
  "Gender_Male",
  grep("^Occupation_", names(df), value = TRUE),
  grep("^BMI.Category_", names(df), value = TRUE),
  "Sleep.Duration",
  "Quality.of.Sleep",
  "Physical.Activity.Level",
  "Stress.Level",
  "Heart.Rate",
  "Daily.Steps",
  "BP_Systolic",
  "BP_Diastolic",
  "Sleep.Disorder"
)

cols_existing <- names(df)
cols_to_keep <- intersect(new_order, cols_existing)
cols_leftover <- setdiff(cols_existing, cols_to_keep)

df <- df[, c(cols_to_keep, cols_leftover)]

6. Unordered Factorization pada label Sleep Disorder

Label kelas Sleep Disorder dikonversi dari bentuk string ke dalam tipe data faktor (unordered) agar algoritma klasifikasi seperti Analisis Diskriminan Linear (LDA) dan Regresi Multinomial dapat mengenali bahwa target merupakan kelas diskrit (nominal), bukan data numerik berurutan (ordinal).

df$Sleep.Disorder <- factor(df$Sleep.Disorder, levels = c("None", "Sleep Apnea", "Insomnia"))

C. Exploratory Data Analysis

Distribusi Sleep Disorder

table(df$Sleep.Disorder)
## 
##        None Sleep Apnea    Insomnia 
##         219          78          77
barplot(table(df$Sleep.Disorder), col = "steelblue", main = "Distribusi Sleep Disorder")

Korelasi antar variabel numerik

library(corrplot)
## corrplot 0.95 loaded
exclude_prefixes <- c("Gender_", "Occupation_", "BMI.Category_")
numeric_df <- df[sapply(df, is.numeric)]
numeric_df <- numeric_df[ , !grepl(paste0("^(", paste(exclude_prefixes, collapse = "|"), ")"), names(numeric_df))]

corrplot::corrplot(cor(numeric_df), method = "color", tl.cex = 1, addCoef.col = "black", number.cex = 0.7)

Berdasarkan hasil visualisasi korelasi, ditemukan bahwa BP_Systolic dan BP_Diastolic memiliki korelasi tinggi (>0.85). Untuk menghindari masalah multikolinearitas, variabel BP_Diastolic dihapus dari dataset. Begitu juga dengan Sleep.Duration dan Quality.of.Sleep yang berkorelasi tinggi, variabel Quality.of.Sleep dihapus dari dataset.

df$BP_Diastolic <- NULL
df$Quality.of.Sleep <- NULL

Visualisasi Biplot

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
numeric_cols <- c("Age", "Sleep.Duration", "Physical.Activity.Level", 
                  "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic")

data_numeric <- df[, numeric_cols]
dist_matrix <- dist(scale(data_numeric))  

mds_result <- cmdscale(dist_matrix, k = 2)

mds_df <- data.frame(Dim1 = mds_result[,1],
                     Dim2 = mds_result[,2],
                     Sleep.Disorder = df$Sleep.Disorder)


ggplot(mds_df, aes(x = Dim1, y = Dim2, color = Sleep.Disorder)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "MDS Visualization (Euclidean Distance)",
       x = "Dimension 1", y = "Dimension 2") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Dataset Bersih Setelah Preprocessing

datatable(df) 

Statistik Deskriptif Setelah Preprocessing

library(psych)
describe(df)
##                                vars   n    mean      sd median trimmed     mad
## Age                               1 374   42.18    8.67   43.0   41.84   10.38
## Gender_Male                       2 374    0.51    0.50    1.0    0.51    0.00
## Occupation_Accountant             3 374    0.10    0.30    0.0    0.00    0.00
## Occupation_Doctor                 4 374    0.19    0.39    0.0    0.11    0.00
## Occupation_Engineer               5 374    0.17    0.37    0.0    0.09    0.00
## Occupation_Lawyer                 6 374    0.13    0.33    0.0    0.03    0.00
## Occupation_Manager                7 374    0.00    0.05    0.0    0.00    0.00
## Occupation_Nurse                  8 374    0.20    0.40    0.0    0.12    0.00
## Occupation_SalesRepresentative    9 374    0.01    0.07    0.0    0.00    0.00
## Occupation_Salesperson           10 374    0.09    0.28    0.0    0.00    0.00
## Occupation_Scientist             11 374    0.01    0.10    0.0    0.00    0.00
## Occupation_Teacher               12 374    0.11    0.31    0.0    0.01    0.00
## BMI.Category_Overweight          13 374    0.40    0.49    0.0    0.37    0.00
## BMI.Category_Obese               14 374    0.03    0.16    0.0    0.00    0.00
## Sleep.Duration                   15 374    7.13    0.80    7.2    7.12    1.04
## Physical.Activity.Level          16 374   59.17   20.83   60.0   58.97   22.24
## Stress.Level                     17 374    5.39    1.77    5.0    5.36    2.97
## Heart.Rate                       18 374   69.97    3.57   70.0   69.74    2.97
## Daily.Steps                      19 374 6816.84 1617.92 7000.0 6732.67 1482.60
## BP_Systolic                      20 374  128.55    7.75  130.0  128.78    7.41
## Sleep.Disorder*                  21 374    1.62    0.81    1.0    1.53    0.00
##                                   min     max  range  skew kurtosis    se
## Age                              27.0    59.0   32.0  0.26    -0.92  0.45
## Gender_Male                       0.0     1.0    1.0 -0.02    -2.00  0.03
## Occupation_Accountant             0.0     1.0    1.0  2.68     5.17  0.02
## Occupation_Doctor                 0.0     1.0    1.0  1.58     0.48  0.02
## Occupation_Engineer               0.0     1.0    1.0  1.76     1.12  0.02
## Occupation_Lawyer                 0.0     1.0    1.0  2.25     3.07  0.02
## Occupation_Manager                0.0     1.0    1.0 19.18   367.02  0.00
## Occupation_Nurse                  0.0     1.0    1.0  1.53     0.35  0.02
## Occupation_SalesRepresentative    0.0     1.0    1.0 13.51   181.02  0.00
## Occupation_Salesperson            0.0     1.0    1.0  2.95     6.73  0.01
## Occupation_Scientist              0.0     1.0    1.0  9.48    88.02  0.01
## Occupation_Teacher                0.0     1.0    1.0  2.53     4.43  0.02
## BMI.Category_Overweight           0.0     1.0    1.0  0.42    -1.82  0.03
## BMI.Category_Obese                0.0     1.0    1.0  5.84    32.24  0.01
## Sleep.Duration                    5.8     8.5    2.7  0.04    -1.29  0.04
## Physical.Activity.Level          30.0    90.0   60.0  0.07    -1.27  1.08
## Stress.Level                      3.0     8.0    5.0  0.15    -1.33  0.09
## Heart.Rate                       65.0    78.0   13.0  0.47    -0.39  0.18
## Daily.Steps                    3000.0 10000.0 7000.0  0.18    -0.42 83.66
## BP_Systolic                     115.0   142.0   27.0 -0.04    -0.91  0.40
## Sleep.Disorder*                   1.0     3.0    2.0  0.79    -1.01  0.04

D. Linear Discriminant Analysis (LDA)

1. Uji Asumsi

a. Uji Asumsi Normal Multivariat

if (!require(MVN)) install.packages("MVN")
## Loading required package: MVN
## Warning: package 'MVN' was built under R version 4.4.3
library(MVN)

numeric_cols <- c("Age", "Sleep.Duration", "Physical.Activity.Level", "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic")
mvn(df[numeric_cols], mvnTest = "hz", univariateTest = "AD", multivariatePlot = "qq")

## $multivariateNormality
##            Test       HZ p value MVN
## 1 Henze-Zirkler 21.37807       0  NO
## 
## $univariateNormality
##               Test                Variable Statistic   p value Normality
## 1 Anderson-Darling           Age              3.9372  <0.001      NO    
## 2 Anderson-Darling     Sleep.Duration         7.3040  <0.001      NO    
## 3 Anderson-Darling Physical.Activity.Level   11.3729  <0.001      NO    
## 4 Anderson-Darling      Stress.Level         12.6410  <0.001      NO    
## 5 Anderson-Darling       Heart.Rate           9.1373  <0.001      NO    
## 6 Anderson-Darling       Daily.Steps          8.9070  <0.001      NO    
## 7 Anderson-Darling       BP_Systolic          9.0151  <0.001      NO    
## 
## $Descriptives
##                           n        Mean      Std.Dev Median    Min     Max
## Age                     374   42.184492    8.6731335   43.0   27.0    59.0
## Sleep.Duration          374    7.132086    0.7956567    7.2    5.8     8.5
## Physical.Activity.Level 374   59.171123   20.8308037   60.0   30.0    90.0
## Stress.Level            374    5.385027    1.7745264    5.0    3.0     8.0
## Heart.Rate              374   69.965241    3.5672581   70.0   65.0    78.0
## Daily.Steps             374 6816.844920 1617.9156791 7000.0 3000.0 10000.0
## BP_Systolic             374  128.553476    7.7481176  130.0  115.0   142.0
##                            25th   75th        Skew   Kurtosis
## Age                       35.25   50.0  0.25516254 -0.9248041
## Sleep.Duration             6.40    7.8  0.03725369 -1.2945281
## Physical.Activity.Level   45.00   75.0  0.07389048 -1.2744695
## Stress.Level               4.00    7.0  0.15309385 -1.3345700
## Heart.Rate                68.00   72.0  0.46555442 -0.3878657
## Daily.Steps             5600.00 8000.0  0.17684985 -0.4186421
## BP_Systolic              125.00  135.0 -0.03538326 -0.9088701

Hasil ini menunjukkan bahwa p-value<0.05 maka tolak H0, artinya populasi tidak berdistribusi normal multivariat, sehingga asumsi normal multivariat tidak terpenuhi

b. Uji Asumsi Kehomogenan Varians

library(biotools)
## Warning: package 'biotools' was built under R version 4.4.3
## Loading required package: MASS
## ---
## biotools version 4.3
box_m_result <- boxM(df[, numeric_cols], grouping = df$Sleep.Disorder)

print(box_m_result)
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  df[, numeric_cols]
## Chi-Sq (approx.) = 833.06, df = 56, p-value < 2.2e-16

Hasil ini menunjukkan bahwa p-value<0.05 maka tolak H0, artinya varians tidak homogen, sehingga asumsi homogenitas varians tidak terpenuhi

2. Uji Signifikansi

a. uji signifikansi variabel secara keseluruhan

manova_model <- manova(as.matrix(df[, numeric_cols]) ~ Sleep.Disorder, data = df)
summary(manova_model, test = "Wilks")
##                 Df   Wilks approx F num Df den Df    Pr(>F)    
## Sleep.Disorder   2 0.26577   49.001     14    730 < 2.2e-16 ***
## Residuals      371                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p-value < 0.05 → Tolak H0: Terdapat paling tidak satu kombinasi linier dari variabel numerik yang dapat membedakan antar kelas Sleep Disorder.

b. Uji signifikansi per variabel

Gunakan ANOVA untuk melihat variabel mana yang signifikan secara individual:

anova_results <- sapply(numeric_cols, function(x) {
  model <- aov(as.formula(paste(x, "~ Sleep.Disorder")), data = df)
  summary(model)[[1]][["Pr(>F)"]][1]
})

anova_df <- data.frame(p_value = anova_results)
anova_df
##                              p_value
## Age                     8.852105e-23
## Sleep.Duration          1.626151e-13
## Physical.Activity.Level 6.306184e-18
## Stress.Level            1.520459e-03
## Heart.Rate              3.510225e-13
## Daily.Steps             7.942397e-11
## BP_Systolic             1.198027e-62

Semua variabel memiliki p-value < 0.05, artinya terdapat perbedaan rata-rata yang signifikan secara statistik untuk semua variabel numerik berdasarkan kelompok Sleep Disorder.

3. Pembentukan Model

library(MASS)
lda_model <- lda(Sleep.Disorder ~ ., data = df[, c(numeric_cols, "Sleep.Disorder")])
print(lda_model)
## Call:
## lda(Sleep.Disorder ~ ., data = df[, c(numeric_cols, "Sleep.Disorder")])
## 
## Prior probabilities of groups:
##        None Sleep Apnea    Insomnia 
##   0.5855615   0.2085561   0.2058824 
## 
## Group means:
##                  Age Sleep.Duration Physical.Activity.Level Stress.Level
## None        39.03653       7.358447                57.94977     5.114155
## Sleep Apnea 49.70513       7.032051                74.79487     5.666667
## Insomnia    43.51948       6.589610                46.81818     5.870130
##             Heart.Rate Daily.Steps BP_Systolic
## None          69.01826    6852.968    124.0457
## Sleep Apnea   72.44872    7619.231    137.7692
## Insomnia      70.14286    5901.299    132.0390
## 
## Coefficients of linear discriminants:
##                                   LD1           LD2
## Age                      6.793810e-02  0.0182843850
## Sleep.Duration          -8.130318e-01 -1.3462678312
## Physical.Activity.Level  2.210341e-03  0.0227071520
## Stress.Level            -3.381003e-01  0.2983645430
## Heart.Rate               2.024337e-01 -0.3276790399
## Daily.Steps              3.755992e-05 -0.0007085623
## BP_Systolic              1.134453e-01  0.0002235590
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7977 0.2023

4. Akurasi

library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
lda_pred <- predict(lda_model, newdata = df[, numeric_cols])

# Confusion matrix
conf_matrix <- confusionMatrix(as.factor(lda_pred$class), as.factor(df$Sleep.Disorder))
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    None Sleep Apnea Insomnia
##   None         192           7        8
##   Sleep Apnea    6          65        6
##   Insomnia      21           6       63
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8556          
##                  95% CI : (0.8158, 0.8896)
##     No Information Rate : 0.5856          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7525          
##                                           
##  Mcnemar's Test P-Value : 0.1163          
## 
## Statistics by Class:
## 
##                      Class: None Class: Sleep Apnea Class: Insomnia
## Sensitivity               0.8767             0.8333          0.8182
## Specificity               0.9032             0.9595          0.9091
## Pos Pred Value            0.9275             0.8442          0.7000
## Neg Pred Value            0.8383             0.9562          0.9507
## Prevalence                0.5856             0.2086          0.2059
## Detection Rate            0.5134             0.1738          0.1684
## Detection Prevalence      0.5535             0.2059          0.2406
## Balanced Accuracy         0.8900             0.8964          0.8636

5. Visualisasi Hasil

Untuk mengetahui variabel mana yang berpengaruh terhadap gangguan tidur (Sleep Disorder), salah satu caranya adalah dengan melihat plot antara fungsi diskriminan.

plot(lda_model, col = as.integer(df[, c(numeric_cols, "Sleep.Disorder")]$Sleep.Disorder))

Dilihat dari plot diatas, dapat dikatakan model mampu mengelompokkan data dengan cukup baik walaupun terdapat overlap pada ketiga kategori. Dapat dilihat pula fungsi diskriminan LD1 berperan besar dalam membedakan antara kategori Sleep Disorder, sedangkan fungsi diskriminan LD2 tidak berperan besar dalam membedakan kategori Sleep Disorder.

E. Multinomial Regression

1. Pembentukan Model

if (!require(nnet)) install.packages("nnet")
## Loading required package: nnet
library(nnet)

model <- multinom(Sleep.Disorder ~ Age + Gender_Male + Occupation_Accountant + Occupation_Doctor + Occupation_Engineer +
                  Occupation_Lawyer + Occupation_Manager + Occupation_Nurse + Occupation_SalesRepresentative + Occupation_Salesperson +
                  Occupation_Scientist + Occupation_Teacher + BMI.Category_Overweight + BMI.Category_Obese + Sleep.Duration +
                  Physical.Activity.Level + Stress.Level + Heart.Rate + Daily.Steps + BP_Systolic,
                  data = df)
## # weights:  66 (42 variable)
## initial  value 410.880996 
## iter  10 value 232.513420
## iter  20 value 120.129947
## iter  30 value 113.374575
## iter  40 value 106.324151
## iter  50 value 104.656806
## iter  60 value 104.138955
## iter  70 value 104.104534
## iter  80 value 104.080944
## final  value 104.078396 
## converged
summary(model)
## Call:
## multinom(formula = Sleep.Disorder ~ Age + Gender_Male + Occupation_Accountant + 
##     Occupation_Doctor + Occupation_Engineer + Occupation_Lawyer + 
##     Occupation_Manager + Occupation_Nurse + Occupation_SalesRepresentative + 
##     Occupation_Salesperson + Occupation_Scientist + Occupation_Teacher + 
##     BMI.Category_Overweight + BMI.Category_Obese + Sleep.Duration + 
##     Physical.Activity.Level + Stress.Level + Heart.Rate + Daily.Steps + 
##     BP_Systolic, data = df)
## 
## Coefficients:
##             (Intercept)          Age Gender_Male Occupation_Accountant
## Sleep Apnea   -90.04220 -0.009096638  -0.2926666             -15.79232
## Insomnia      -93.06757  0.205253795   1.6394267              13.51101
##             Occupation_Doctor Occupation_Engineer Occupation_Lawyer
## Sleep Apnea         4.0354511            3.136961          3.332353
## Insomnia            0.2703967            5.618437          3.425211
##             Occupation_Manager Occupation_Nurse Occupation_SalesRepresentative
## Sleep Apnea         -14.666577         7.275768                       18.20412
## Insomnia             -6.201642         1.557523                      -17.82102
##             Occupation_Salesperson Occupation_Scientist Occupation_Teacher
## Sleep Apnea               7.498148             7.526992           8.656193
## Insomnia                  8.453898           -18.177851           7.871865
##             BMI.Category_Overweight BMI.Category_Obese Sleep.Duration
## Sleep Apnea              -0.5379552           47.15032       2.928505
## Insomnia                 -6.1155857           37.42976      -3.238395
##             Physical.Activity.Level Stress.Level Heart.Rate   Daily.Steps
## Sleep Apnea             -0.05838660   0.46028362  0.3989656  0.0008301033
## Insomnia                -0.02007838  -0.05453537  0.1822150 -0.0010739335
##             BP_Systolic
## Sleep Apnea   0.2300954
## Insomnia      0.7556189
## 
## Std. Errors:
##             (Intercept)        Age Gender_Male Occupation_Accountant
## Sleep Apnea 0.005460022 0.05977360  0.03373033          3.921671e-12
## Insomnia    0.004487709 0.05631287  0.02898579          1.991206e-02
##             Occupation_Doctor Occupation_Engineer Occupation_Lawyer
## Sleep Apnea        0.01852534          0.01386726       0.008904411
## Insomnia           0.01426357          0.03261901       0.021296859
##             Occupation_Manager Occupation_Nurse Occupation_SalesRepresentative
## Sleep Apnea       1.592468e-12       0.03449422                   1.042231e-14
## Insomnia          2.606316e-10       0.01676492                   1.038601e-14
##             Occupation_Salesperson Occupation_Scientist Occupation_Teacher
## Sleep Apnea            0.009559529         5.443588e-03        0.004355305
## Insomnia               0.041544018         3.695043e-13        0.007034011
##             BMI.Category_Overweight BMI.Category_Obese Sleep.Duration
## Sleep Apnea              0.03791723        0.003184474      0.1252124
## Insomnia                 0.06462222        0.003184474      0.1225856
##             Physical.Activity.Level Stress.Level Heart.Rate  Daily.Steps
## Sleep Apnea              0.02494531    0.1384985 0.09556256 0.0002878376
## Insomnia                 0.02370210    0.1721407 0.07180195 0.0003319279
##             BP_Systolic
## Sleep Apnea  0.06978849
## Insomnia     0.04760106
## 
## Residual Deviance: 208.1568 
## AIC: 292.1568

2. Uji Signifikansi Variabel

a. Uji Serentak

model_null <- multinom(Sleep.Disorder ~ 1, data = df, trace = FALSE)

anova_result <- anova(model_null, model, test = "Chisq")
print(anova_result)
## Likelihood ratio tests of Multinomial Models
## 
## Response: Sleep.Disorder
##                                                                                                                                                                                                                                                                                                                                                                                                     Model
## 1                                                                                                                                                                                                                                                                                                                                                                                                       1
## 2 Age + Gender_Male + Occupation_Accountant + Occupation_Doctor + Occupation_Engineer + Occupation_Lawyer + Occupation_Manager + Occupation_Nurse + Occupation_SalesRepresentative + Occupation_Salesperson + Occupation_Scientist + Occupation_Teacher + BMI.Category_Overweight + BMI.Category_Obese + Sleep.Duration + Physical.Activity.Level + Stress.Level + Heart.Rate + Daily.Steps + BP_Systolic
##   Resid. df Resid. Dev   Test    Df LR stat. Pr(Chi)
## 1       746   722.3373                              
## 2       706   208.1568 1 vs 2    40 514.1805       0

Berdasarkan uji Likelihood Ratio, terdapat perbedaan yang signifikan antara model intercept dan model penuh (χ²(40) = 514.18, p < 0.001). Hal ini menunjukkan bahwa variabel-variabel prediktor secara bersama-sama berkontribusi signifikan dalam memprediksi gangguan tidur (Sleep Disorder).

b. Uji Parsial

model_summary <- summary(model)

coefs <- model_summary$coefficients
std_err <- model_summary$standard.errors

z_values <- coefs / std_err
p_values <- 2 * (1 - pnorm(abs(z_values)))

print(round(p_values, 4))
##             (Intercept)    Age Gender_Male Occupation_Accountant
## Sleep Apnea           0 0.8790           0                     0
## Insomnia              0 0.0003           0                     0
##             Occupation_Doctor Occupation_Engineer Occupation_Lawyer
## Sleep Apnea                 0                   0                 0
## Insomnia                    0                   0                 0
##             Occupation_Manager Occupation_Nurse Occupation_SalesRepresentative
## Sleep Apnea                  0                0                              0
## Insomnia                     0                0                              0
##             Occupation_Salesperson Occupation_Scientist Occupation_Teacher
## Sleep Apnea                      0                    0                  0
## Insomnia                         0                    0                  0
##             BMI.Category_Overweight BMI.Category_Obese Sleep.Duration
## Sleep Apnea                       0                  0              0
## Insomnia                          0                  0              0
##             Physical.Activity.Level Stress.Level Heart.Rate Daily.Steps
## Sleep Apnea                  0.0193       0.0009     0.0000      0.0039
## Insomnia                     0.3969       0.7514     0.0112      0.0012
##             BP_Systolic
## Sleep Apnea       0.001
## Insomnia          0.000

Berdasarkan uji parsial, ditemukan bahwa sebagian besar variabel prediktor berkontribusi signifikan secara statistik terhadap probabilitas individu mengalami gangguan tidur, baik Sleep Apnea maupun Insomnia.

Untuk Sleep Apnea, variabel seperti Heart Rate (p = 0.0000), Stress Level (p = 0.0009), dan Daily Steps (p = 0.0039) menunjukkan pengaruh individual yang signifikan terhadap prediksi gangguan tidur. Sementara itu, variabel seperti Age (p = 0.8790) tidak menunjukkan kontribusi signifikan secara individual.

Sedangkan untuk Insomnia, variabel seperti Age (p = 0.0003), Heart Rate (p = 0.0112), dan BP_Systolic (p = 0.0000) berperan penting dalam model prediktif, sedangkan variabel seperti Physical Activity Level (p = 0.3969) dan Stress Level (p = 0.7514) tidak berkontribusi secara signifikan secara individual.

3. Akurasi

mlr_pred <- predict(model, newdata = df)

library(caret)
conf_matrix <- confusionMatrix(mlr_pred, df$Sleep.Disorder)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    None Sleep Apnea Insomnia
##   None         210           5        7
##   Sleep Apnea    5          71        3
##   Insomnia       4           2       67
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9305          
##                  95% CI : (0.8998, 0.9541)
##     No Information Rate : 0.5856          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8776          
##                                           
##  Mcnemar's Test P-Value : 0.7969          
## 
## Statistics by Class:
## 
##                      Class: None Class: Sleep Apnea Class: Insomnia
## Sensitivity               0.9589             0.9103          0.8701
## Specificity               0.9226             0.9730          0.9798
## Pos Pred Value            0.9459             0.8987          0.9178
## Neg Pred Value            0.9408             0.9763          0.9668
## Prevalence                0.5856             0.2086          0.2059
## Detection Rate            0.5615             0.1898          0.1791
## Detection Prevalence      0.5936             0.2112          0.1952
## Balanced Accuracy         0.9407             0.9416          0.9250

4. Interpretasi (Odds Ratio)

data.frame(exp(summary(model)$coefficients))
##             X.Intercept.       Age Gender_Male Occupation_Accountant
## Sleep Apnea 7.855423e-40 0.9909446   0.7462709          1.385106e-07
## Insomnia    3.813004e-41 1.2278366   5.1522150          7.374905e+05
##             Occupation_Doctor Occupation_Engineer Occupation_Lawyer
## Sleep Apnea         56.568431            23.03376          28.00415
## Insomnia             1.310484           275.45837          30.72912
##             Occupation_Manager Occupation_Nurse Occupation_SalesRepresentative
## Sleep Apnea       4.269592e-07      1444.859837                   8.052818e+07
## Insomnia          2.026100e-03         4.747049                   1.821502e-08
##             Occupation_Salesperson Occupation_Scientist Occupation_Teacher
## Sleep Apnea               1804.697         1.857511e+03           5745.617
## Insomnia                  4693.331         1.274851e-08           2622.452
##             BMI.Category_Overweight BMI.Category_Obese Sleep.Duration
## Sleep Apnea             0.583941052       3.000030e+20    18.69965262
## Insomnia                0.002208182       1.801106e+16     0.03922682
##             Physical.Activity.Level Stress.Level Heart.Rate Daily.Steps
## Sleep Apnea               0.9432852     1.584523   1.490282   1.0008304
## Insomnia                  0.9801218     0.946925   1.199872   0.9989266
##             BP_Systolic
## Sleep Apnea    1.258720
## Insomnia       2.128929

Berdasarkan hasil odds ratio, beberapa variabel memiliki pengaruh besar terhadap peningkatan risiko Sleep Apnea dan Insomnia:

A. Sleep Apnea
- Occupation_Teacher (OR = 5745.62), Occupation_Scientist (OR = 1857.51), Occupation_Salesperson (OR = 1804.70), dan Occupation_Nurse (OR = 1444.86) menunjukkan bahwa individu dengan pekerjaan ini memiliki peluang yang sangat tinggi mengalami Sleep Apnea dibandingkan kategori referensi (Occupation_SoftwareEngineer).
- Sleep Duration (OR = 18.70) dan BMI_Obese (OR = 3.000030e+20) juga berkontribusi signifikan menaikkan risiko Sleep Apnea.
- Physical Activity Level (OR = 0.9432852) menurunkan risiko Sleep Apnea.

B. Insomnia
- Occupation_Salesperson (OR = 4693.33) dan Occupation_Teacher (OR = 2622.45) memiliki pengaruh besar terhadap peningkatan risiko Insomnia.
- BMI_Obese (OR = 1.801106e+16) sangat meningkatkan risiko Insomnia.
- Sleep Duration (OR = 0.039) dan Daily Steps (OR = 0.9989266) menurunkan risiko Insomnia.