A. Informasi Dataset

Dataset yang digunakan bersumber dari Kaggle dan berisi informasi gaya hidup dan kesehatan individu, seperti aktivitas fisik, tekanan darah, serta gangguan tidur. Tujuan analisis ini adalah mengklasifikasi jenis gangguan tidur berdasarkan faktor gaya hidup.

library(DT)

## Warning: package 'DT' was built under R version 4.4.3

df <- read.csv("Sleep_health_and_lifestyle_dataset.csv")
datatable(df)

Statistik Deskriptif

library(psych)

## Warning: package 'psych' was built under R version 4.4.3

describe(df)

##                         vars   n    mean      sd median trimmed     mad    min
## Person.ID                  1 374  187.50  108.11  187.5  187.50  138.62    1.0
## Gender*                    2 374    1.51    0.50    2.0    1.51    0.00    1.0
## Age                        3 374   42.18    8.67   43.0   41.84   10.38   27.0
## Occupation*                4 374    4.77    3.06    4.0    4.47    2.97    1.0
## Sleep.Duration             5 374    7.13    0.80    7.2    7.12    1.04    5.8
## Quality.of.Sleep           6 374    7.31    1.20    7.0    7.32    1.48    4.0
## Physical.Activity.Level    7 374   59.17   20.83   60.0   58.97   22.24   30.0
## Stress.Level               8 374    5.39    1.77    5.0    5.36    2.97    3.0
## BMI.Category*              9 374    2.30    1.43    1.0    2.25    0.00    1.0
## Blood.Pressure*           10 374   14.11    7.10   16.0   14.48    8.90    1.0
## Heart.Rate                11 374   70.17    4.14   70.0   69.74    2.97   65.0
## Daily.Steps               12 374 6816.84 1617.92 7000.0 6732.67 1482.60 3000.0
## Sleep.Disorder*           13 374    2.00    0.64    2.0    2.00    0.00    1.0
##                             max  range  skew kurtosis    se
## Person.ID                 374.0  373.0  0.00    -1.21  5.59
## Gender*                     2.0    1.0 -0.02    -2.00  0.03
## Age                        59.0   32.0  0.26    -0.92  0.45
## Occupation*                11.0   10.0  0.74    -0.53  0.16
## Sleep.Duration              8.5    2.7  0.04    -1.29  0.04
## Quality.of.Sleep            9.0    5.0 -0.21    -0.77  0.06
## Physical.Activity.Level    90.0   60.0  0.07    -1.27  1.08
## Stress.Level                8.0    5.0  0.15    -1.33  0.09
## BMI.Category*               4.0    3.0  0.28    -1.85  0.07
## Blood.Pressure*            25.0   24.0 -0.18    -0.96  0.37
## Heart.Rate                 86.0   21.0  1.22     2.21  0.21
## Daily.Steps             10000.0 7000.0  0.18    -0.42 83.66
## Sleep.Disorder*             3.0    2.0  0.00    -0.60  0.03

Nilai Unik Pada Kolom Kategorikal

cat_cols <- sapply(df, function(x) is.factor(x) || is.character(x))

for (col_name in names(df)[cat_cols]) {
  cat("Kolom:", col_name, "\n")
  print(table(df[[col_name]]))
  cat("\n---------------------\n\n")
}

## Kolom: Gender 
## 
## Female   Male 
##    185    189 
## 
## ---------------------
## 
## Kolom: Occupation 
## 
##           Accountant               Doctor             Engineer 
##                   37                   71                   63 
##               Lawyer              Manager                Nurse 
##                   47                    1                   73 
## Sales Representative          Salesperson            Scientist 
##                    2                   32                    4 
##    Software Engineer              Teacher 
##                    4                   40 
## 
## ---------------------
## 
## Kolom: BMI.Category 
## 
##        Normal Normal Weight         Obese    Overweight 
##           195            21            10           148 
## 
## ---------------------
## 
## Kolom: Blood.Pressure 
## 
## 115/75 115/78 117/76 118/75 118/76 119/77 120/80 121/79 122/80 125/80 125/82 
##     32      2      2      2      1      2     45      1      1     65      4 
## 126/83 128/84 128/85 129/84 130/85 130/86 131/86 132/87 135/88 135/90 139/91 
##      2      2      3      2     99      2      2      3      2     27      2 
## 140/90 140/95 142/92 
##      4     65      2 
## 
## ---------------------
## 
## Kolom: Sleep.Disorder 
## 
##    Insomnia        None Sleep Apnea 
##          77         219          78 
## 
## ---------------------

Cek Missing Value

missing_counts <- colSums(is.na(df))
print(missing_counts)

##               Person.ID                  Gender                     Age 
##                       0                       0                       0 
##              Occupation          Sleep.Duration        Quality.of.Sleep 
##                       0                       0                       0 
## Physical.Activity.Level            Stress.Level            BMI.Category 
##                       0                       0                       0 
##          Blood.Pressure              Heart.Rate             Daily.Steps 
##                       0                       0                       0 
##          Sleep.Disorder 
##                       0

B. Preprocessing Dataset

1. Hapus kolom yang tidak diperlukan

df$Person.ID <- NULL

2. Menangani data inkonsisten

Pada kolom BMI Category terdapat nilai yang tidak konsisten, yaitu Normal dan Normal Weight, serta Overweight dan Obese. Nilai Normal Weight diubah menjadi Normal, dan nilai Obese diubah menjadi Overweight agar konsisten.

df$BMI.Category <- ifelse(df$BMI.Category == "Normal Weight", "Normal", df$BMI.Category)
df$BMI.Category <- ifelse(df$BMI.Category == "Obese", "Overweight", df$BMI.Category)

3. Memisahkan kolom `Blood Pressure` menjadi `BP_Systolic` dan `BP_Diastolic`

Tipe data pada kolom Blood Pressure merupakan kategorikal yang dimana berupa angka untuk merepresentasikan tekanan darah dalam sistolik dan diastolik. Pada data baris pertama yaitu 126/83, dimana nilai 126 adalah nilai tekanan darah sistolik dan nilai 83 adalah nilai tekanan darah diastolik. Nilai ini kemudian akan dipecah menjadi kolom BP_Systolic dan BP_Diastolic agar menjadi tipe data numerik.

bp_split <- strsplit(as.character(df$Blood.Pressure), "/")
df$BP_Systolic <- as.numeric(sapply(bp_split, `[`, 1))
df$BP_Diastolic <- as.numeric(sapply(bp_split, `[`, 2))

df$Blood.Pressure <- NULL

sleep_col <- df$Sleep.Disorder          
df$Sleep.Disorder <- NULL                
df$Sleep.Disorder <- sleep_col

4. Cek dan Handling Outlier

numeric_cols <- c("Age", "Sleep.Duration", "Quality.of.Sleep", "Physical.Activity.Level", "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic", "BP_Diastolic")

par(mfrow = c(2, ceiling(length(numeric_cols) / 2)))

for (col in numeric_cols) {
  boxplot(df[[col]], main = col, col = "skyblue", horizontal = TRUE)
}

par(mfrow = c(1,1))

detect_outliers <- function(x) {
  z <- scale(x)
  which(abs(z) > 3)
}

outlier_index <- list()
for (var in numeric_cols) {
  outlier_index[[var]] <- detect_outliers(df[[var]])
  cat(var, ":", length(outlier_index[[var]]), "outlier\n")
}

## Age : 0 outlier
## Sleep.Duration : 0 outlier
## Quality.of.Sleep : 0 outlier
## Physical.Activity.Level : 0 outlier
## Stress.Level : 0 outlier
## Heart.Rate : 9 outlier
## Daily.Steps : 0 outlier
## BP_Systolic : 0 outlier
## BP_Diastolic : 0 outlier

Outlier diimputasi menggunakan nilai batas bawah atau batas atas.

replace_outlier_with_bound <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  
  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val
  
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}

numeric_cols <- c("Age", "Sleep.Duration", "Quality.of.Sleep", "Physical.Activity.Level", "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic", "BP_Diastolic")
df[numeric_cols] <- lapply(df[numeric_cols], replace_outlier_with_bound)

detect_outliers <- function(x) {
  z <- scale(x)
  which(abs(z) > 3)
}

outlier_index <- list()
for (var in numeric_cols) {
  outlier_index[[var]] <- detect_outliers(df[[var]])
  cat(var, ":", length(outlier_index[[var]]), "outlier\n")
}

## Age : 0 outlier
## Sleep.Duration : 0 outlier
## Quality.of.Sleep : 0 outlier
## Physical.Activity.Level : 0 outlier
## Stress.Level : 0 outlier
## Heart.Rate : 0 outlier
## Daily.Steps : 0 outlier
## BP_Systolic : 0 outlier
## BP_Diastolic : 0 outlier

5. One Hot Encoding pada Kolom Kategorikal (Dummy Variabel)

Beberapa kolom kategorikal perlu diubah menjadi representasi numerik agar bisa digunakan dalam model Multinomial Regression. Proses One Hot Encoding dilakukan pada kolom Gender, Occupation, dan BMI Category, dengan pendekatan baseline agar menghindari dummy variable trap. Berikut baseline yang digunakan:

Gender: Female
Occupation: Manager
BMI Category: Normal

df$Gender <- factor(df$Gender, levels = c("Female", "Male"))
df$Occupation <- factor(df$Occupation, levels = c("Manager", sort(unique(df$Occupation))[sort(unique(df$Occupation)) != "Manager"]))
df$BMI.Category <- factor(df$BMI.Category, levels = c("Normal", "Overweight"))

gender_dummies <- model.matrix(~ Gender, data = df)[, -1, drop = FALSE]
occupation_dummies <- model.matrix(~ Occupation, data = df)[, -1, drop = FALSE]
bmi_dummies <- model.matrix(~ BMI.Category, data = df)[, -1, drop = FALSE]

colnames(gender_dummies) <- gsub("Gender", "Gender_", colnames(gender_dummies))
colnames(occupation_dummies) <- gsub("Occupation", "Occupation_", colnames(occupation_dummies))
colnames(bmi_dummies) <- gsub("BMI.Category", "BMI.Category_", colnames(bmi_dummies))

colnames(gender_dummies) <- gsub("[() ]", "", colnames(gender_dummies))
colnames(occupation_dummies) <- gsub("[() ]", "", colnames(occupation_dummies))
colnames(bmi_dummies) <- gsub("[() ]", "", colnames(bmi_dummies))

df$Gender <- NULL
df$Occupation <- NULL
df$BMI.Category <- NULL

df <- cbind(df, gender_dummies, occupation_dummies, bmi_dummies)

Untuk menghindari sparsity dan menjaga balance antar kategori, pekerjaan dikelompokkan menjadi bidang kerja yang lebih umum.

Occ_Healthcare: Doctor, Nurse
Occ_Public_Service: Lawyer, Teacher
Occ_STEM_Technical: Engineer, Scientist, Software Engineer
Occ_Business_Sales (baseline): Accountant, Manager, SalesRepresentative, Salesperson

df$Occ_Healthcare <- with(df, Occupation_Doctor + Occupation_Nurse)
df$Occ_Public_Service <- with(df, Occupation_Lawyer + Occupation_Teacher)
df$Occ_STEM_Technical <- with(df, 
  Occupation_Engineer + Occupation_Scientist + Occupation_SoftwareEngineer
)

df <- df[, !grepl("^Occupation_", names(df))]

# atur urutan kolom
new_order <- c(
  "Age",
  "Gender_Male",
  grep("^Occ_", names(df), value = TRUE),
  grep("^BMI.Category_", names(df), value = TRUE),
  "Sleep.Duration",
  "Quality.of.Sleep",
  "Physical.Activity.Level",
  "Stress.Level",
  "Heart.Rate",
  "Daily.Steps",
  "BP_Systolic",
  "BP_Diastolic",
  "Sleep.Disorder"
)

cols_existing <- names(df)
cols_to_keep <- intersect(new_order, cols_existing)
cols_leftover <- setdiff(cols_existing, cols_to_keep)

df <- df[, c(cols_to_keep, cols_leftover)]

6. Unordered Factorization pada label `Sleep Disorder`

Label kelas Sleep Disorder dikonversi dari bentuk string ke dalam tipe data faktor (unordered) agar algoritma klasifikasi seperti Analisis Diskriminan Linear (LDA) dan Regresi Multinomial dapat mengenali bahwa target merupakan kelas diskrit (nominal), bukan data numerik berurutan (ordinal).

df$Sleep.Disorder <- factor(df$Sleep.Disorder, levels = c("None", "Sleep Apnea", "Insomnia"))

C. Exploratory Data Analysis

Distribusi Sleep Disorder

table(df$Sleep.Disorder)

## 
##        None Sleep Apnea    Insomnia 
##         219          78          77

barplot(table(df$Sleep.Disorder), col = "steelblue", main = "Distribusi Sleep Disorder")

Heatmap Korelasi

library(corrplot)

## corrplot 0.95 loaded

exclude_prefixes <- c("Gender_", "Occ_", "BMI.Category_")
numeric_df <- df[sapply(df, is.numeric)]
numeric_df <- numeric_df[ , !grepl(paste0("^(", paste(exclude_prefixes, collapse = "|"), ")"), names(numeric_df))]

corrplot::corrplot(cor(numeric_df), method = "color", tl.cex = 1, addCoef.col = "black", number.cex = 0.7)

Berdasarkan hasil visualisasi korelasi, ditemukan bahwa BP_Systolic dan BP_Diastolic memiliki korelasi tinggi (>0.85). Untuk menghindari masalah multikolinearitas, variabel BP_Diastolic dihapus dari dataset. Begitu juga dengan Sleep.Duration dan Quality.of.Sleep yang berkorelasi tinggi, variabel Quality.of.Sleep dihapus dari dataset.

df$BP_Diastolic <- NULL
df$Quality.of.Sleep <- NULL

Visualisasi MDS

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.3

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

numeric_cols <- c("Age", "Sleep.Duration", "Physical.Activity.Level", 
                  "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic")

data_numeric <- df[, numeric_cols]
dist_matrix <- dist(scale(data_numeric))  

mds_result <- cmdscale(dist_matrix, k = 2)

mds_df <- data.frame(Dim1 = mds_result[,1],
                     Dim2 = mds_result[,2],
                     Sleep.Disorder = df$Sleep.Disorder)


ggplot(mds_df, aes(x = Dim1, y = Dim2, color = Sleep.Disorder)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "MDS Visualization (Euclidean Distance)",
       x = "Dimension 1", y = "Dimension 2") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Dataset Bersih Setelah Preprocessing

datatable(df)

Statistik Deskriptif Setelah Preprocessing

library(psych)
describe(df)

##                         vars   n    mean      sd median trimmed     mad    min
## Age                        1 374   42.18    8.67   43.0   41.84   10.38   27.0
## Gender_Male                2 374    0.51    0.50    1.0    0.51    0.00    0.0
## Occ_Healthcare             3 374    0.39    0.49    0.0    0.36    0.00    0.0
## Occ_Public_Service         4 374    0.23    0.42    0.0    0.17    0.00    0.0
## Occ_STEM_Technical         5 374    0.19    0.39    0.0    0.11    0.00    0.0
## BMI.Category_Overweight    6 374    0.42    0.49    0.0    0.40    0.00    0.0
## Sleep.Duration             7 374    7.13    0.80    7.2    7.12    1.04    5.8
## Physical.Activity.Level    8 374   59.17   20.83   60.0   58.97   22.24   30.0
## Stress.Level               9 374    5.39    1.77    5.0    5.36    2.97    3.0
## Heart.Rate                10 374   69.97    3.57   70.0   69.74    2.97   65.0
## Daily.Steps               11 374 6816.84 1617.92 7000.0 6732.67 1482.60 3000.0
## BP_Systolic               12 374  128.55    7.75  130.0  128.78    7.41  115.0
## Sleep.Disorder*           13 374    1.62    0.81    1.0    1.53    0.00    1.0
##                             max  range  skew kurtosis    se
## Age                        59.0   32.0  0.26    -0.92  0.45
## Gender_Male                 1.0    1.0 -0.02    -2.00  0.03
## Occ_Healthcare              1.0    1.0  0.47    -1.78  0.03
## Occ_Public_Service          1.0    1.0  1.26    -0.41  0.02
## Occ_STEM_Technical          1.0    1.0  1.58     0.48  0.02
## BMI.Category_Overweight     1.0    1.0  0.31    -1.91  0.03
## Sleep.Duration              8.5    2.7  0.04    -1.29  0.04
## Physical.Activity.Level    90.0   60.0  0.07    -1.27  1.08
## Stress.Level                8.0    5.0  0.15    -1.33  0.09
## Heart.Rate                 78.0   13.0  0.47    -0.39  0.18
## Daily.Steps             10000.0 7000.0  0.18    -0.42 83.66
## BP_Systolic               142.0   27.0 -0.04    -0.91  0.40
## Sleep.Disorder*             3.0    2.0  0.79    -1.01  0.04

D. Linear Discriminant Analysis

1. Uji Asumsi

a. Uji Asumsi Normal Multivariat

if (!require(MVN)) install.packages("MVN")

## Loading required package: MVN

## Warning: package 'MVN' was built under R version 4.4.3

library(MVN)

numeric_cols <- c("Age", "Sleep.Duration", "Physical.Activity.Level", "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic")
mvn(df[numeric_cols], mvnTest = "hz", univariateTest = "AD", multivariatePlot = "qq")

## $multivariateNormality
##            Test       HZ p value MVN
## 1 Henze-Zirkler 21.37807       0  NO
## 
## $univariateNormality
##               Test                Variable Statistic   p value Normality
## 1 Anderson-Darling           Age              3.9372  <0.001      NO    
## 2 Anderson-Darling     Sleep.Duration         7.3040  <0.001      NO    
## 3 Anderson-Darling Physical.Activity.Level   11.3729  <0.001      NO    
## 4 Anderson-Darling      Stress.Level         12.6410  <0.001      NO    
## 5 Anderson-Darling       Heart.Rate           9.1373  <0.001      NO    
## 6 Anderson-Darling       Daily.Steps          8.9070  <0.001      NO    
## 7 Anderson-Darling       BP_Systolic          9.0151  <0.001      NO    
## 
## $Descriptives
##                           n        Mean      Std.Dev Median    Min     Max
## Age                     374   42.184492    8.6731335   43.0   27.0    59.0
## Sleep.Duration          374    7.132086    0.7956567    7.2    5.8     8.5
## Physical.Activity.Level 374   59.171123   20.8308037   60.0   30.0    90.0
## Stress.Level            374    5.385027    1.7745264    5.0    3.0     8.0
## Heart.Rate              374   69.965241    3.5672581   70.0   65.0    78.0
## Daily.Steps             374 6816.844920 1617.9156791 7000.0 3000.0 10000.0
## BP_Systolic             374  128.553476    7.7481176  130.0  115.0   142.0
##                            25th   75th        Skew   Kurtosis
## Age                       35.25   50.0  0.25516254 -0.9248041
## Sleep.Duration             6.40    7.8  0.03725369 -1.2945281
## Physical.Activity.Level   45.00   75.0  0.07389048 -1.2744695
## Stress.Level               4.00    7.0  0.15309385 -1.3345700
## Heart.Rate                68.00   72.0  0.46555442 -0.3878657
## Daily.Steps             5600.00 8000.0  0.17684985 -0.4186421
## BP_Systolic              125.00  135.0 -0.03538326 -0.9088701

Hasil ini menunjukkan bahwa p-value<0.05 maka tolak H0, artinya populasi tidak berdistribusi normal multivariat, sehingga asumsi normal multivariat tidak terpenuhi

b. Uji Asumsi Kehomogenan Varians

library(biotools)

## Warning: package 'biotools' was built under R version 4.4.3

## Loading required package: MASS

## ---
## biotools version 4.3

box_m_result <- boxM(df[, numeric_cols], grouping = df$Sleep.Disorder)

print(box_m_result)

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  df[, numeric_cols]
## Chi-Sq (approx.) = 833.06, df = 56, p-value < 2.2e-16

Hasil ini menunjukkan bahwa p-value<0.05 maka tolak H0, artinya varians tidak homogen, sehingga asumsi homogenitas varians tidak terpenuhi

2. Uji Signifikansi

a. Uji Signifikansi Variabel Secara Keseluruhan

manova_model <- manova(as.matrix(df[, numeric_cols]) ~ Sleep.Disorder, data = df)
summary(manova_model, test = "Wilks")

##                 Df   Wilks approx F num Df den Df    Pr(>F)    
## Sleep.Disorder   2 0.26577   49.001     14    730 < 2.2e-16 ***
## Residuals      371                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p-value < 0.05 → Tolak H0: Terdapat paling tidak satu kombinasi linier dari variabel numerik yang dapat membedakan antar kelas Sleep Disorder.

b. Uji Signifikansi Per Variabel

anova_results <- sapply(numeric_cols, function(x) {
  model <- aov(as.formula(paste(x, "~ Sleep.Disorder")), data = df)
  summary(model)[[1]][["Pr(>F)"]][1]
})

anova_df <- data.frame(p_value = anova_results)
anova_df

##                              p_value
## Age                     8.852105e-23
## Sleep.Duration          1.626151e-13
## Physical.Activity.Level 6.306184e-18
## Stress.Level            1.520459e-03
## Heart.Rate              3.510225e-13
## Daily.Steps             7.942397e-11
## BP_Systolic             1.198027e-62

Semua variabel memiliki p-value < 0.05, artinya terdapat perbedaan rata-rata yang signifikan secara statistik untuk semua variabel numerik berdasarkan kelompok Sleep Disorder.

3. Pembentukan Model

library(MASS)
lda_model <- lda(Sleep.Disorder ~ ., data = df[, c(numeric_cols, "Sleep.Disorder")])
print(lda_model)

## Call:
## lda(Sleep.Disorder ~ ., data = df[, c(numeric_cols, "Sleep.Disorder")])
## 
## Prior probabilities of groups:
##        None Sleep Apnea    Insomnia 
##   0.5855615   0.2085561   0.2058824 
## 
## Group means:
##                  Age Sleep.Duration Physical.Activity.Level Stress.Level
## None        39.03653       7.358447                57.94977     5.114155
## Sleep Apnea 49.70513       7.032051                74.79487     5.666667
## Insomnia    43.51948       6.589610                46.81818     5.870130
##             Heart.Rate Daily.Steps BP_Systolic
## None          69.01826    6852.968    124.0457
## Sleep Apnea   72.44872    7619.231    137.7692
## Insomnia      70.14286    5901.299    132.0390
## 
## Coefficients of linear discriminants:
##                                   LD1           LD2
## Age                      6.793810e-02  0.0182843850
## Sleep.Duration          -8.130318e-01 -1.3462678312
## Physical.Activity.Level  2.210341e-03  0.0227071520
## Stress.Level            -3.381003e-01  0.2983645430
## Heart.Rate               2.024337e-01 -0.3276790399
## Daily.Steps              3.755992e-05 -0.0007085623
## BP_Systolic              1.134453e-01  0.0002235590
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7977 0.2023

4. Akurasi

library(caret)

## Warning: package 'caret' was built under R version 4.4.3

## Loading required package: lattice

lda_pred <- predict(lda_model, newdata = df[, numeric_cols])

# Confusion matrix
conf_matrix <- confusionMatrix(as.factor(lda_pred$class), as.factor(df$Sleep.Disorder))
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    None Sleep Apnea Insomnia
##   None         192           7        8
##   Sleep Apnea    6          65        6
##   Insomnia      21           6       63
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8556          
##                  95% CI : (0.8158, 0.8896)
##     No Information Rate : 0.5856          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7525          
##                                           
##  Mcnemar's Test P-Value : 0.1163          
## 
## Statistics by Class:
## 
##                      Class: None Class: Sleep Apnea Class: Insomnia
## Sensitivity               0.8767             0.8333          0.8182
## Specificity               0.9032             0.9595          0.9091
## Pos Pred Value            0.9275             0.8442          0.7000
## Neg Pred Value            0.8383             0.9562          0.9507
## Prevalence                0.5856             0.2086          0.2059
## Detection Rate            0.5134             0.1738          0.1684
## Detection Prevalence      0.5535             0.2059          0.2406
## Balanced Accuracy         0.8900             0.8964          0.8636

5. Visualisasi Hasil

Untuk mengetahui variabel mana yang berpengaruh terhadap gangguan tidur (Sleep Disorder), salah satu caranya adalah dengan melihat plot antara fungsi diskriminan.

plot(lda_model, col = as.integer(df[, c(numeric_cols, "Sleep.Disorder")]$Sleep.Disorder))

Dilihat dari plot diatas, dapat dikatakan model mampu mengelompokkan data dengan cukup baik walaupun terdapat overlap pada ketiga kategori. Dapat dilihat pula fungsi diskriminan LD1 berperan besar dalam membedakan antara kategori Sleep Disorder, sedangkan fungsi diskriminan LD2 tidak berperan besar dalam membedakan kategori Sleep Disorder.

E. Multinomial Logistic Regression

1. Pembentukan Model

if (!require(nnet)) install.packages("nnet")

## Loading required package: nnet

library(nnet)

model <- multinom(Sleep.Disorder ~ Age + Gender_Male + Occ_Healthcare + Occ_Public_Service + Occ_STEM_Technical + BMI.Category_Overweight + Sleep.Duration + Physical.Activity.Level + Stress.Level + Heart.Rate + Daily.Steps + BP_Systolic, data = df)

## # weights:  42 (26 variable)
## initial  value 410.880996 
## iter  10 value 226.286744
## iter  20 value 138.268814
## iter  30 value 132.765309
## iter  40 value 124.569961
## iter  50 value 123.838692
## final  value 123.761033 
## converged

summary(model)

## Call:
## multinom(formula = Sleep.Disorder ~ Age + Gender_Male + Occ_Healthcare + 
##     Occ_Public_Service + Occ_STEM_Technical + BMI.Category_Overweight + 
##     Sleep.Duration + Physical.Activity.Level + Stress.Level + 
##     Heart.Rate + Daily.Steps + BP_Systolic, data = df)
## 
## Coefficients:
##             (Intercept)        Age Gender_Male Occ_Healthcare
## Sleep Apnea   -96.11430 -0.2312524  -2.0840663      -2.136957
## Insomnia      -19.70363  0.1894021  -0.2624637      -5.113922
##             Occ_Public_Service Occ_STEM_Technical BMI.Category_Overweight
## Sleep Apnea          -2.376695          -2.882580                3.105087
## Insomnia             -1.621694          -2.369454               -1.818253
##             Sleep.Duration Physical.Activity.Level Stress.Level  Heart.Rate
## Sleep Apnea       3.607737             -0.05500125     0.516683  0.30719542
## Insomnia         -2.724603              0.04208792     0.313132 -0.08586376
##               Daily.Steps BP_Systolic
## Sleep Apnea  0.0007353121   0.4154133
## Insomnia    -0.0013329768   0.3361679
## 
## Std. Errors:
##             (Intercept)        Age Gender_Male Occ_Healthcare
## Sleep Apnea 0.002914060 0.05162050 0.008273096    0.004494686
## Insomnia    0.002918059 0.04383869 0.007137535    0.017680286
##             Occ_Public_Service Occ_STEM_Technical BMI.Category_Overweight
## Sleep Apnea         0.01095502        0.004210419              0.02202583
## Insomnia            0.01804571        0.014221082              0.03066443
##             Sleep.Duration Physical.Activity.Level Stress.Level Heart.Rate
## Sleep Apnea     0.07299044              0.02265407   0.09326578 0.07485071
## Insomnia        0.06437371              0.01875402   0.08458350 0.05845578
##              Daily.Steps BP_Systolic
## Sleep Apnea 0.0002303539  0.05500110
## Insomnia    0.0002636961  0.04075037
## 
## Residual Deviance: 247.5221 
## AIC: 299.5221

2. Uji Signifikansi Variabel

a. Uji Serentak

model_null <- multinom(Sleep.Disorder ~ 1, data = df, trace = FALSE)

anova_result <- anova(model_null, model, test = "Chisq")
print(anova_result)

## Likelihood ratio tests of Multinomial Models
## 
## Response: Sleep.Disorder
##                                                                                                                                                                                                       Model
## 1                                                                                                                                                                                                         1
## 2 Age + Gender_Male + Occ_Healthcare + Occ_Public_Service + Occ_STEM_Technical + BMI.Category_Overweight + Sleep.Duration + Physical.Activity.Level + Stress.Level + Heart.Rate + Daily.Steps + BP_Systolic
##   Resid. df Resid. Dev   Test    Df LR stat. Pr(Chi)
## 1       746   722.3373                              
## 2       722   247.5221 1 vs 2    24 474.8152       0

Berdasarkan uji Likelihood Ratio, terdapat perbedaan yang signifikan antara model hanya dengan intercept dan model penuh (χ²(24) = 474.81, p < 0.001). Hal ini menunjukkan bahwa variabel-variabel prediktor secara bersama-sama berkontribusi signifikan dalam memprediksi gangguan tidur.

b. Uji Parsial

model_summary <- summary(model)

coefs <- model_summary$coefficients
std_err <- model_summary$standard.errors

z_values <- coefs / std_err
p_values <- 2 * (1 - pnorm(abs(z_values)))

print(round(p_values, 4))

##             (Intercept) Age Gender_Male Occ_Healthcare Occ_Public_Service
## Sleep Apnea           0   0           0              0                  0
## Insomnia              0   0           0              0                  0
##             Occ_STEM_Technical BMI.Category_Overweight Sleep.Duration
## Sleep Apnea                  0                       0              0
## Insomnia                     0                       0              0
##             Physical.Activity.Level Stress.Level Heart.Rate Daily.Steps
## Sleep Apnea                  0.0152        0e+00     0.0000      0.0014
## Insomnia                     0.0248        2e-04     0.1419      0.0000
##             BP_Systolic
## Sleep Apnea           0
## Insomnia              0

Berdasarkan uji parsial, ditemukan bahwa sebagian besar variabel menghasilkan p-value < 0.05 yang berarti sebagian besar prediktor berkontribusi signifikan secara statistik terhadap probabilitas individu mengalami gangguan tidur, kecuali variabel Heart Rate pada kelas Insomnia (p = 0.1419).

3. Akurasi

mlr_pred <- predict(model, newdata = df)

library(caret)
conf_matrix <- confusionMatrix(mlr_pred, df$Sleep.Disorder)
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    None Sleep Apnea Insomnia
##   None         208           6        8
##   Sleep Apnea    5          69        4
##   Insomnia       6           3       65
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9144          
##                  95% CI : (0.8814, 0.9407)
##     No Information Rate : 0.5856          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8494          
##                                           
##  Mcnemar's Test P-Value : 0.9146          
## 
## Statistics by Class:
## 
##                      Class: None Class: Sleep Apnea Class: Insomnia
## Sensitivity               0.9498             0.8846          0.8442
## Specificity               0.9097             0.9696          0.9697
## Pos Pred Value            0.9369             0.8846          0.8784
## Neg Pred Value            0.9276             0.9696          0.9600
## Prevalence                0.5856             0.2086          0.2059
## Detection Rate            0.5561             0.1845          0.1738
## Detection Prevalence      0.5936             0.2086          0.1979
## Balanced Accuracy         0.9297             0.9271          0.9069

4. Interpretasi (Odds Ratio)

data.frame(exp(summary(model)$coefficients))

##             X.Intercept.       Age Gender_Male Occ_Healthcare
## Sleep Apnea 1.811709e-42 0.7935392   0.1244232    0.118013450
## Insomnia    2.772189e-09 1.2085268   0.7691543    0.006012455
##             Occ_Public_Service Occ_STEM_Technical BMI.Category_Overweight
## Sleep Apnea         0.09285694         0.05599012               22.311167
## Insomnia            0.19756379         0.09353182                0.162309
##             Sleep.Duration Physical.Activity.Level Stress.Level Heart.Rate
## Sleep Apnea    36.88248465                0.946484     1.676458  1.3596066
## Insomnia        0.06557226                1.042986     1.367702  0.9177193
##             Daily.Steps BP_Systolic
## Sleep Apnea   1.0007356    1.514997
## Insomnia      0.9986679    1.399574

Berikut adalah interpretasi berdasarkan hasil odds ratio:

A. Sleep Apnea

Individu yang bekerja di bidang Occ_Healthcare (Doctor, Nurse), Occ_Public_Service (Lawyer, Teacher), dan Occ_STEM_Technical (Engineer, Software Engineer, Scientist) memiliki odds mengalami Sleep Apnea yang 90% lebih rendah dibanding pekerjaan Occ_Business_Sales (Accountant, Manager, SalesRepresentative, Salesperson) yang menjadi baseline.
Berdasarkan kondisi kesehatan, individu dengan BMI Overweight meningkatkan risiko Sleep Apnea sebesar 22 kali dibanding BMI Normal. Individu dengan Heart Rate dan BP_Systolic tinggi memiliki odds sekitar 35% dan 51% lebih tinggi mengalami Sleep Apnea.
Berdasarkan gaya hidup dan perilaku, individu dengan Physical Activity Level tinggi memiliki odds mengalami Sleep Apnea yang 6% lebih rendah. Individu dengan Stress Level tinggi memiliki odds sebesar 67% lebih tinggi mengalami Sleep Apnea.

B. Insomnia

Individu yang bekerja di bidang Occ_Healthcare (Doctor, Nurse), Occ_Public_Service (Lawyer, Teacher), dan Occ_STEM_Technical (Engineer, Software Engineer, Scientist) memiliki odds mengalami Insomnia sekitar 80-99% lebih rendah dibanding pekerjaan Occ_Business_Sales (Accountant, Manager, Sales Representative, Salesperson) yang menjadi baseline.
Berdasarkan kondisi kesehatan, individu dengan BMI Overweight memiliki odds mengalami Insomnia lebih rendah sebesar 84% dibanding BMI Normal. Individu dengan BP_Systolic tinggi memiliki odds sekitar 39% lebih tinggi mengalami Insomnia.
Berdasarkan gaya hidup dan perilaku, individu dengan Sleep Duration tinggi memiliki odds mengalami Insomnia yang 94% lebih rendah. Individu dengan Stress Level tinggi memiliki odds sebesar 36% lebih tinggi mengalami Insomnia.

Klasifikasi Gangguan Tidur Berdasarkan Faktor Gaya Hidup Menggunakan Analisis Diskriminan dan Regresi Multinomial

1. Ardina Shaifa R (23031554027)

2. Catrin Hony Oktavia M (23031554076)

3. Devina Gusriani (23031554208)

4. Gisella Serafina L M (23031554215)

2025-05-23

A. Informasi Dataset

Statistik Deskriptif

Nilai Unik Pada Kolom Kategorikal

Cek Missing Value

B. Preprocessing Dataset

1. Hapus kolom yang tidak diperlukan

2. Menangani data inkonsisten

3. Memisahkan kolom `Blood Pressure` menjadi `BP_Systolic` dan `BP_Diastolic`

4. Cek dan Handling Outlier

5. One Hot Encoding pada Kolom Kategorikal (Dummy Variabel)

6. Unordered Factorization pada label `Sleep Disorder`

C. Exploratory Data Analysis

Distribusi Sleep Disorder

Heatmap Korelasi

Visualisasi MDS

Dataset Bersih Setelah Preprocessing

Statistik Deskriptif Setelah Preprocessing

D. Linear Discriminant Analysis

1. Uji Asumsi

a. Uji Asumsi Normal Multivariat

b. Uji Asumsi Kehomogenan Varians

2. Uji Signifikansi

a. Uji Signifikansi Variabel Secara Keseluruhan

b. Uji Signifikansi Per Variabel

3. Pembentukan Model

4. Akurasi

5. Visualisasi Hasil

E. Multinomial Logistic Regression

1. Pembentukan Model

2. Uji Signifikansi Variabel

a. Uji Serentak

b. Uji Parsial

3. Akurasi

4. Interpretasi (Odds Ratio)

Klasifikasi Gangguan Tidur Berdasarkan Faktor Gaya Hidup Menggunakan Analisis Diskriminan dan Regresi Multinomial

1. Ardina Shaifa R (23031554027)

2. Catrin Hony Oktavia M (23031554076)

3. Devina Gusriani (23031554208)

4. Gisella Serafina L M (23031554215)

2025-05-23

A. Informasi Dataset

Statistik Deskriptif

Nilai Unik Pada Kolom Kategorikal

Cek Missing Value

B. Preprocessing Dataset

1. Hapus kolom yang tidak diperlukan

2. Menangani data inkonsisten

3. Memisahkan kolom Blood Pressure menjadi BP_Systolic dan BP_Diastolic

4. Cek dan Handling Outlier

5. One Hot Encoding pada Kolom Kategorikal (Dummy Variabel)

6. Unordered Factorization pada label Sleep Disorder

C. Exploratory Data Analysis

Distribusi Sleep Disorder

Heatmap Korelasi

Visualisasi MDS

Dataset Bersih Setelah Preprocessing

Statistik Deskriptif Setelah Preprocessing

D. Linear Discriminant Analysis

1. Uji Asumsi

a. Uji Asumsi Normal Multivariat

b. Uji Asumsi Kehomogenan Varians

2. Uji Signifikansi

a. Uji Signifikansi Variabel Secara Keseluruhan

b. Uji Signifikansi Per Variabel

3. Pembentukan Model

4. Akurasi

5. Visualisasi Hasil

E. Multinomial Logistic Regression

1. Pembentukan Model

2. Uji Signifikansi Variabel

a. Uji Serentak

b. Uji Parsial

3. Akurasi

4. Interpretasi (Odds Ratio)

3. Memisahkan kolom `Blood Pressure` menjadi `BP_Systolic` dan `BP_Diastolic`

6. Unordered Factorization pada label `Sleep Disorder`