Dataset yang digunakan bersumber dari Kaggle dan berisi informasi gaya hidup dan kesehatan individu, seperti aktivitas fisik, tekanan darah, serta gangguan tidur. Tujuan analisis ini adalah mengklasifikasi jenis gangguan tidur berdasarkan faktor gaya hidup.
library(DT)
## Warning: package 'DT' was built under R version 4.4.3
df <- read.csv("Sleep_health_and_lifestyle_dataset.csv")
datatable(df)
library(psych)
## Warning: package 'psych' was built under R version 4.4.3
describe(df)
## vars n mean sd median trimmed mad min
## Person.ID 1 374 187.50 108.11 187.5 187.50 138.62 1.0
## Gender* 2 374 1.51 0.50 2.0 1.51 0.00 1.0
## Age 3 374 42.18 8.67 43.0 41.84 10.38 27.0
## Occupation* 4 374 4.77 3.06 4.0 4.47 2.97 1.0
## Sleep.Duration 5 374 7.13 0.80 7.2 7.12 1.04 5.8
## Quality.of.Sleep 6 374 7.31 1.20 7.0 7.32 1.48 4.0
## Physical.Activity.Level 7 374 59.17 20.83 60.0 58.97 22.24 30.0
## Stress.Level 8 374 5.39 1.77 5.0 5.36 2.97 3.0
## BMI.Category* 9 374 2.30 1.43 1.0 2.25 0.00 1.0
## Blood.Pressure* 10 374 14.11 7.10 16.0 14.48 8.90 1.0
## Heart.Rate 11 374 70.17 4.14 70.0 69.74 2.97 65.0
## Daily.Steps 12 374 6816.84 1617.92 7000.0 6732.67 1482.60 3000.0
## Sleep.Disorder* 13 374 2.00 0.64 2.0 2.00 0.00 1.0
## max range skew kurtosis se
## Person.ID 374.0 373.0 0.00 -1.21 5.59
## Gender* 2.0 1.0 -0.02 -2.00 0.03
## Age 59.0 32.0 0.26 -0.92 0.45
## Occupation* 11.0 10.0 0.74 -0.53 0.16
## Sleep.Duration 8.5 2.7 0.04 -1.29 0.04
## Quality.of.Sleep 9.0 5.0 -0.21 -0.77 0.06
## Physical.Activity.Level 90.0 60.0 0.07 -1.27 1.08
## Stress.Level 8.0 5.0 0.15 -1.33 0.09
## BMI.Category* 4.0 3.0 0.28 -1.85 0.07
## Blood.Pressure* 25.0 24.0 -0.18 -0.96 0.37
## Heart.Rate 86.0 21.0 1.22 2.21 0.21
## Daily.Steps 10000.0 7000.0 0.18 -0.42 83.66
## Sleep.Disorder* 3.0 2.0 0.00 -0.60 0.03
cat_cols <- sapply(df, function(x) is.factor(x) || is.character(x))
for (col_name in names(df)[cat_cols]) {
cat("Kolom:", col_name, "\n")
print(table(df[[col_name]]))
cat("\n---------------------\n\n")
}
## Kolom: Gender
##
## Female Male
## 185 189
##
## ---------------------
##
## Kolom: Occupation
##
## Accountant Doctor Engineer
## 37 71 63
## Lawyer Manager Nurse
## 47 1 73
## Sales Representative Salesperson Scientist
## 2 32 4
## Software Engineer Teacher
## 4 40
##
## ---------------------
##
## Kolom: BMI.Category
##
## Normal Normal Weight Obese Overweight
## 195 21 10 148
##
## ---------------------
##
## Kolom: Blood.Pressure
##
## 115/75 115/78 117/76 118/75 118/76 119/77 120/80 121/79 122/80 125/80 125/82
## 32 2 2 2 1 2 45 1 1 65 4
## 126/83 128/84 128/85 129/84 130/85 130/86 131/86 132/87 135/88 135/90 139/91
## 2 2 3 2 99 2 2 3 2 27 2
## 140/90 140/95 142/92
## 4 65 2
##
## ---------------------
##
## Kolom: Sleep.Disorder
##
## Insomnia None Sleep Apnea
## 77 219 78
##
## ---------------------
missing_counts <- colSums(is.na(df))
print(missing_counts)
## Person.ID Gender Age
## 0 0 0
## Occupation Sleep.Duration Quality.of.Sleep
## 0 0 0
## Physical.Activity.Level Stress.Level BMI.Category
## 0 0 0
## Blood.Pressure Heart.Rate Daily.Steps
## 0 0 0
## Sleep.Disorder
## 0
df$Person.ID <- NULL
Pada kolom BMI Category
terdapat nilai yang tidak
konsisten, yaitu Normal
dan Normal Weight
,
serta Overweight
dan Obese
. Nilai
Normal Weight
diubah menjadi Normal
, dan nilai
Obese
diubah menjadi Overweight
agar
konsisten.
df$BMI.Category <- ifelse(df$BMI.Category == "Normal Weight", "Normal", df$BMI.Category)
df$BMI.Category <- ifelse(df$BMI.Category == "Obese", "Overweight", df$BMI.Category)
Blood Pressure
menjadi
BP_Systolic
dan BP_Diastolic
Tipe data pada kolom Blood Pressure
merupakan
kategorikal yang dimana berupa angka untuk merepresentasikan tekanan
darah dalam sistolik dan diastolik. Pada data baris pertama yaitu
126/83
, dimana nilai 126 adalah nilai tekanan darah
sistolik dan nilai 83 adalah nilai tekanan darah diastolik. Nilai ini
kemudian akan dipecah menjadi kolom BP_Systolic
dan
BP_Diastolic
agar menjadi tipe data numerik.
bp_split <- strsplit(as.character(df$Blood.Pressure), "/")
df$BP_Systolic <- as.numeric(sapply(bp_split, `[`, 1))
df$BP_Diastolic <- as.numeric(sapply(bp_split, `[`, 2))
df$Blood.Pressure <- NULL
sleep_col <- df$Sleep.Disorder
df$Sleep.Disorder <- NULL
df$Sleep.Disorder <- sleep_col
numeric_cols <- c("Age", "Sleep.Duration", "Quality.of.Sleep", "Physical.Activity.Level", "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic", "BP_Diastolic")
par(mfrow = c(2, ceiling(length(numeric_cols) / 2)))
for (col in numeric_cols) {
boxplot(df[[col]], main = col, col = "skyblue", horizontal = TRUE)
}
par(mfrow = c(1,1))
detect_outliers <- function(x) {
z <- scale(x)
which(abs(z) > 3)
}
outlier_index <- list()
for (var in numeric_cols) {
outlier_index[[var]] <- detect_outliers(df[[var]])
cat(var, ":", length(outlier_index[[var]]), "outlier\n")
}
## Age : 0 outlier
## Sleep.Duration : 0 outlier
## Quality.of.Sleep : 0 outlier
## Physical.Activity.Level : 0 outlier
## Stress.Level : 0 outlier
## Heart.Rate : 9 outlier
## Daily.Steps : 0 outlier
## BP_Systolic : 0 outlier
## BP_Diastolic : 0 outlier
Outlier diimputasi menggunakan nilai batas bawah atau batas atas.
replace_outlier_with_bound <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
numeric_cols <- c("Age", "Sleep.Duration", "Quality.of.Sleep", "Physical.Activity.Level", "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic", "BP_Diastolic")
df[numeric_cols] <- lapply(df[numeric_cols], replace_outlier_with_bound)
detect_outliers <- function(x) {
z <- scale(x)
which(abs(z) > 3)
}
outlier_index <- list()
for (var in numeric_cols) {
outlier_index[[var]] <- detect_outliers(df[[var]])
cat(var, ":", length(outlier_index[[var]]), "outlier\n")
}
## Age : 0 outlier
## Sleep.Duration : 0 outlier
## Quality.of.Sleep : 0 outlier
## Physical.Activity.Level : 0 outlier
## Stress.Level : 0 outlier
## Heart.Rate : 0 outlier
## Daily.Steps : 0 outlier
## BP_Systolic : 0 outlier
## BP_Diastolic : 0 outlier
Beberapa kolom kategorikal perlu diubah menjadi representasi numerik
agar bisa digunakan dalam model Multinomial Regression. Proses One Hot
Encoding dilakukan pada kolom Gender
,
Occupation
, dan BMI Category
, dengan
pendekatan baseline agar menghindari dummy variable trap. Berikut
baseline yang digunakan:
Gender
: FemaleOccupation
: ManagerBMI Category
: Normaldf$Gender <- factor(df$Gender, levels = c("Female", "Male"))
df$Occupation <- factor(df$Occupation, levels = c("Manager", sort(unique(df$Occupation))[sort(unique(df$Occupation)) != "Manager"]))
df$BMI.Category <- factor(df$BMI.Category, levels = c("Normal", "Overweight"))
gender_dummies <- model.matrix(~ Gender, data = df)[, -1, drop = FALSE]
occupation_dummies <- model.matrix(~ Occupation, data = df)[, -1, drop = FALSE]
bmi_dummies <- model.matrix(~ BMI.Category, data = df)[, -1, drop = FALSE]
colnames(gender_dummies) <- gsub("Gender", "Gender_", colnames(gender_dummies))
colnames(occupation_dummies) <- gsub("Occupation", "Occupation_", colnames(occupation_dummies))
colnames(bmi_dummies) <- gsub("BMI.Category", "BMI.Category_", colnames(bmi_dummies))
colnames(gender_dummies) <- gsub("[() ]", "", colnames(gender_dummies))
colnames(occupation_dummies) <- gsub("[() ]", "", colnames(occupation_dummies))
colnames(bmi_dummies) <- gsub("[() ]", "", colnames(bmi_dummies))
df$Gender <- NULL
df$Occupation <- NULL
df$BMI.Category <- NULL
df <- cbind(df, gender_dummies, occupation_dummies, bmi_dummies)
Untuk menghindari sparsity dan menjaga balance antar kategori, pekerjaan dikelompokkan menjadi bidang kerja yang lebih umum.
Occ_Healthcare
: Doctor, NurseOcc_Public_Service
: Lawyer, TeacherOcc_STEM_Technical
: Engineer, Scientist, Software
EngineerOcc_Business_Sales
(baseline): Accountant, Manager,
SalesRepresentative, Salespersondf$Occ_Healthcare <- with(df, Occupation_Doctor + Occupation_Nurse)
df$Occ_Public_Service <- with(df, Occupation_Lawyer + Occupation_Teacher)
df$Occ_STEM_Technical <- with(df,
Occupation_Engineer + Occupation_Scientist + Occupation_SoftwareEngineer
)
df <- df[, !grepl("^Occupation_", names(df))]
# atur urutan kolom
new_order <- c(
"Age",
"Gender_Male",
grep("^Occ_", names(df), value = TRUE),
grep("^BMI.Category_", names(df), value = TRUE),
"Sleep.Duration",
"Quality.of.Sleep",
"Physical.Activity.Level",
"Stress.Level",
"Heart.Rate",
"Daily.Steps",
"BP_Systolic",
"BP_Diastolic",
"Sleep.Disorder"
)
cols_existing <- names(df)
cols_to_keep <- intersect(new_order, cols_existing)
cols_leftover <- setdiff(cols_existing, cols_to_keep)
df <- df[, c(cols_to_keep, cols_leftover)]
Sleep Disorder
Label kelas Sleep Disorder
dikonversi dari bentuk string
ke dalam tipe data faktor (unordered) agar algoritma klasifikasi seperti
Analisis Diskriminan Linear (LDA) dan Regresi Multinomial dapat
mengenali bahwa target merupakan kelas diskrit (nominal), bukan data
numerik berurutan (ordinal).
df$Sleep.Disorder <- factor(df$Sleep.Disorder, levels = c("None", "Sleep Apnea", "Insomnia"))
table(df$Sleep.Disorder)
##
## None Sleep Apnea Insomnia
## 219 78 77
barplot(table(df$Sleep.Disorder), col = "steelblue", main = "Distribusi Sleep Disorder")
library(corrplot)
## corrplot 0.95 loaded
exclude_prefixes <- c("Gender_", "Occ_", "BMI.Category_")
numeric_df <- df[sapply(df, is.numeric)]
numeric_df <- numeric_df[ , !grepl(paste0("^(", paste(exclude_prefixes, collapse = "|"), ")"), names(numeric_df))]
corrplot::corrplot(cor(numeric_df), method = "color", tl.cex = 1, addCoef.col = "black", number.cex = 0.7)
Berdasarkan hasil visualisasi korelasi, ditemukan bahwa
BP_Systolic
dan BP_Diastolic
memiliki korelasi
tinggi (>0.85). Untuk menghindari masalah multikolinearitas, variabel
BP_Diastolic
dihapus dari dataset. Begitu juga dengan
Sleep.Duration
dan Quality.of.Sleep
yang
berkorelasi tinggi, variabel Quality.of.Sleep
dihapus dari
dataset.
df$BP_Diastolic <- NULL
df$Quality.of.Sleep <- NULL
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
numeric_cols <- c("Age", "Sleep.Duration", "Physical.Activity.Level",
"Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic")
data_numeric <- df[, numeric_cols]
dist_matrix <- dist(scale(data_numeric))
mds_result <- cmdscale(dist_matrix, k = 2)
mds_df <- data.frame(Dim1 = mds_result[,1],
Dim2 = mds_result[,2],
Sleep.Disorder = df$Sleep.Disorder)
ggplot(mds_df, aes(x = Dim1, y = Dim2, color = Sleep.Disorder)) +
geom_point(size = 2, alpha = 0.8) +
labs(title = "MDS Visualization (Euclidean Distance)",
x = "Dimension 1", y = "Dimension 2") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
datatable(df)
library(psych)
describe(df)
## vars n mean sd median trimmed mad min
## Age 1 374 42.18 8.67 43.0 41.84 10.38 27.0
## Gender_Male 2 374 0.51 0.50 1.0 0.51 0.00 0.0
## Occ_Healthcare 3 374 0.39 0.49 0.0 0.36 0.00 0.0
## Occ_Public_Service 4 374 0.23 0.42 0.0 0.17 0.00 0.0
## Occ_STEM_Technical 5 374 0.19 0.39 0.0 0.11 0.00 0.0
## BMI.Category_Overweight 6 374 0.42 0.49 0.0 0.40 0.00 0.0
## Sleep.Duration 7 374 7.13 0.80 7.2 7.12 1.04 5.8
## Physical.Activity.Level 8 374 59.17 20.83 60.0 58.97 22.24 30.0
## Stress.Level 9 374 5.39 1.77 5.0 5.36 2.97 3.0
## Heart.Rate 10 374 69.97 3.57 70.0 69.74 2.97 65.0
## Daily.Steps 11 374 6816.84 1617.92 7000.0 6732.67 1482.60 3000.0
## BP_Systolic 12 374 128.55 7.75 130.0 128.78 7.41 115.0
## Sleep.Disorder* 13 374 1.62 0.81 1.0 1.53 0.00 1.0
## max range skew kurtosis se
## Age 59.0 32.0 0.26 -0.92 0.45
## Gender_Male 1.0 1.0 -0.02 -2.00 0.03
## Occ_Healthcare 1.0 1.0 0.47 -1.78 0.03
## Occ_Public_Service 1.0 1.0 1.26 -0.41 0.02
## Occ_STEM_Technical 1.0 1.0 1.58 0.48 0.02
## BMI.Category_Overweight 1.0 1.0 0.31 -1.91 0.03
## Sleep.Duration 8.5 2.7 0.04 -1.29 0.04
## Physical.Activity.Level 90.0 60.0 0.07 -1.27 1.08
## Stress.Level 8.0 5.0 0.15 -1.33 0.09
## Heart.Rate 78.0 13.0 0.47 -0.39 0.18
## Daily.Steps 10000.0 7000.0 0.18 -0.42 83.66
## BP_Systolic 142.0 27.0 -0.04 -0.91 0.40
## Sleep.Disorder* 3.0 2.0 0.79 -1.01 0.04
if (!require(MVN)) install.packages("MVN")
## Loading required package: MVN
## Warning: package 'MVN' was built under R version 4.4.3
library(MVN)
numeric_cols <- c("Age", "Sleep.Duration", "Physical.Activity.Level", "Stress.Level", "Heart.Rate", "Daily.Steps", "BP_Systolic")
mvn(df[numeric_cols], mvnTest = "hz", univariateTest = "AD", multivariatePlot = "qq")
## $multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 21.37807 0 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling Age 3.9372 <0.001 NO
## 2 Anderson-Darling Sleep.Duration 7.3040 <0.001 NO
## 3 Anderson-Darling Physical.Activity.Level 11.3729 <0.001 NO
## 4 Anderson-Darling Stress.Level 12.6410 <0.001 NO
## 5 Anderson-Darling Heart.Rate 9.1373 <0.001 NO
## 6 Anderson-Darling Daily.Steps 8.9070 <0.001 NO
## 7 Anderson-Darling BP_Systolic 9.0151 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max
## Age 374 42.184492 8.6731335 43.0 27.0 59.0
## Sleep.Duration 374 7.132086 0.7956567 7.2 5.8 8.5
## Physical.Activity.Level 374 59.171123 20.8308037 60.0 30.0 90.0
## Stress.Level 374 5.385027 1.7745264 5.0 3.0 8.0
## Heart.Rate 374 69.965241 3.5672581 70.0 65.0 78.0
## Daily.Steps 374 6816.844920 1617.9156791 7000.0 3000.0 10000.0
## BP_Systolic 374 128.553476 7.7481176 130.0 115.0 142.0
## 25th 75th Skew Kurtosis
## Age 35.25 50.0 0.25516254 -0.9248041
## Sleep.Duration 6.40 7.8 0.03725369 -1.2945281
## Physical.Activity.Level 45.00 75.0 0.07389048 -1.2744695
## Stress.Level 4.00 7.0 0.15309385 -1.3345700
## Heart.Rate 68.00 72.0 0.46555442 -0.3878657
## Daily.Steps 5600.00 8000.0 0.17684985 -0.4186421
## BP_Systolic 125.00 135.0 -0.03538326 -0.9088701
Hasil ini menunjukkan bahwa p-value<0.05 maka tolak H0, artinya populasi tidak berdistribusi normal multivariat, sehingga asumsi normal multivariat tidak terpenuhi
library(biotools)
## Warning: package 'biotools' was built under R version 4.4.3
## Loading required package: MASS
## ---
## biotools version 4.3
box_m_result <- boxM(df[, numeric_cols], grouping = df$Sleep.Disorder)
print(box_m_result)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: df[, numeric_cols]
## Chi-Sq (approx.) = 833.06, df = 56, p-value < 2.2e-16
Hasil ini menunjukkan bahwa p-value<0.05 maka tolak H0, artinya varians tidak homogen, sehingga asumsi homogenitas varians tidak terpenuhi
manova_model <- manova(as.matrix(df[, numeric_cols]) ~ Sleep.Disorder, data = df)
summary(manova_model, test = "Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## Sleep.Disorder 2 0.26577 49.001 14 730 < 2.2e-16 ***
## Residuals 371
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p-value < 0.05 → Tolak H0: Terdapat paling tidak satu kombinasi
linier dari variabel numerik yang dapat membedakan antar kelas
Sleep Disorder
.
anova_results <- sapply(numeric_cols, function(x) {
model <- aov(as.formula(paste(x, "~ Sleep.Disorder")), data = df)
summary(model)[[1]][["Pr(>F)"]][1]
})
anova_df <- data.frame(p_value = anova_results)
anova_df
## p_value
## Age 8.852105e-23
## Sleep.Duration 1.626151e-13
## Physical.Activity.Level 6.306184e-18
## Stress.Level 1.520459e-03
## Heart.Rate 3.510225e-13
## Daily.Steps 7.942397e-11
## BP_Systolic 1.198027e-62
Semua variabel memiliki p-value < 0.05, artinya terdapat perbedaan
rata-rata yang signifikan secara statistik untuk semua variabel numerik
berdasarkan kelompok Sleep Disorder
.
library(MASS)
lda_model <- lda(Sleep.Disorder ~ ., data = df[, c(numeric_cols, "Sleep.Disorder")])
print(lda_model)
## Call:
## lda(Sleep.Disorder ~ ., data = df[, c(numeric_cols, "Sleep.Disorder")])
##
## Prior probabilities of groups:
## None Sleep Apnea Insomnia
## 0.5855615 0.2085561 0.2058824
##
## Group means:
## Age Sleep.Duration Physical.Activity.Level Stress.Level
## None 39.03653 7.358447 57.94977 5.114155
## Sleep Apnea 49.70513 7.032051 74.79487 5.666667
## Insomnia 43.51948 6.589610 46.81818 5.870130
## Heart.Rate Daily.Steps BP_Systolic
## None 69.01826 6852.968 124.0457
## Sleep Apnea 72.44872 7619.231 137.7692
## Insomnia 70.14286 5901.299 132.0390
##
## Coefficients of linear discriminants:
## LD1 LD2
## Age 6.793810e-02 0.0182843850
## Sleep.Duration -8.130318e-01 -1.3462678312
## Physical.Activity.Level 2.210341e-03 0.0227071520
## Stress.Level -3.381003e-01 0.2983645430
## Heart.Rate 2.024337e-01 -0.3276790399
## Daily.Steps 3.755992e-05 -0.0007085623
## BP_Systolic 1.134453e-01 0.0002235590
##
## Proportion of trace:
## LD1 LD2
## 0.7977 0.2023
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
lda_pred <- predict(lda_model, newdata = df[, numeric_cols])
# Confusion matrix
conf_matrix <- confusionMatrix(as.factor(lda_pred$class), as.factor(df$Sleep.Disorder))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction None Sleep Apnea Insomnia
## None 192 7 8
## Sleep Apnea 6 65 6
## Insomnia 21 6 63
##
## Overall Statistics
##
## Accuracy : 0.8556
## 95% CI : (0.8158, 0.8896)
## No Information Rate : 0.5856
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7525
##
## Mcnemar's Test P-Value : 0.1163
##
## Statistics by Class:
##
## Class: None Class: Sleep Apnea Class: Insomnia
## Sensitivity 0.8767 0.8333 0.8182
## Specificity 0.9032 0.9595 0.9091
## Pos Pred Value 0.9275 0.8442 0.7000
## Neg Pred Value 0.8383 0.9562 0.9507
## Prevalence 0.5856 0.2086 0.2059
## Detection Rate 0.5134 0.1738 0.1684
## Detection Prevalence 0.5535 0.2059 0.2406
## Balanced Accuracy 0.8900 0.8964 0.8636
Untuk mengetahui variabel mana yang berpengaruh terhadap gangguan
tidur (Sleep Disorder
), salah satu caranya adalah dengan
melihat plot antara fungsi diskriminan.
plot(lda_model, col = as.integer(df[, c(numeric_cols, "Sleep.Disorder")]$Sleep.Disorder))
Dilihat dari plot diatas, dapat dikatakan model mampu mengelompokkan
data dengan cukup baik walaupun terdapat overlap pada ketiga kategori.
Dapat dilihat pula fungsi diskriminan LD1 berperan besar dalam
membedakan antara kategori Sleep Disorder
, sedangkan fungsi
diskriminan LD2 tidak berperan besar dalam membedakan kategori
Sleep Disorder
.
if (!require(nnet)) install.packages("nnet")
## Loading required package: nnet
library(nnet)
model <- multinom(Sleep.Disorder ~ Age + Gender_Male + Occ_Healthcare + Occ_Public_Service + Occ_STEM_Technical + BMI.Category_Overweight + Sleep.Duration + Physical.Activity.Level + Stress.Level + Heart.Rate + Daily.Steps + BP_Systolic, data = df)
## # weights: 42 (26 variable)
## initial value 410.880996
## iter 10 value 226.286744
## iter 20 value 138.268814
## iter 30 value 132.765309
## iter 40 value 124.569961
## iter 50 value 123.838692
## final value 123.761033
## converged
summary(model)
## Call:
## multinom(formula = Sleep.Disorder ~ Age + Gender_Male + Occ_Healthcare +
## Occ_Public_Service + Occ_STEM_Technical + BMI.Category_Overweight +
## Sleep.Duration + Physical.Activity.Level + Stress.Level +
## Heart.Rate + Daily.Steps + BP_Systolic, data = df)
##
## Coefficients:
## (Intercept) Age Gender_Male Occ_Healthcare
## Sleep Apnea -96.11430 -0.2312524 -2.0840663 -2.136957
## Insomnia -19.70363 0.1894021 -0.2624637 -5.113922
## Occ_Public_Service Occ_STEM_Technical BMI.Category_Overweight
## Sleep Apnea -2.376695 -2.882580 3.105087
## Insomnia -1.621694 -2.369454 -1.818253
## Sleep.Duration Physical.Activity.Level Stress.Level Heart.Rate
## Sleep Apnea 3.607737 -0.05500125 0.516683 0.30719542
## Insomnia -2.724603 0.04208792 0.313132 -0.08586376
## Daily.Steps BP_Systolic
## Sleep Apnea 0.0007353121 0.4154133
## Insomnia -0.0013329768 0.3361679
##
## Std. Errors:
## (Intercept) Age Gender_Male Occ_Healthcare
## Sleep Apnea 0.002914060 0.05162050 0.008273096 0.004494686
## Insomnia 0.002918059 0.04383869 0.007137535 0.017680286
## Occ_Public_Service Occ_STEM_Technical BMI.Category_Overweight
## Sleep Apnea 0.01095502 0.004210419 0.02202583
## Insomnia 0.01804571 0.014221082 0.03066443
## Sleep.Duration Physical.Activity.Level Stress.Level Heart.Rate
## Sleep Apnea 0.07299044 0.02265407 0.09326578 0.07485071
## Insomnia 0.06437371 0.01875402 0.08458350 0.05845578
## Daily.Steps BP_Systolic
## Sleep Apnea 0.0002303539 0.05500110
## Insomnia 0.0002636961 0.04075037
##
## Residual Deviance: 247.5221
## AIC: 299.5221
model_null <- multinom(Sleep.Disorder ~ 1, data = df, trace = FALSE)
anova_result <- anova(model_null, model, test = "Chisq")
print(anova_result)
## Likelihood ratio tests of Multinomial Models
##
## Response: Sleep.Disorder
## Model
## 1 1
## 2 Age + Gender_Male + Occ_Healthcare + Occ_Public_Service + Occ_STEM_Technical + BMI.Category_Overweight + Sleep.Duration + Physical.Activity.Level + Stress.Level + Heart.Rate + Daily.Steps + BP_Systolic
## Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 746 722.3373
## 2 722 247.5221 1 vs 2 24 474.8152 0
Berdasarkan uji Likelihood Ratio, terdapat perbedaan yang signifikan antara model hanya dengan intercept dan model penuh (χ²(24) = 474.81, p < 0.001). Hal ini menunjukkan bahwa variabel-variabel prediktor secara bersama-sama berkontribusi signifikan dalam memprediksi gangguan tidur.
model_summary <- summary(model)
coefs <- model_summary$coefficients
std_err <- model_summary$standard.errors
z_values <- coefs / std_err
p_values <- 2 * (1 - pnorm(abs(z_values)))
print(round(p_values, 4))
## (Intercept) Age Gender_Male Occ_Healthcare Occ_Public_Service
## Sleep Apnea 0 0 0 0 0
## Insomnia 0 0 0 0 0
## Occ_STEM_Technical BMI.Category_Overweight Sleep.Duration
## Sleep Apnea 0 0 0
## Insomnia 0 0 0
## Physical.Activity.Level Stress.Level Heart.Rate Daily.Steps
## Sleep Apnea 0.0152 0e+00 0.0000 0.0014
## Insomnia 0.0248 2e-04 0.1419 0.0000
## BP_Systolic
## Sleep Apnea 0
## Insomnia 0
Berdasarkan uji parsial, ditemukan bahwa sebagian besar variabel
menghasilkan p-value < 0.05 yang berarti sebagian besar prediktor
berkontribusi signifikan secara statistik terhadap probabilitas individu
mengalami gangguan tidur, kecuali variabel Heart Rate
pada
kelas Insomnia (p = 0.1419).
mlr_pred <- predict(model, newdata = df)
library(caret)
conf_matrix <- confusionMatrix(mlr_pred, df$Sleep.Disorder)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction None Sleep Apnea Insomnia
## None 208 6 8
## Sleep Apnea 5 69 4
## Insomnia 6 3 65
##
## Overall Statistics
##
## Accuracy : 0.9144
## 95% CI : (0.8814, 0.9407)
## No Information Rate : 0.5856
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8494
##
## Mcnemar's Test P-Value : 0.9146
##
## Statistics by Class:
##
## Class: None Class: Sleep Apnea Class: Insomnia
## Sensitivity 0.9498 0.8846 0.8442
## Specificity 0.9097 0.9696 0.9697
## Pos Pred Value 0.9369 0.8846 0.8784
## Neg Pred Value 0.9276 0.9696 0.9600
## Prevalence 0.5856 0.2086 0.2059
## Detection Rate 0.5561 0.1845 0.1738
## Detection Prevalence 0.5936 0.2086 0.1979
## Balanced Accuracy 0.9297 0.9271 0.9069
data.frame(exp(summary(model)$coefficients))
## X.Intercept. Age Gender_Male Occ_Healthcare
## Sleep Apnea 1.811709e-42 0.7935392 0.1244232 0.118013450
## Insomnia 2.772189e-09 1.2085268 0.7691543 0.006012455
## Occ_Public_Service Occ_STEM_Technical BMI.Category_Overweight
## Sleep Apnea 0.09285694 0.05599012 22.311167
## Insomnia 0.19756379 0.09353182 0.162309
## Sleep.Duration Physical.Activity.Level Stress.Level Heart.Rate
## Sleep Apnea 36.88248465 0.946484 1.676458 1.3596066
## Insomnia 0.06557226 1.042986 1.367702 0.9177193
## Daily.Steps BP_Systolic
## Sleep Apnea 1.0007356 1.514997
## Insomnia 0.9986679 1.399574
Berikut adalah interpretasi berdasarkan hasil odds ratio:
A. Sleep Apnea
Occ_Healthcare
(Doctor,
Nurse), Occ_Public_Service
(Lawyer, Teacher), dan
Occ_STEM_Technical
(Engineer, Software Engineer, Scientist)
memiliki odds mengalami Sleep Apnea yang 90% lebih rendah dibanding
pekerjaan Occ_Business_Sales
(Accountant, Manager,
SalesRepresentative, Salesperson) yang menjadi baseline.BMI Overweight
meningkatkan risiko Sleep Apnea sebesar 22
kali dibanding BMI Normal
. Individu dengan
Heart Rate
dan BP_Systolic
tinggi memiliki
odds sekitar 35% dan 51% lebih tinggi mengalami Sleep Apnea.Physical Activity Level
tinggi memiliki odds mengalami
Sleep Apnea yang 6% lebih rendah. Individu dengan
Stress Level
tinggi memiliki odds sebesar 67% lebih tinggi
mengalami Sleep Apnea.B. Insomnia
Occ_Healthcare
(Doctor,
Nurse), Occ_Public_Service
(Lawyer, Teacher), dan
Occ_STEM_Technical
(Engineer, Software Engineer, Scientist)
memiliki odds mengalami Insomnia sekitar 80-99% lebih rendah dibanding
pekerjaan Occ_Business_Sales
(Accountant, Manager, Sales
Representative, Salesperson) yang menjadi baseline.BMI Overweight
memiliki odds mengalami Insomnia lebih
rendah sebesar 84% dibanding BMI Normal
. Individu dengan
BP_Systolic
tinggi memiliki odds sekitar 39% lebih tinggi
mengalami Insomnia.Sleep Duration
tinggi memiliki odds mengalami Insomnia yang
94% lebih rendah. Individu dengan Stress Level
tinggi
memiliki odds sebesar 36% lebih tinggi mengalami Insomnia.