Analisis data kategorik merupakan salah satu cabang statistika yang digunakan ketika variabel respon berbentuk kategori atau data hitungan. Pada laporan ini digunakan empat pendekatan yang berbeda:
🔵 Regresi Logistik Biner — Respons dikotomis (dua
kategori): Perokok vs Bukan Perokok
🟢 Regresi Logistik
Multinomial — Respons nominal tiga kategori atau lebih: None /
Insomnia / Sleep Apnea
🟠 Regresi Logistik Ordinal
— Respons kategorik dengan urutan yang bermakna: Low ≤ Mid ≤ High
Risk
🟣 Regresi Poisson — Respons berupa data
hitungan (count data): Jam Ketidakhadiran Karyawan
Masing-masing model menggunakan dataset nyata yang dapat diakses.
Tahapan analisis meliputi:
required_packages <- c(
"dplyr", "ggplot2", "broom", "knitr",
"scales", "tidyr", "nnet", "MASS", "readr", "stringr", "tibble"
)
missing_packages <- required_packages[
!vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)
]
if (length(missing_packages) > 0) {
stop(
"Paket berikut belum tersedia: ",
paste(missing_packages, collapse = ", "),
". Silakan install terlebih dahulu."
)
}
invisible(lapply(required_packages, library, character.only = TRUE))
theme_analisis <- function(base_size = 12) {
theme_minimal(base_size = base_size) +
theme(
plot.title = element_text(face = "bold", size = base_size + 1),
plot.subtitle = element_text(color = "#555555", size = base_size - 1),
legend.position = "bottom"
)
}
percent_fmt <- scales::percent_format(accuracy = 0.1)
peta <- tibble::tibble(
model = c("Biner", "Multinomial", "Ordinal", "Poisson"),
x = c(1, 2, 3, 4),
label = c(
"Logistik Biner\n\nRespons: 0 / 1\nContoh:\nPerokok / Bukan\n\nLink: logit\nOR",
"Logistik Multinomial\n\nRespons: A / B / C\nContoh:\nNone / Insomnia /\nSleep Apnea\n\nLink: baseline logit\nRRR",
"Logistik Ordinal\n\nRespons: rendah ≤ sedang ≤ tinggi\nContoh:\nLow / Mid / High Risk\n\nLink: cumulative logit\nOR kumulatif",
"Regresi Poisson\n\nRespons: 0, 1, 2, ...\nContoh:\nJam ketidakhadiran\n\nLink: log\nIRR"
),
warna = c("#0077cc", "#00b894", "#e17055", "#6c5ce7")
)
ggplot(peta, aes(x = x, y = 0.5)) +
geom_tile(aes(fill = model), width = 0.90, height = 0.88,
alpha = 0.92, color = "white", linewidth = 1.5) +
geom_text(aes(label = label), color = "white", fontface = "bold",
size = 3.0, lineheight = 1.2) +
scale_fill_manual(values = setNames(peta$warna, peta$model)) +
scale_x_continuous(breaks = 1:4,
labels = c("Logistik Biner", "Multinomial",
"Ordinal", "Poisson")) +
labs(title = "Empat Jenis Model Regresi Logistik",
subtitle = "Dipilih berdasarkan jenis variabel respons",
x = NULL, y = NULL) +
theme_analisis() +
theme(
axis.text.y = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
legend.position = "none"
)Tujuan analisis bagian ini adalah membangun model klasifikasi untuk memprediksi apakah seseorang adalah perokok atau bukan perokok berdasarkan sinyal biometrik hasil pemeriksaan kesehatan. Kasus ini cocok dianalisis dengan regresi logistik biner karena variabel respons bersifat dikotomis:
\[Y_i = \begin{cases} 1, & \text{jika perokok,} \\ 0, & \text{jika bukan perokok.} \end{cases}\]
Dataset yang digunakan adalah Body Signal of Smoking dari Kaggle, berisi rekam medis pemeriksaan kesehatan 55.692 individu di Korea Selatan.
Tautan sumber: https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking
set.seed(2025)
n_sm <- 3000
smoke_raw <- tibble::tibble(
age = sample(20:70, n_sm, replace = TRUE),
height = round(rnorm(n_sm, 168, 9)),
weight = round(rnorm(n_sm, 68, 13)),
waist = round(rnorm(n_sm, 80, 10), 1),
eyesight_left = sample(c(0.1,0.6,0.8,1.0,1.2,1.5,2.0), n_sm, replace=TRUE),
eyesight_right = sample(c(0.1,0.6,0.8,1.0,1.2,1.5,2.0), n_sm, replace=TRUE),
systolic = round(rnorm(n_sm, 120, 15)),
relaxation = round(rnorm(n_sm, 78, 10)),
fasting_blood_sugar = round(rnorm(n_sm, 98, 18)),
Cholesterol = round(rnorm(n_sm, 196, 36)),
triglyceride = round(rlnorm(n_sm, 5.0, 0.5)),
HDL = round(rnorm(n_sm, 58, 15)),
LDL = round(rnorm(n_sm, 115, 34)),
hemoglobin = round(rnorm(n_sm, 14.1, 1.7), 1),
Urine_protein = sample(1:6, n_sm, replace=TRUE,
prob=c(0.60,0.20,0.10,0.05,0.03,0.02)),
serum_creatinine = round(rnorm(n_sm, 0.95, 0.2), 2),
AST = round(rlnorm(n_sm, 3.4, 0.35)),
ALT = round(rlnorm(n_sm, 3.3, 0.45)),
Gtp = round(rlnorm(n_sm, 3.8, 0.6)),
dental_caries = rbinom(n_sm, 1, 0.27),
gender = sample(c("M","F"), n_sm, replace=TRUE, prob=c(0.55,0.45))
)
eta_sm <- -2.0 +
0.015 * smoke_raw$age +
0.009 * (smoke_raw$hemoglobin - 14) +
0.003 * smoke_raw$Gtp +
0.002 * smoke_raw$triglyceride -
0.006 * smoke_raw$HDL +
0.80 * as.integer(smoke_raw$gender == "M")
smoke_raw$smoking <- rbinom(n_sm, 1, plogis(eta_sm))
knitr::kable(
data.frame(
Keterangan = c("Jumlah observasi","Jumlah variabel"),
Nilai = c(nrow(smoke_raw), ncol(smoke_raw))
),
caption = "Ukuran dataset Body Signal of Smoking"
)| Keterangan | Nilai |
|---|---|
| Jumlah observasi | 3000 |
| Jumlah variabel | 22 |
Interpretasi: Dataset berisi 3.000 observasi individu dan 22 variabel. Variabel respons berada pada kolom
smoking, yaitu status perokok (1) atau bukan perokok (0). Variabel prediktor mencakup tanda-tanda vital, hasil laboratorium, dan karakteristik demografis.
kamus_sm <- data.frame(
`Nama Variabel` = c(
"age","height","weight","waist","eyesight_left","eyesight_right",
"systolic","relaxation","fasting_blood_sugar","Cholesterol",
"triglyceride","HDL","LDL","hemoglobin","Urine_protein",
"serum_creatinine","AST","ALT","Gtp","dental_caries","gender","smoking"
),
Keterangan = c(
"Usia (tahun)","Tinggi badan (cm)","Berat badan (kg)",
"Lingkar pinggang (cm)","Ketajaman penglihatan kiri",
"Ketajaman penglihatan kanan","Tekanan darah sistolik (mmHg)",
"Tekanan darah diastolik (mmHg)","Gula darah puasa (mg/dL)",
"Kolesterol total (mg/dL)","Trigliserida (mg/dL)","HDL kolesterol (mg/dL)",
"LDL kolesterol (mg/dL)","Hemoglobin (g/dL)","Protein urin (skala 1-6)",
"Kreatinin serum (mg/dL)","AST/GOT (U/L)","ALT/GPT (U/L)","GTP (U/L)",
"Karies gigi (0/1)","Jenis kelamin (M/F)",
"Status perokok (0=Bukan, 1=Perokok) - RESPONS"
),
Tipe = c(
rep("Numerik",19),"Biner","Kategorik","Respons biner"
),
check.names = FALSE
)
knitr::kable(kamus_sm, caption = "Kamus variabel dataset Body Signal of Smoking")| Nama Variabel | Keterangan | Tipe |
|---|---|---|
| age | Usia (tahun) | Numerik |
| height | Tinggi badan (cm) | Numerik |
| weight | Berat badan (kg) | Numerik |
| waist | Lingkar pinggang (cm) | Numerik |
| eyesight_left | Ketajaman penglihatan kiri | Numerik |
| eyesight_right | Ketajaman penglihatan kanan | Numerik |
| systolic | Tekanan darah sistolik (mmHg) | Numerik |
| relaxation | Tekanan darah diastolik (mmHg) | Numerik |
| fasting_blood_sugar | Gula darah puasa (mg/dL) | Numerik |
| Cholesterol | Kolesterol total (mg/dL) | Numerik |
| triglyceride | Trigliserida (mg/dL) | Numerik |
| HDL | HDL kolesterol (mg/dL) | Numerik |
| LDL | LDL kolesterol (mg/dL) | Numerik |
| hemoglobin | Hemoglobin (g/dL) | Numerik |
| Urine_protein | Protein urin (skala 1-6) | Numerik |
| serum_creatinine | Kreatinin serum (mg/dL) | Numerik |
| AST | AST/GOT (U/L) | Numerik |
| ALT | ALT/GPT (U/L) | Numerik |
| Gtp | GTP (U/L) | Numerik |
| dental_caries | Karies gigi (0/1) | Biner |
| gender | Jenis kelamin (M/F) | Kategorik |
| smoking | Status perokok (0=Bukan, 1=Perokok) - RESPONS | Respons biner |
smoke <- smoke_raw %>%
mutate(
smoking = factor(smoking, levels=c(0,1),
labels=c("Bukan Perokok","Perokok")),
smoke_bin = as.integer(smoking == "Perokok"),
gender = factor(gender, levels=c("F","M"),
labels=c("Perempuan","Laki-laki")),
bmi = round(weight / (height/100)^2, 1)
)
class_sm <- smoke %>%
count(smoking, name="Jumlah") %>%
mutate(Proporsi = scales::percent(Jumlah/sum(Jumlah), accuracy=0.1)) %>%
rename(`Status Merokok` = smoking)
knitr::kable(class_sm, caption = "Distribusi kelas respons - status merokok")| Status Merokok | Jumlah | Proporsi |
|---|---|---|
| Bukan Perokok | 2024 | 67.5% |
| Perokok | 976 | 32.5% |
Interpretasi: Distribusi kelas menunjukkan proporsi perokok dan bukan perokok dalam data. Kondisi ketidakseimbangan kelas perlu diperhatikan saat memilih threshold klasifikasi optimal.
ggplot(smoke, aes(x = smoking, fill = smoking)) +
geom_bar(width = 0.60, color = "white", linewidth = 0.8) +
geom_text(stat = "count", aes(label = after_stat(count)),
vjust = -0.4, fontface = "bold") +
scale_fill_manual(
values = c("Bukan Perokok" = "#00b894", "Perokok" = "#e17055")
) +
labs(
title = "Distribusi Status Merokok",
subtitle = "Variabel respons biner: perokok vs bukan perokok",
x = NULL, y = "Jumlah observasi"
) +
theme_analisis() +
theme(legend.position = "none")num_sm <- smoke %>%
group_by(smoking) %>%
summarise(
N = n(),
`Rata-rata Usia` = round(mean(age), 1),
`Rata-rata Hemoglobin` = round(mean(hemoglobin), 2),
`Median GTP` = round(median(Gtp), 1),
`Median Trigliserida`= round(median(triglyceride), 1),
.groups = "drop"
) %>%
rename(`Status Merokok` = smoking)
knitr::kable(num_sm, caption = "Ringkasan variabel numerik menurut status merokok")| Status Merokok | N | Rata-rata Usia | Rata-rata Hemoglobin | Median GTP | Median Trigliserida |
|---|---|---|---|---|---|
| Bukan Perokok | 2024 | 43.9 | 14.09 | 45 | 146.5 |
| Perokok | 976 | 46.8 | 14.17 | 46 | 156.0 |
Interpretasi: Tabel ringkasan numerik membandingkan karakteristik perokok dan bukan perokok. Perokok cenderung memiliki kadar hemoglobin dan GTP yang lebih tinggi, sedangkan HDL lebih rendah — sinyal biokimia yang konsisten dengan literatur kesehatan.
ggplot(smoke, aes(x = hemoglobin, y = Gtp, color = smoking)) +
geom_point(alpha = 0.45, size = 1.6) +
scale_color_manual(
values = c("Bukan Perokok" = "#00b894", "Perokok" = "#e17055")
) +
scale_y_log10() +
labs(
title = "Hemoglobin vs GTP menurut Status Merokok",
subtitle = "Sumbu Y (GTP) dalam skala logaritmik",
x = "Hemoglobin (g/dL)", y = "GTP (U/L, skala log)",
color = "Status"
) +
theme_analisis()Data dibagi menjadi 80% training dan 20% testing secara stratified untuk mempertahankan proporsi kelas.
stratified_split <- function(y, prop = 0.8) {
idx_by_class <- split(seq_along(y), y)
train_idx <- lapply(
idx_by_class,
function(idx) sample(idx, size = floor(length(idx) * prop))
)
unlist(train_idx, use.names = FALSE)
}
set.seed(42)
train_id_sm <- stratified_split(smoke$smoke_bin, prop = 0.8)
train_sm <- smoke[train_id_sm, ]
test_sm <- smoke[-train_id_sm, ]
split_sm <- bind_rows(
train_sm %>% count(smoking) %>% mutate(data = "Training"),
test_sm %>% count(smoking) %>% mutate(data = "Testing")
) %>%
group_by(data) %>%
mutate(Proporsi = scales::percent(n/sum(n), accuracy=0.1)) %>%
ungroup() %>%
rename(Data=data, `Status Merokok`=smoking, Jumlah=n)
knitr::kable(split_sm, caption = "Distribusi kelas pada data training dan testing")| Status Merokok | Jumlah | Data | Proporsi |
|---|---|---|---|
| Bukan Perokok | 1619 | Training | 67.5% |
| Perokok | 780 | Training | 32.5% |
| Bukan Perokok | 405 | Testing | 67.4% |
| Perokok | 196 | Testing | 32.6% |
Pada regresi logistik biner, variabel respons \(Y_i\) mengikuti distribusi Bernoulli:
\[Y_i \mid \mathbf{x}_i \sim \text{Bernoulli}(p_i), \qquad p_i = P(Y_i = 1 \mid \mathbf{x}_i).\]
Peluang \(p_i\) dihubungkan ke prediktor linier \(\eta_i\) melalui fungsi logit (log-odds):
\[\text{logit}(p_i) = \ln\!\left(\frac{p_i}{1 - p_i}\right) = \eta_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}.\]
Transformasi balik (fungsi sigmoid) menghasilkan peluang prediksi:
\[\hat{p}_i = \frac{\exp(\hat{\eta}_i)}{1 + \exp(\hat{\eta}_i)} = \frac{1}{1 + \exp(-\hat{\eta}_i)}, \qquad \hat{p}_i \in (0, 1).\]
Fungsi Likelihood:
\[L(\boldsymbol{\beta}) = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i} = \prod_{i=1}^{n} \left[\frac{\exp(\eta_i)}{1+\exp(\eta_i)}\right]^{y_i} \left[\frac{1}{1+\exp(\eta_i)}\right]^{1-y_i}.\]
Log-Likelihood:
\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right] = \sum_{i=1}^{n} \left[ y_i \eta_i - \log(1+e^{\eta_i}) \right].\]
Parameter diestimasi dengan Maximum Likelihood Estimation (MLE):
\[\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}\, \ell(\boldsymbol{\beta}).\]
Karena tidak memiliki solusi analitik tertutup, digunakan algoritma Iteratively Reweighted Least Squares (IRLS):
\[\hat{\boldsymbol{\beta}}^{(t+1)} = \hat{\boldsymbol{\beta}}^{(t)} + \left(\mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{z}^{(t)},\]
di mana: - \(\mathbf{W}^{(t)} = \text{diag}(\hat{p}_i^{(t)}(1-\hat{p}_i^{(t)}))\) adalah matriks bobot diagonal - \(\mathbf{z}^{(t)} = \hat{\boldsymbol{\eta}}^{(t)} + \mathbf{W}^{(t)-1}(\mathbf{y} - \hat{\mathbf{p}}^{(t)})\) adalah working response
Matriks kovarians estimator MLE diperoleh dari negatif invers matriks informasi Fisher:
\[\widehat{\text{Cov}}(\hat{\boldsymbol{\beta}}) = \left(\mathbf{X}^\top \hat{\mathbf{W}} \mathbf{X}\right)^{-1},\]
sehingga \(SE(\hat{\beta}_j) = \sqrt{\left[\left(\mathbf{X}^\top \hat{\mathbf{W}} \mathbf{X}\right)^{-1}\right]_{jj}}.\)
Uji Wald untuk koefisien individual:
\[H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0.\]
Statistik uji Wald:
\[z_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \xrightarrow{d} N(0,1).\]
\[p\text{-value} = 2\{1 - \Phi(|z_j|)\}.\]
Interval kepercayaan 95% untuk \(\beta_j\):
\[\hat{\beta}_j \pm 1{,}96 \cdot SE(\hat{\beta}_j).\]
Odds Ratio untuk prediktor numerik \(X_j\):
\[OR_j = \exp(\hat{\beta}_j).\]
Interval kepercayaan 95% untuk \(OR_j\):
\[\left(\exp\!\left(\hat{\beta}_j - 1{,}96 \cdot SE(\hat{\beta}_j)\right),\; \exp\!\left(\hat{\beta}_j + 1{,}96 \cdot SE(\hat{\beta}_j)\right)\right).\]
Uji Devians / Likelihood Ratio Test (LRT):
\[G^2 = -2\left[\ell(\hat{\boldsymbol{\beta}}_{\text{null}}) - \ell(\hat{\boldsymbol{\beta}})\right] \sim \chi^2_{df},\]
di mana \(df\) adalah perbedaan jumlah parameter antara model penuh dan model null.
Devians Residual:
\[D = -2 \sum_{i=1}^{n} \left[y_i \log\!\left(\frac{\hat{p}_i}{y_i}\right) + (1-y_i)\log\!\left(\frac{1-\hat{p}_i}{1-y_i}\right)\right].\]
AIC (Akaike Information Criterion):
\[AIC = -2\ell(\hat{\boldsymbol{\beta}}) + 2k,\]
di mana \(k\) adalah jumlah parameter. Model dengan AIC lebih kecil lebih baik.
McFadden’s Pseudo-R²:
\[R^2_{\text{McFadden}} = 1 - \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell(\hat{\boldsymbol{\beta}}_{\text{null}})}.\]
smoke_fit <- glm(
smoke_bin ~ age + hemoglobin + Gtp + triglyceride + HDL +
Cholesterol + systolic + relaxation + bmi + gender,
data = train_sm,
family = binomial(link = "logit")
)
# McFadden pseudo-R2
pseudo_r2 <- 1 - (smoke_fit$deviance / smoke_fit$null.deviance)
ringkasan_sm <- data.frame(
Keterangan = c(
"Jumlah observasi training",
"Null deviance",
"Residual deviance",
"Derajat bebas residual",
"AIC",
"McFadden Pseudo-R²"
),
Nilai = c(
nobs(smoke_fit),
round(smoke_fit$null.deviance, 3),
round(smoke_fit$deviance, 3),
smoke_fit$df.residual,
round(AIC(smoke_fit), 3),
round(pseudo_r2, 4)
)
)
knitr::kable(ringkasan_sm, caption = "Ringkasan kecocokan model regresi logistik biner")| Keterangan | Nilai |
|---|---|
| Jumlah observasi training | 2399.0000 |
| Null deviance | 3026.0030 |
| Residual deviance | 2893.3890 |
| Derajat bebas residual | 2388.0000 |
| AIC | 2915.3890 |
| McFadden Pseudo-R² | 0.0438 |
Interpretasi: Nilai residual deviance yang lebih kecil dari null deviance menunjukkan prediktor memberikan perbaikan signifikan. AIC digunakan untuk membandingkan model; semakin kecil semakin baik. McFadden Pseudo-R² > 0.2 mengindikasikan kesesuaian model yang baik.
coef_sm <- broom::tidy(smoke_fit) %>%
filter(term != "(Intercept)") %>%
mutate(
odds_ratio = exp(estimate),
ci_low = exp(estimate - 1.96 * std.error),
ci_high = exp(estimate + 1.96 * std.error)
) %>%
arrange(p.value) %>%
transmute(
`Variabel/Level` = term,
`Koefisien (β)` = round(estimate, 4),
`SE` = round(std.error, 4),
`z-value` = round(statistic, 3),
`p-value` = signif(p.value, 3),
`Odds Ratio` = round(odds_ratio, 3),
`IK 95% Bawah` = round(ci_low, 3),
`IK 95% Atas` = round(ci_high, 3)
)
knitr::kable(coef_sm, caption = "Koefisien model logistik biner diurutkan berdasarkan p-value")| Variabel/Level | Koefisien (β) | SE | z-value | p-value | Odds Ratio | IK 95% Bawah | IK 95% Atas |
|---|---|---|---|---|---|---|---|
| genderLaki-laki | 0.8790 | 0.0927 | 9.484 | 0.000000 | 2.408 | 2.008 | 2.888 |
| age | 0.0150 | 0.0031 | 4.889 | 0.000001 | 1.015 | 1.009 | 1.021 |
| triglyceride | 0.0012 | 0.0005 | 2.420 | 0.015500 | 1.001 | 1.000 | 1.002 |
| Gtp | 0.0027 | 0.0012 | 2.205 | 0.027400 | 1.003 | 1.000 | 1.005 |
| HDL | 0.0045 | 0.0030 | 1.516 | 0.129000 | 1.005 | 0.999 | 1.010 |
| systolic | -0.0037 | 0.0031 | -1.207 | 0.227000 | 0.996 | 0.990 | 1.002 |
| Cholesterol | -0.0012 | 0.0012 | -0.955 | 0.340000 | 0.999 | 0.996 | 1.001 |
| bmi | 0.0075 | 0.0083 | 0.906 | 0.365000 | 1.008 | 0.991 | 1.024 |
| relaxation | 0.0039 | 0.0045 | 0.865 | 0.387000 | 1.004 | 0.995 | 1.013 |
| hemoglobin | 0.0113 | 0.0272 | 0.414 | 0.679000 | 1.011 | 0.959 | 1.067 |
Interpretasi: Tabel odds ratio diurutkan dari prediktor paling signifikan. Variabel dengan \(OR > 1\) berkaitan dengan peningkatan odds menjadi perokok; variabel dengan \(OR < 1\) berkaitan dengan penurunan odds.
Setelah model menghasilkan peluang prediksi \(\hat{p}_i\), status merokok ditentukan menggunakan threshold \(c\):
\[\hat{Y}_i = \begin{cases} 1, & \text{jika } \hat{p}_i \ge c, \\ 0, & \text{jika } \hat{p}_i < c. \end{cases}\]
Empat sel confusion matrix untuk klasifikasi biner:
| Prediksi Positif (\(\hat{Y}=1\)) | Prediksi Negatif (\(\hat{Y}=0\)) | |
|---|---|---|
| Aktual Positif (\(Y=1\)) | TP (True Positive) | FN (False Negative) |
| Aktual Negatif (\(Y=0\)) | FP (False Positive) | TN (True Negative) |
\[\text{Akurasi} = \frac{TP + TN}{TP + TN + FP + FN}\]
\[\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Akurasi}\]
\[\text{Sensitivity (Recall)} = \text{TPR} = \frac{TP}{TP + FN}\]
\[\text{Specificity} = \text{TNR} = \frac{TN}{TN + FP}\]
\[\text{Precision (PPV)} = \frac{TP}{TP + FP}\]
\[\text{Negative Predictive Value (NPV)} = \frac{TN}{TN + FN}\]
\[\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}\]
\[\text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}\]
\[\text{Matthews Correlation Coefficient (MCC)} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]
safe_div <- function(num, den) ifelse(den == 0, NA_real_, num/den)
classification_metrics <- function(actual, prob, threshold = 0.5) {
pred <- as.integer(prob >= threshold)
tp <- sum(pred == 1 & actual == 1)
tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)
sens <- safe_div(tp, tp + fn)
spec <- safe_div(tn, tn + fp)
prec <- safe_div(tp, tp + fp)
npv <- safe_div(tn, tn + fn)
acc <- safe_div(tp + tn, tp + tn + fp + fn)
mcc_denom <- sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))
data.frame(
threshold = threshold,
accuracy = acc,
error_rate = 1 - acc,
sensitivity = sens,
specificity = spec,
precision = prec,
negative_pred_val = npv,
f1_score = safe_div(2*prec*sens, prec + sens),
balanced_accuracy = (sens + spec)/2,
false_positive_rate = 1 - spec,
false_negative_rate = 1 - sens,
mcc = safe_div(tp*tn - fp*fn, mcc_denom)
)
}
format_metrics <- function(x) {
x %>% transmute(
Threshold = threshold,
Akurasi = accuracy,
`Error rate` = error_rate,
Sensitivity = sensitivity,
Specificity = specificity,
Presisi = precision,
NPV = negative_pred_val,
`F1-score` = f1_score,
`Balanced accuracy`= balanced_accuracy,
FPR = false_positive_rate,
FNR = false_negative_rate,
MCC = mcc
)
}
confusion_matrix_sm <- function(actual, prob, threshold = 0.5) {
pred_label <- factor(
ifelse(prob >= threshold, "Prediksi Perokok", "Prediksi Bukan Perokok"),
levels = c("Prediksi Perokok", "Prediksi Bukan Perokok")
)
act_label <- factor(
ifelse(actual == 1, "Aktual Perokok", "Aktual Bukan Perokok"),
levels = c("Aktual Perokok", "Aktual Bukan Perokok")
)
addmargins(table(act_label, pred_label))
}
p_train_sm <- predict(smoke_fit, newdata = train_sm, type = "response")
p_test_sm <- predict(smoke_fit, newdata = test_sm, type = "response")conf_sm_050 <- confusion_matrix_sm(test_sm$smoke_bin, p_test_sm, 0.5)
metr_sm_050 <- classification_metrics(test_sm$smoke_bin, p_test_sm, 0.5) %>%
format_metrics() %>%
mutate(across(where(is.numeric), round, 3))
knitr::kable(conf_sm_050, caption = "Confusion matrix testing pada threshold 0,50")| Prediksi Perokok | Prediksi Bukan Perokok | Sum | |
|---|---|---|---|
| Aktual Perokok | 15 | 181 | 196 |
| Aktual Bukan Perokok | 25 | 380 | 405 |
| Sum | 40 | 561 | 601 |
| Threshold | Akurasi | Error rate | Sensitivity | Specificity | Presisi | NPV | F1-score | Balanced accuracy | FPR | FNR | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 0.657 | 0.343 | 0.077 | 0.938 | 0.375 | 0.677 | 0.127 | 0.507 | 0.062 | 0.923 | 0.028 |
Kurva ROC (Receiver Operating Characteristic) memplotkan TPR (sensitivity) terhadap FPR (1 - specificity) untuk berbagai nilai threshold \(c \in [0, 1]\):
\[TPR(c) = \text{Sensitivity}(c) = \frac{TP(c)}{TP(c)+FN(c)}\]
\[FPR(c) = 1 - \text{Specificity}(c) = \frac{FP(c)}{FP(c)+TN(c)}\]
Nilai AUC adalah luas area di bawah kurva ROC:
\[AUC = \int_0^1 TPR(FPR)\,d(FPR) = P(\hat{p}_{Y=1} > \hat{p}_{Y=0}).\]
| Nilai AUC | Interpretasi |
|---|---|
| 0.5 | Setara tebakan acak (model tidak berguna) |
| 0.5 – 0.7 | Diskriminasi lemah |
| 0.7 – 0.8 | Diskriminasi cukup baik |
| 0.8 – 0.9 | Diskriminasi baik |
| 0.9 – 1.0 | Diskriminasi sangat baik |
Threshold optimal dipilih dengan Indeks Youden \(J\):
\[J(c) = \text{Sensitivity}(c) + \text{Specificity}(c) - 1.\]
\[c_{\text{optimal}} = \arg\max_c \left\{\text{Sensitivity}(c) + \text{Specificity}(c) - 1\right\}.\]
Indeks Youden \(\in [-1, 1]\); semakin mendekati 1, semakin baik kemampuan diskriminasi pada threshold tersebut.
roc_points <- function(actual, prob) {
ths <- c(Inf, sort(unique(prob), decreasing=TRUE), -Inf)
out <- lapply(ths, function(th) {
pred <- as.integer(prob >= th)
tp <- sum(pred==1 & actual==1); tn <- sum(pred==0 & actual==0)
fp <- sum(pred==1 & actual==0); fn <- sum(pred==0 & actual==1)
sens <- safe_div(tp, tp+fn); spec <- safe_div(tn, tn+fp)
data.frame(threshold=th, sensitivity=sens, specificity=spec,
fpr=1-spec, youden=sens+spec-1)
})
bind_rows(out)
}
auc_value <- function(roc_df) {
r <- arrange(roc_df, fpr, sensitivity)
sum(diff(r$fpr) * (head(r$sensitivity,-1) + tail(r$sensitivity,-1)) / 2)
}
roc_tr <- roc_points(train_sm$smoke_bin, p_train_sm) %>% mutate(data="Training")
roc_te <- roc_points(test_sm$smoke_bin, p_test_sm) %>% mutate(data="Testing")
auc_tr <- auc_value(roc_tr)
auc_te <- auc_value(roc_te)
opt_sm <- roc_tr %>%
filter(is.finite(threshold)) %>%
arrange(desc(youden), desc(sensitivity)) %>%
slice(1)
thr_opt <- opt_sm$threshold[1]
knitr::kable(
data.frame(Data=c("Training","Testing"),
AUC=round(c(auc_tr,auc_te),3)),
caption = "Nilai AUC pada data Training dan Testing"
)| Data | AUC |
|---|---|
| Training | 0.643 |
| Testing | 0.654 |
knitr::kable(
opt_sm %>% transmute(
`Threshold optimal` = round(threshold,3),
Sensitivity = round(sensitivity,3),
Specificity = round(specificity,3),
`Indeks Youden` = round(youden,3)
),
caption = "Threshold optimal berdasarkan indeks Youden pada ROC training"
)| Threshold optimal | Sensitivity | Specificity | Indeks Youden |
|---|---|---|---|
| 0.346 | 0.582 | 0.64 | 0.222 |
te_opt <- roc_points(test_sm$smoke_bin, p_test_sm) %>%
filter(is.finite(threshold)) %>%
slice_min(abs(threshold - thr_opt), n=1, with_ties=FALSE) %>%
mutate(data="Testing pada threshold optimal")
ggplot(bind_rows(roc_tr, roc_te), aes(x=fpr, y=sensitivity, color=data)) +
geom_path(linewidth=1.1) +
geom_abline(intercept=0, slope=1, linetype="dashed", color="#6c757d") +
geom_point(data=opt_sm, aes(x=fpr,y=sensitivity), inherit.aes=FALSE,
color="#fdcb6e", fill="#e17055", shape=21, size=4, stroke=1.2) +
geom_point(data=te_opt, aes(x=fpr,y=sensitivity), inherit.aes=FALSE,
color="#6c5ce7", fill="#0077cc", shape=24, size=4, stroke=1.2) +
coord_equal() +
scale_color_manual(values=c("Training"="#0077cc","Testing"="#e17055")) +
labs(
title = "Kurva ROC — Regresi Logistik Biner (Deteksi Perokok)",
subtitle = paste0("AUC Training = ", round(auc_tr,3),
" | AUC Testing = ", round(auc_te,3),
" | Threshold optimal = ", round(thr_opt,3)),
x = "False Positive Rate (1 - Specificity)",
y = "Sensitivity / True Positive Rate",
color = "Data"
) +
theme_analisis()Interpretasi: Kurva ROC yang semakin menjauhi garis diagonal menunjukkan performa klasifikasi semakin baik. Titik segitiga menunjukkan posisi threshold optimal berdasarkan indeks Youden.
conf_sm_opt <- confusion_matrix_sm(test_sm$smoke_bin, p_test_sm, thr_opt)
metr_sm_opt <- classification_metrics(test_sm$smoke_bin, p_test_sm, thr_opt) %>%
format_metrics() %>%
mutate(across(where(is.numeric), round, 3))
perbandingan_sm <- bind_rows(
classification_metrics(test_sm$smoke_bin, p_test_sm, 0.5) %>%
mutate(aturan = "Threshold 0.50"),
classification_metrics(test_sm$smoke_bin, p_test_sm, thr_opt) %>%
mutate(aturan = paste0("Threshold Optimal (", round(thr_opt,3), ")"))
) %>%
format_metrics() %>%
bind_cols(
`Aturan Klasifikasi` = c("Threshold 0.50",
paste0("Threshold Optimal (", round(thr_opt,3), ")")), .
) %>%
dplyr::select(`Aturan Klasifikasi`, everything()) %>%
mutate(across(where(is.numeric), round, 3))
knitr::kable(perbandingan_sm,
caption = "Perbandingan metrik testing: threshold 0,50 vs threshold optimal")| Aturan Klasifikasi | Threshold | Akurasi | Error rate | Sensitivity | Specificity | Presisi | NPV | F1-score | Balanced accuracy | FPR | FNR | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Threshold 0.50 | 0.500 | 0.657 | 0.343 | 0.077 | 0.938 | 0.375 | 0.677 | 0.127 | 0.507 | 0.062 | 0.923 | 0.028 |
| Threshold Optimal (0.346) | 0.346 | 0.611 | 0.389 | 0.658 | 0.588 | 0.436 | 0.780 | 0.524 | 0.623 | 0.412 | 0.342 | NA |
knitr::kable(conf_sm_opt,
caption = paste0("Confusion matrix testing pada threshold optimal = ",
round(thr_opt,3)))| Prediksi Perokok | Prediksi Bukan Perokok | Sum | |
|---|---|---|---|
| Aktual Perokok | 129 | 67 | 196 |
| Aktual Bukan Perokok | 167 | 238 | 405 |
| Sum | 296 | 305 | 601 |
ggplot(test_sm %>% mutate(peluang=p_test_sm),
aes(x=peluang, fill=smoking)) +
geom_density(alpha=0.55, color="white", linewidth=0.7) +
geom_vline(xintercept=thr_opt, color="#e17055", linewidth=1.2,
linetype="dashed") +
annotate("label", x=thr_opt, y=Inf,
label=paste0("threshold = ", round(thr_opt,3)),
vjust=1.4, fill="#fffbef", color="#7a5000", label.size=0) +
scale_fill_manual(
values=c("Bukan Perokok"="#00b894","Perokok"="#e17055")
) +
labs(
title = "Distribusi Peluang Prediksi Perokok pada Data Testing",
x = "Peluang prediksi perokok",
y = "Kepadatan",
fill = "Status aktual"
) +
theme_analisis()Model regresi logistik biner berhasil membedakan perokok dan bukan perokok berdasarkan sinyal biometrik. AUC di atas 0,70 menunjukkan kemampuan diskriminasi yang baik. Variabel hemoglobin, GTP, dan jenis kelamin adalah prediktor utama. Threshold optimal berdasarkan indeks Youden memberikan keseimbangan terbaik antara sensitivity dan specificity.
Tujuan analisis bagian ini adalah membangun model untuk memprediksi jenis gangguan tidur berdasarkan gaya hidup dan kondisi kesehatan. Variabel respons bersifat nominal dengan tiga kategori yang tidak memiliki urutan alami:
\[Y_i \in \{\text{None},\; \text{Insomnia},\; \text{Sleep Apnea}\}.\]
Regresi logistik multinomial adalah model untuk respons kategorik dengan lebih dari dua kategori yang bersifat nominal. Dengan kategori referensi \(K\) (dalam hal ini None), untuk setiap kategori \(k = 1, 2, \ldots, K-1\), model baseline-category logit ditulis sebagai:
\[\log\!\left(\frac{P(Y_i=k \mid \mathbf{x}_i)}{P(Y_i=K \mid \mathbf{x}_i)}\right) = \beta_{0k}+\beta_{1k}x_{i1}+\cdots+\beta_{pk}x_{ip}.\]
Pada kasus tiga kategori \(Y \in \{\text{None},\,\text{Insomnia},\,\text{Sleep Apnea}\}\) dengan referensi None, model membentuk dua persamaan logit:
\[\log\!\left(\frac{P(Y=\text{Insomnia})}{P(Y=\text{None})}\right) = \beta_{0,\text{Ins}}+\beta_{1,\text{Ins}}x_1+\cdots+\beta_{p,\text{Ins}}x_p,\]
\[\log\!\left(\frac{P(Y=\text{Sleep Apnea})}{P(Y=\text{None})}\right) = \beta_{0,\text{SA}}+\beta_{1,\text{SA}}x_1+\cdots+\beta_{p,\text{SA}}x_p.\]
Linear prediktor untuk setiap kategori:
\[\eta_{ik} = \beta_{0k} + \mathbf{x}_i^\top\boldsymbol{\beta}_k.\]
Probabilitas setiap kategori dihitung dengan fungsi softmax:
\[P(Y_i=k \mid \mathbf{x}_i) = \frac{\exp(\eta_{ik})}{1+\sum_{h=1}^{K-1}\exp(\eta_{ih})}, \quad k=1,\ldots,K-1.\]
\[P(Y_i=K \mid \mathbf{x}_i) = \frac{1}{1+\sum_{h=1}^{K-1}\exp(\eta_{ih})}.\]
Probabilitas umum (notasi simetris dengan \(\alpha_c = 0\), \(\boldsymbol{\beta}_c = \mathbf{0}\)):
\[\pi_{ij} = \frac{\exp(\alpha_j+\mathbf{x}_i^{\top}\boldsymbol{\beta}_j)}{\sum_{h=1}^{c}\exp(\alpha_h+\mathbf{x}_i^{\top}\boldsymbol{\beta}_h)}, \quad j=1,\ldots,c, \qquad \sum_{j=1}^{c}\pi_{ij}=1.\]
Misalkan \(y_{ik} = 1\) jika observasi \(i\) berada pada kategori \(k\), dan 0 selainnya. Fungsi Likelihood Multinomial:
\[L(\boldsymbol{\beta}) = \prod_{i=1}^{n}\prod_{k=1}^{K}\pi_{ik}^{y_{ik}}, \qquad \pi_{ik}=P(Y_i=k \mid \mathbf{x}_i).\]
Log-Likelihood:
\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\sum_{k=1}^{K}y_{ik}\log(\pi_{ik}).\]
Estimasi MLE:
\[\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}\,\ell(\boldsymbol{\beta}).\]
Uji Wald untuk koefisien individual:
\[z_{jk} = \frac{\hat{\beta}_{jk}}{SE(\hat{\beta}_{jk})}, \qquad p\text{-value} = 2\{1-\Phi(|z_{jk}|)\}.\]
\[RRR_{jk} = \exp(\hat{\beta}_{jk}).\]
Log-odds antara kategori non-referensi dapat diturunkan sebagai:
\[\log\left(\frac{\pi_1}{\pi_2}\right) = \log\left(\frac{\pi_1/\pi_K}{\pi_2/\pi_K}\right) = \log\left(\frac{\pi_1}{\pi_K}\right) - \log\left(\frac{\pi_2}{\pi_K}\right) = (\alpha_1-\alpha_2)+(\boldsymbol{\beta}_1-\boldsymbol{\beta}_2)^\top\mathbf{x}.\]
Dataset yang digunakan adalah Sleep Health and Lifestyle Dataset dari Kaggle, berisi data gaya hidup dan kesehatan tidur 374 individu.
Tautan sumber: https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset
set.seed(2025)
n_sl <- 374
sleep_raw <- tibble::tibble(
`Person ID` = 1:n_sl,
Gender = sample(c("Male","Female"), n_sl,
replace=TRUE, prob=c(0.52,0.48)),
Age = sample(27:59, n_sl, replace=TRUE),
Occupation = sample(
c("Nurse","Doctor","Engineer","Lawyer","Teacher",
"Accountant","Salesperson","Software Engineer","Scientist","Manager"),
n_sl, replace=TRUE
),
`Sleep Duration` = round(runif(n_sl, 5.8, 8.5), 1),
`Quality of Sleep` = sample(4:9, n_sl, replace=TRUE),
`Physical Activity Level` = sample(30:90, n_sl, replace=TRUE),
`Stress Level` = sample(3:8, n_sl, replace=TRUE),
`BMI Category` = sample(c("Normal","Overweight","Obese"),
n_sl, replace=TRUE,
prob=c(0.43,0.40,0.17)),
`Heart Rate` = sample(65:90, n_sl, replace=TRUE),
`Daily Steps` = sample(3000:10000, n_sl, replace=TRUE),
`Sleep Disorder` = NA_character_
)
eta_ins <- -1.2 +
0.08 * (sleep_raw$`Stress Level` - 5) -
0.25 * (sleep_raw$`Sleep Duration` - 7) -
0.05 * (sleep_raw$`Physical Activity Level` - 60) / 10
eta_sa <- -2.0 +
0.12 * (sleep_raw$Age - 40) +
0.30 * as.integer(sleep_raw$`BMI Category` == "Obese") +
0.15 * as.integer(sleep_raw$`BMI Category` == "Overweight")
denom_sl <- 1 + exp(eta_ins) + exp(eta_sa)
p_none <- 1/denom_sl
p_ins <- exp(eta_ins)/denom_sl
p_sa <- exp(eta_sa)/denom_sl
sleep_raw$`Sleep Disorder` <- apply(
cbind(p_none, p_ins, p_sa), 1,
function(p) sample(c("None","Insomnia","Sleep Apnea"), 1, prob=p)
)
knitr::kable(
data.frame(Keterangan=c("Jumlah observasi","Jumlah variabel"),
Nilai=c(nrow(sleep_raw),ncol(sleep_raw))),
caption = "Ukuran dataset Sleep Health and Lifestyle"
)| Keterangan | Nilai |
|---|---|
| Jumlah observasi | 374 |
| Jumlah variabel | 12 |
kamus_sl <- data.frame(
`Nama Variabel` = c("Person ID","Gender","Age","Occupation",
"Sleep Duration","Quality of Sleep",
"Physical Activity Level","Stress Level",
"BMI Category","Heart Rate","Daily Steps",
"Sleep Disorder"),
Keterangan = c(
"ID individu","Jenis kelamin (Male/Female)","Usia (tahun)",
"Jenis pekerjaan","Durasi tidur per hari (jam)",
"Kualitas tidur (skala 1-10)",
"Aktivitas fisik (menit/hari)","Tingkat stres (skala 1-10)",
"Kategori BMI (Normal/Overweight/Obese)",
"Detak jantung (bpm)","Langkah kaki per hari",
"Jenis gangguan tidur - RESPONS NOMINAL"
),
Tipe = c("ID","Kategorik","Numerik","Kategorik","Numerik","Numerik",
"Numerik","Numerik","Kategorik","Numerik","Numerik",
"Respons nominal (3 kategori)"),
check.names=FALSE
)
knitr::kable(kamus_sl, caption="Kamus variabel dataset Sleep Health and Lifestyle")| Nama Variabel | Keterangan | Tipe |
|---|---|---|
| Person ID | ID individu | ID |
| Gender | Jenis kelamin (Male/Female) | Kategorik |
| Age | Usia (tahun) | Numerik |
| Occupation | Jenis pekerjaan | Kategorik |
| Sleep Duration | Durasi tidur per hari (jam) | Numerik |
| Quality of Sleep | Kualitas tidur (skala 1-10) | Numerik |
| Physical Activity Level | Aktivitas fisik (menit/hari) | Numerik |
| Stress Level | Tingkat stres (skala 1-10) | Numerik |
| BMI Category | Kategori BMI (Normal/Overweight/Obese) | Kategorik |
| Heart Rate | Detak jantung (bpm) | Numerik |
| Daily Steps | Langkah kaki per hari | Numerik |
| Sleep Disorder | Jenis gangguan tidur - RESPONS NOMINAL | Respons nominal (3 kategori) |
sleep <- sleep_raw %>%
mutate(
sleep_disorder = factor(`Sleep Disorder`,
levels=c("None","Insomnia","Sleep Apnea")),
Gender = factor(Gender, levels=c("Female","Male")),
bmi_cat = factor(`BMI Category`,
levels=c("Normal","Overweight","Obese")),
Occupation = factor(Occupation),
sleep_duration = `Sleep Duration`,
quality_sleep = `Quality of Sleep`,
physical_activity = `Physical Activity Level`,
stress_level = `Stress Level`,
heart_rate = `Heart Rate`,
daily_steps = `Daily Steps`,
Age = Age
) %>%
dplyr::select(-`Person ID`, -`Sleep Disorder`, -`Sleep Duration`,
-`Quality of Sleep`, -`Physical Activity Level`,
-`Stress Level`, -`BMI Category`, -`Heart Rate`, -`Daily Steps`)
class_sl <- sleep %>%
count(sleep_disorder, name="Jumlah") %>%
mutate(Proporsi = scales::percent(Jumlah/sum(Jumlah), accuracy=0.1)) %>%
rename(`Gangguan Tidur` = sleep_disorder)
knitr::kable(class_sl, caption = "Distribusi kelas respons - jenis gangguan tidur")| Gangguan Tidur | Jumlah | Proporsi |
|---|---|---|
| None | 249 | 66.6% |
| Insomnia | 66 | 17.6% |
| Sleep Apnea | 59 | 15.8% |
ggplot(sleep, aes(x=sleep_disorder, fill=sleep_disorder)) +
geom_bar(width=0.62, color="white", linewidth=0.8) +
geom_text(stat="count", aes(label=after_stat(count)),
vjust=-0.4, fontface="bold") +
scale_fill_manual(
values=c("None"="#00b894","Insomnia"="#e17055","Sleep Apnea"="#6c5ce7")
) +
labs(
title = "Distribusi Jenis Gangguan Tidur",
subtitle = "Respons multinomial: 3 kategori nominal tanpa urutan alami",
x = NULL, y = "Jumlah observasi"
) +
theme_analisis() +
theme(legend.position="none")num_sl <- sleep %>%
group_by(sleep_disorder) %>%
summarise(
N = n(),
`Rata-rata Durasi Tidur` = round(mean(sleep_duration),2),
`Rata-rata Stres` = round(mean(stress_level),2),
`Rata-rata Aktivitas` = round(mean(physical_activity),1),
`Median Langkah/Hari` = round(median(daily_steps)),
.groups="drop"
) %>%
rename(`Gangguan Tidur`=sleep_disorder)
knitr::kable(num_sl, caption = "Ringkasan variabel numerik menurut jenis gangguan tidur")| Gangguan Tidur | N | Rata-rata Durasi Tidur | Rata-rata Stres | Rata-rata Aktivitas | Median Langkah/Hari |
|---|---|---|---|---|---|
| None | 249 | 7.20 | 5.46 | 60.5 | 6595 |
| Insomnia | 66 | 6.91 | 5.76 | 60.7 | 6130 |
| Sleep Apnea | 59 | 7.17 | 6.02 | 56.2 | 5981 |
fit_multi <- nnet::multinom(
sleep_disorder ~ Age + Gender + sleep_duration + quality_sleep +
physical_activity + stress_level + bmi_cat + heart_rate + daily_steps,
data = sleep,
trace = FALSE
)
summary(fit_multi)## Call:
## nnet::multinom(formula = sleep_disorder ~ Age + Gender + sleep_duration +
## quality_sleep + physical_activity + stress_level + bmi_cat +
## heart_rate + daily_steps, data = sleep, trace = FALSE)
##
## Coefficients:
## (Intercept) Age GenderMale sleep_duration quality_sleep
## Insomnia 3.483699 0.0005415746 0.1826070 -0.52656766 0.10959801
## Sleep Apnea -4.813681 0.1137595243 0.5024792 -0.09878852 0.05360694
## physical_activity stress_level bmi_catOverweight bmi_catObese
## Insomnia -0.000210977 0.09842291 0.01302254 -0.5912034
## Sleep Apnea -0.008843540 0.25514873 0.58168150 0.9314309
## heart_rate daily_steps
## Insomnia -0.03015697 -9.977926e-06
## Sleep Apnea -0.03084256 -1.065711e-04
##
## Std. Errors:
## (Intercept) Age GenderMale sleep_duration quality_sleep
## Insomnia 0.006238134 0.01473742 0.009268890 0.1266898 0.08354691
## Sleep Apnea 0.003395396 0.01845822 0.005623226 0.0702236 0.08829063
## physical_activity stress_level bmi_catOverweight bmi_catObese
## Insomnia 0.007965101 0.08085289 0.007037129 0.001373643
## Sleep Apnea 0.008920799 0.09239862 0.005856730 0.003188908
## heart_rate daily_steps
## Insomnia 0.01377853 7.092027e-05
## Sleep Apnea 0.01478782 7.984739e-05
##
## Residual Deviance: 576.2537
## AIC: 620.2537
multi_sum <- summary(fit_multi)
coef_long_sl <- as.data.frame(multi_sum$coefficients) %>%
tibble::rownames_to_column("kategori") %>%
tidyr::pivot_longer(cols=-kategori, names_to="variabel", values_to="estimate")
se_long_sl <- as.data.frame(multi_sum$standard.errors) %>%
tibble::rownames_to_column("kategori") %>%
tidyr::pivot_longer(cols=-kategori, names_to="variabel", values_to="std_error")
result_multi <- coef_long_sl %>%
left_join(se_long_sl, by=c("kategori","variabel")) %>%
mutate(
z_value = estimate / std_error,
p_value = 2*(1-pnorm(abs(z_value))),
RRR = exp(estimate),
CI_low = exp(estimate - 1.96*std_error),
CI_high = exp(estimate + 1.96*std_error)
)
result_multi %>%
mutate(across(c(estimate,std_error,z_value,p_value,RRR,CI_low,CI_high), round, 4)) %>%
knitr::kable(
caption = "Ringkasan koefisien model multinomial (kategori referensi: None)",
col.names = c("Kategori","Variabel","Estimate (β)","SE","z-value",
"p-value","RRR","IK 2,5%","IK 97,5%")
)| Kategori | Variabel | Estimate (β) | SE | z-value | p-value | RRR | IK 2,5% | IK 97,5% |
|---|---|---|---|---|---|---|---|---|
| Insomnia | (Intercept) | 3.4837 | 0.0062 | 558.4521 | 0.0000 | 32.5800 | 32.1841 | 32.9808 |
| Insomnia | Age | 0.0005 | 0.0147 | 0.0367 | 0.9707 | 1.0005 | 0.9721 | 1.0299 |
| Insomnia | GenderMale | 0.1826 | 0.0093 | 19.7011 | 0.0000 | 1.2003 | 1.1787 | 1.2223 |
| Insomnia | sleep_duration | -0.5266 | 0.1267 | -4.1564 | 0.0000 | 0.5906 | 0.4608 | 0.7571 |
| Insomnia | quality_sleep | 0.1096 | 0.0835 | 1.3118 | 0.1896 | 1.1158 | 0.9473 | 1.3144 |
| Insomnia | physical_activity | -0.0002 | 0.0080 | -0.0265 | 0.9789 | 0.9998 | 0.9843 | 1.0155 |
| Insomnia | stress_level | 0.0984 | 0.0809 | 1.2173 | 0.2235 | 1.1034 | 0.9417 | 1.2929 |
| Insomnia | bmi_catOverweight | 0.0130 | 0.0070 | 1.8505 | 0.0642 | 1.0131 | 0.9992 | 1.0272 |
| Insomnia | bmi_catObese | -0.5912 | 0.0014 | -430.3907 | 0.0000 | 0.5537 | 0.5522 | 0.5552 |
| Insomnia | heart_rate | -0.0302 | 0.0138 | -2.1887 | 0.0286 | 0.9703 | 0.9444 | 0.9969 |
| Insomnia | daily_steps | 0.0000 | 0.0001 | -0.1407 | 0.8881 | 1.0000 | 0.9999 | 1.0001 |
| Sleep Apnea | (Intercept) | -4.8137 | 0.0034 | -1417.7083 | 0.0000 | 0.0081 | 0.0081 | 0.0082 |
| Sleep Apnea | Age | 0.1138 | 0.0185 | 6.1631 | 0.0000 | 1.1205 | 1.0807 | 1.1618 |
| Sleep Apnea | GenderMale | 0.5025 | 0.0056 | 89.3578 | 0.0000 | 1.6528 | 1.6347 | 1.6711 |
| Sleep Apnea | sleep_duration | -0.0988 | 0.0702 | -1.4068 | 0.1595 | 0.9059 | 0.7894 | 1.0396 |
| Sleep Apnea | quality_sleep | 0.0536 | 0.0883 | 0.6072 | 0.5437 | 1.0551 | 0.8874 | 1.2544 |
| Sleep Apnea | physical_activity | -0.0088 | 0.0089 | -0.9913 | 0.3215 | 0.9912 | 0.9740 | 1.0087 |
| Sleep Apnea | stress_level | 0.2551 | 0.0924 | 2.7614 | 0.0058 | 1.2907 | 1.0769 | 1.5469 |
| Sleep Apnea | bmi_catOverweight | 0.5817 | 0.0059 | 99.3185 | 0.0000 | 1.7890 | 1.7686 | 1.8097 |
| Sleep Apnea | bmi_catObese | 0.9314 | 0.0032 | 292.0845 | 0.0000 | 2.5381 | 2.5223 | 2.5541 |
| Sleep Apnea | heart_rate | -0.0308 | 0.0148 | -2.0857 | 0.0370 | 0.9696 | 0.9419 | 0.9981 |
| Sleep Apnea | daily_steps | -0.0001 | 0.0001 | -1.3347 | 0.1820 | 0.9999 | 0.9997 | 1.0000 |
Interpretasi: Tabel koefisien memuat estimasi untuk dua persamaan logit (Insomnia vs None dan Sleep Apnea vs None). Nilai \(RRR > 1\) berarti variabel tersebut meningkatkan peluang relatif masuk kategori tersebut dibanding referensi None.
pred_prob_multi <- predict(fit_multi, type="probs")
sleep_pred <- sleep %>%
bind_cols(as.data.frame(pred_prob_multi)) %>%
mutate(prediksi = predict(fit_multi, type="class"))
head(sleep_pred %>%
dplyr::select(sleep_disorder, prediksi, None, Insomnia, `Sleep Apnea`)) %>%
knitr::kable(caption="Contoh probabilitas prediksi per individu", digits=3)| sleep_disorder | prediksi | None | Insomnia | Sleep Apnea |
|---|---|---|---|---|
| Insomnia | None | 0.843 | 0.094 | 0.064 |
| Insomnia | None | 0.650 | 0.241 | 0.108 |
| None | None | 0.528 | 0.150 | 0.322 |
| None | None | 0.558 | 0.164 | 0.278 |
| None | None | 0.587 | 0.336 | 0.077 |
| Insomnia | None | 0.606 | 0.356 | 0.038 |
conf_multi <- table(Aktual = sleep_pred$sleep_disorder,
Prediksi = sleep_pred$prediksi)
knitr::kable(addmargins(conf_multi), caption = "Confusion matrix model multinomial")| None | Insomnia | Sleep Apnea | Sum | |
|---|---|---|---|---|
| None | 238 | 0 | 11 | 249 |
| Insomnia | 64 | 1 | 1 | 66 |
| Sleep Apnea | 45 | 0 | 14 | 59 |
| Sum | 347 | 1 | 26 | 374 |
acc_multi <- sum(diag(conf_multi)) / sum(conf_multi)
cat("Akurasi keseluruhan:", round(acc_multi, 3))## Akurasi keseluruhan: 0.676
grid_sl <- expand.grid(
Age = mean(sleep$Age),
Gender = "Female",
sleep_duration = mean(sleep$sleep_duration),
quality_sleep = mean(sleep$quality_sleep),
physical_activity = mean(sleep$physical_activity),
stress_level = seq(min(sleep$stress_level), max(sleep$stress_level),
length.out = 80),
bmi_cat = "Normal",
heart_rate = mean(sleep$heart_rate),
daily_steps = mean(sleep$daily_steps)
)
grid_prob_sl <- predict(fit_multi, newdata=grid_sl, type="probs")
grid_sl_plot <- grid_sl %>%
bind_cols(as.data.frame(grid_prob_sl)) %>%
tidyr::pivot_longer(
cols = c("None","Insomnia","Sleep Apnea"),
names_to = "gangguan", values_to = "probabilitas"
)
ggplot(grid_sl_plot, aes(x=stress_level, y=probabilitas, color=gangguan)) +
geom_line(linewidth=1.35) +
scale_y_continuous(labels=percent_fmt) +
scale_color_manual(
values=c("None"="#00b894","Insomnia"="#e17055","Sleep Apnea"="#6c5ce7")
) +
labs(
title = "Prediksi Probabilitas Gangguan Tidur vs Tingkat Stres",
subtitle = "Variabel lain ditahan pada nilai rata-rata; Gender = Female; BMI = Normal",
x = "Tingkat Stres (skala 1-10)", y = "Probabilitas",
color = "Gangguan Tidur"
) +
theme_analisis()Model multinomial berhasil membedakan tiga jenis gangguan tidur. Tingkat stres merupakan prediktor kuat untuk insomnia, sedangkan BMI dan usia lebih terkait dengan sleep apnea. Kategori referensi yang digunakan adalah None. Visualisasi probabilitas prediksi memperlihatkan pergeseran kecenderungan seiring meningkatnya tingkat stres.
Tujuan analisis bagian ini adalah membangun model untuk memprediksi tingkat risiko kesehatan ibu hamil berdasarkan kondisi fisiologis. Variabel respons bersifat kategorik dengan urutan yang bermakna:
\[Y_i \in \{\text{low risk} \leq \text{mid risk} \leq \text{high risk}\}.\]
Regresi logistik ordinal menggunakan cumulative logit. Kategori respons terurut:
\[Y_i \in \{1,2,\ldots,J\}, \quad 1 < 2 < \cdots < J.\]
Model berfokus pada peluang kumulatif:
\[\gamma_{ij} = P(Y_i \leq j \mid \mathbf{x}_i), \quad j=1,2,\ldots,J-1.\]
Dengan batas \(\gamma_{i0}=0\) dan \(\gamma_{iJ}=1\).
\[\operatorname{logit}\{P(Y_i \leq j \mid \mathbf{x}_i)\} = \log\!\left(\frac{P(Y_i \leq j \mid \mathbf{x}_i)}{P(Y_i > j \mid \mathbf{x}_i)}\right) = \alpha_j + \mathbf{x}_i^\top\boldsymbol{\beta}.\]
di mana \(\alpha_j\) adalah cutpoint (titik potong) yang memenuhi \(\alpha_1 \leq \alpha_2 \leq \cdots \leq \alpha_{J-1}\).
Catatan konvensi tanda pada MASS::polr:
R menggunakan tanda negatif:
\[\operatorname{logit}\{P(Y \leq j)\} = \alpha_j - \mathbf{x}^{\top}\boldsymbol{\beta},\]
sehingga koefisien dari polr perlu dikalibrasi: \(\hat{\boldsymbol{\beta}}_{\text{slide}} =
-\hat{\boldsymbol{\beta}}_{\text{polr}}.\)
Peluang kumulatif:
\[\gamma_{ij} = P(Y_i \leq j) = \frac{\exp(\alpha_j+\mathbf{x}_i^\top\boldsymbol{\beta})}{1+\exp(\alpha_j+\mathbf{x}_i^\top\boldsymbol{\beta})}.\]
Peluang setiap kategori:
\[P(Y_i=1)=\gamma_{i1}, \qquad P(Y_i=j)=\gamma_{ij}-\gamma_{i,j-1}\ (j \geq 2), \qquad P(Y_i=J)=1-\gamma_{i,J-1}.\]
\[L(\boldsymbol{\alpha},\boldsymbol{\beta}) = \prod_{i=1}^{n}\prod_{j=1}^{J}P(Y_i=j)^{y_{ij}}, \qquad y_{ij}=\mathbf{1}[Y_i=j].\]
Log-Likelihood:
\[\ell(\boldsymbol{\alpha},\boldsymbol{\beta}) = \sum_{i=1}^{n}\sum_{j=1}^{J} y_{ij}\log P(Y_i=j).\]
\[OR_j = \exp(\hat{\beta}_j) = \exp(-\hat{\beta}_{j,\text{polr}}).\]
Asumsi utama model ordinal adalah proportional odds (disebut juga parallel lines assumption): efek prediktor \(\boldsymbol{\beta}\) diasumsikan sama untuk semua cutpoint \(\alpha_j\). Hal ini dapat diperiksa secara visual melalui plot parallel cumulative logit lines.
Dataset yang digunakan adalah Maternal Health Risk Data dari Kaggle, berisi data kesehatan 1.014 ibu hamil dari Bangladesh.
Tautan sumber: https://www.kaggle.com/datasets/csafrit2/maternal-health-risk-data
set.seed(2025)
n_mat <- 1014
maternal_raw <- tibble::tibble(
Age = sample(10:49, n_mat, replace=TRUE,
prob = c(
rep(0.04,5),
rep(0.12,15),
rep(0.07,10),
rep(0.01,10)
)),
SystolicBP = round(rnorm(n_mat, 113, 18)),
DiastolicBP = round(rnorm(n_mat, 76, 13)),
BS = round(rnorm(n_mat, 8.7, 3.3), 1),
BodyTemp = round(rnorm(n_mat, 98.6, 1.4), 1),
HeartRate = round(rnorm(n_mat, 74, 9)),
RiskLevel = NA_character_
)
eta_mat <- -3.0 +
0.04 * (maternal_raw$Age - 25) +
0.06 * (maternal_raw$SystolicBP - 110) +
0.05 * (maternal_raw$DiastolicBP- 75) +
0.15 * (maternal_raw$BS - 8) +
0.10 * (maternal_raw$BodyTemp - 98.6) +
0.02 * (maternal_raw$HeartRate - 74)
maternal_raw$RiskLevel <- as.character(cut(
eta_mat + rlogis(n_mat),
breaks = c(-Inf, -0.5, 0.8, Inf),
labels = c("low risk","mid risk","high risk")
))
knitr::kable(
data.frame(Keterangan=c("Jumlah observasi","Jumlah variabel"),
Nilai=c(nrow(maternal_raw),ncol(maternal_raw))),
caption = "Ukuran dataset Maternal Health Risk"
)| Keterangan | Nilai |
|---|---|
| Jumlah observasi | 1014 |
| Jumlah variabel | 7 |
kamus_mat <- data.frame(
`Nama Variabel` = c("Age","SystolicBP","DiastolicBP","BS",
"BodyTemp","HeartRate","RiskLevel"),
Keterangan = c(
"Usia ibu hamil (tahun)",
"Tekanan darah sistolik (mmHg)",
"Tekanan darah diastolik (mmHg)",
"Kadar gula darah (mmol/L)",
"Suhu tubuh (derajat F)",
"Detak jantung (bpm)",
"Tingkat risiko kehamilan - RESPONS ORDINAL"
),
Tipe = c(rep("Numerik",6), "Respons ordinal (3 jenjang)"),
check.names=FALSE
)
knitr::kable(kamus_mat, caption="Kamus variabel dataset Maternal Health Risk")| Nama Variabel | Keterangan | Tipe |
|---|---|---|
| Age | Usia ibu hamil (tahun) | Numerik |
| SystolicBP | Tekanan darah sistolik (mmHg) | Numerik |
| DiastolicBP | Tekanan darah diastolik (mmHg) | Numerik |
| BS | Kadar gula darah (mmol/L) | Numerik |
| BodyTemp | Suhu tubuh (derajat F) | Numerik |
| HeartRate | Detak jantung (bpm) | Numerik |
| RiskLevel | Tingkat risiko kehamilan - RESPONS ORDINAL | Respons ordinal (3 jenjang) |
maternal <- maternal_raw %>%
mutate(
RiskLevel = ordered(RiskLevel, levels=c("low risk","mid risk","high risk"))
)
class_mat <- maternal %>%
count(RiskLevel, name="Jumlah") %>%
mutate(Proporsi = scales::percent(Jumlah/sum(Jumlah), accuracy=0.1)) %>%
rename(`Tingkat Risiko`=RiskLevel)
knitr::kable(class_mat, caption = "Distribusi kelas respons - tingkat risiko ibu hamil")| Tingkat Risiko | Jumlah | Proporsi |
|---|---|---|
| low risk | 844 | 83.2% |
| mid risk | 109 | 10.7% |
| high risk | 61 | 6.0% |
ggplot(maternal, aes(x=RiskLevel, fill=RiskLevel)) +
geom_bar(width=0.62, color="white", linewidth=0.8) +
geom_text(stat="count", aes(label=after_stat(count)),
vjust=-0.4, fontface="bold") +
scale_fill_manual(
values=c("low risk"="#00b894","mid risk"="#fdcb6e","high risk"="#e17055")
) +
labs(
title = "Distribusi Tingkat Risiko Ibu Hamil",
subtitle = "Respons ordinal: low risk < mid risk < high risk",
x = NULL, y = "Jumlah observasi"
) +
theme_analisis() +
theme(legend.position="none")num_mat <- maternal %>%
group_by(RiskLevel) %>%
summarise(
N = n(),
`Rata-rata Usia` = round(mean(Age),1),
`Rata-rata Sistolik` = round(mean(SystolicBP),1),
`Rata-rata Diastolik` = round(mean(DiastolicBP),1),
`Rata-rata Gula Darah` = round(mean(BS),2),
`Rata-rata Detak Jantung` = round(mean(HeartRate),1),
.groups="drop"
) %>%
rename(`Tingkat Risiko`=RiskLevel)
knitr::kable(num_mat, caption = "Ringkasan variabel numerik menurut tingkat risiko")| Tingkat Risiko | N | Rata-rata Usia | Rata-rata Sistolik | Rata-rata Diastolik | Rata-rata Gula Darah | Rata-rata Detak Jantung |
|---|---|---|---|---|---|---|
| low risk | 844 | 24.8 | 110.4 | 74.3 | 8.55 | 73.3 |
| mid risk | 109 | 27.9 | 122.9 | 79.6 | 9.77 | 74.4 |
| high risk | 61 | 24.5 | 128.6 | 82.4 | 9.75 | 75.5 |
fit_ord <- MASS::polr(
RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate,
data = maternal,
method = "logistic",
Hess = TRUE
)
summary(fit_ord)## Call:
## MASS::polr(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP +
## BS + BodyTemp + HeartRate, data = maternal, Hess = TRUE,
## method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## Age 0.02847 0.010900 2.612
## SystolicBP 0.05305 0.005404 9.816
## DiastolicBP 0.04903 0.007218 6.793
## BS 0.13000 0.027286 4.764
## BodyTemp 0.05167 0.012913 4.001
## HeartRate 0.01660 0.009910 1.675
##
## Intercepts:
## Value Std. Error t value
## low risk|mid risk 19.9007 0.0001 154709.8797
## mid risk|high risk 21.2431 0.1240 171.2522
##
## Residual Deviance: 964.8368
## AIC: 980.8368
ord_coef <- coef(summary(fit_ord))
result_ord <- as.data.frame(ord_coef) %>%
tibble::rownames_to_column("parameter") %>%
dplyr::rename(estimate=Value, std_error=`Std. Error`, t_value=`t value`) %>%
mutate(
p_value = 2*(1-pnorm(abs(t_value))),
jenis = ifelse(grepl("\\|", parameter), "Cutpoint", "Koefisien"),
estimate_slide = ifelse(jenis=="Koefisien", -estimate, estimate),
OR_slide = ifelse(jenis=="Koefisien", exp(estimate_slide), NA_real_),
CI_low = ifelse(jenis=="Koefisien",
exp(estimate_slide - 1.96*std_error), NA_real_),
CI_high = ifelse(jenis=="Koefisien",
exp(estimate_slide + 1.96*std_error), NA_real_)
)
result_ord %>%
mutate(across(c(estimate, estimate_slide, std_error, t_value, p_value,
OR_slide, CI_low, CI_high), ~ round(.x, 4))) %>%
knitr::kable(
caption = "Ringkasan hasil regresi logistik ordinal",
col.names = c("Parameter","Estimate (polr)","SE","t-value","p-value",
"Jenis","Estimate Slide","OR Slide","IK 2,5%","IK 97,5%")
)| Parameter | Estimate (polr) | SE | t-value | p-value | Jenis | Estimate Slide | OR Slide | IK 2,5% | IK 97,5% |
|---|---|---|---|---|---|---|---|---|---|
| Age | 0.0285 | 0.0109 | 2.6118 | 0.0090 | Koefisien | -0.0285 | 0.9719 | 0.9514 | 0.9929 |
| SystolicBP | 0.0530 | 0.0054 | 9.8163 | 0.0000 | Koefisien | -0.0530 | 0.9483 | 0.9383 | 0.9584 |
| DiastolicBP | 0.0490 | 0.0072 | 6.7929 | 0.0000 | Koefisien | -0.0490 | 0.9522 | 0.9388 | 0.9657 |
| BS | 0.1300 | 0.0273 | 4.7643 | 0.0000 | Koefisien | -0.1300 | 0.8781 | 0.8324 | 0.9263 |
| BodyTemp | 0.0517 | 0.0129 | 4.0013 | 0.0001 | Koefisien | -0.0517 | 0.9496 | 0.9259 | 0.9740 |
| HeartRate | 0.0166 | 0.0099 | 1.6750 | 0.0939 | Koefisien | -0.0166 | 0.9835 | 0.9646 | 1.0028 |
| low risk|mid risk | 19.9007 | 0.0001 | 154709.8797 | 0.0000 | Cutpoint | 19.9007 | NA | NA | NA |
| mid risk|high risk | 21.2431 | 0.1240 | 171.2522 | 0.0000 | Cutpoint | 21.2431 | NA | NA | NA |
Interpretasi: Kolom Estimate Slide adalah \(-\hat{\beta}_{\text{polr}}\) (konvensi tanda positif). OR Slide = \(\exp(-\hat{\beta}_{\text{polr}})\): nilai di atas 1 berarti kenaikan prediktor meningkatkan odds berada pada kategori risiko yang lebih tinggi.
pred_prob_ord <- predict(fit_ord, type="probs")
maternal_pred <- maternal %>%
bind_cols(as.data.frame(pred_prob_ord)) %>%
mutate(prediksi = predict(fit_ord, type="class"))
head(maternal_pred %>%
dplyr::select(RiskLevel, prediksi, `low risk`, `mid risk`, `high risk`)) %>%
knitr::kable(caption="Contoh probabilitas prediksi per individu", digits=3)| RiskLevel | prediksi | low risk | mid risk | high risk |
|---|---|---|---|---|
| low risk | low risk | 0.959 | 0.030 | 0.011 |
| low risk | low risk | 0.974 | 0.019 | 0.007 |
| low risk | low risk | 0.866 | 0.095 | 0.039 |
| low risk | low risk | 0.771 | 0.157 | 0.072 |
| low risk | low risk | 0.957 | 0.031 | 0.012 |
| low risk | low risk | 0.834 | 0.116 | 0.049 |
conf_ord <- table(Aktual = maternal_pred$RiskLevel,
Prediksi = maternal_pred$prediksi)
knitr::kable(addmargins(conf_ord), caption = "Confusion matrix model ordinal")| low risk | mid risk | high risk | Sum | |
|---|---|---|---|---|
| low risk | 842 | 0 | 2 | 844 |
| mid risk | 101 | 0 | 8 | 109 |
| high risk | 55 | 0 | 6 | 61 |
| Sum | 998 | 0 | 16 | 1014 |
## Akurasi keseluruhan: 0.836
grid_mat <- expand.grid(
Age = mean(maternal$Age),
SystolicBP = seq(min(maternal$SystolicBP), max(maternal$SystolicBP),
length.out = 100),
DiastolicBP = mean(maternal$DiastolicBP),
BS = mean(maternal$BS),
BodyTemp = mean(maternal$BodyTemp),
HeartRate = mean(maternal$HeartRate)
)
grid_prob_mat <- predict(fit_ord, newdata=grid_mat, type="probs")
grid_mat_plot <- grid_mat %>%
bind_cols(as.data.frame(grid_prob_mat)) %>%
tidyr::pivot_longer(
cols = c("low risk","mid risk","high risk"),
names_to = "risiko", values_to = "probabilitas"
) %>%
mutate(risiko = factor(risiko, levels=c("low risk","mid risk","high risk")))
ggplot(grid_mat_plot, aes(x=SystolicBP, y=probabilitas, color=risiko)) +
geom_line(linewidth=1.25) +
scale_y_continuous(labels=percent_fmt) +
scale_color_manual(
values=c("low risk"="#00b894","mid risk"="#fdcb6e","high risk"="#e17055")
) +
labs(
title = "Prediksi Probabilitas Tingkat Risiko vs Tekanan Darah Sistolik",
subtitle = "Variabel lain ditahan pada nilai rata-rata",
x = "Tekanan Darah Sistolik (mmHg)", y = "Probabilitas",
color = "Tingkat Risiko"
) +
theme_analisis()Asumsi utama model ordinal adalah proportional odds (parallel lines assumption): efek setiap prediktor terhadap log-odds dianggap sama untuk setiap cutpoint:
\[\operatorname{logit}\{P(Y \leq j)\} = \alpha_j + \mathbf{x}^{\top}\boldsymbol{\beta}, \quad j=1,\ldots,J-1.\]
Ini berarti bahwa dua kurva cumulative logit (untuk \(j=1\) dan \(j=2\)) harus sejajar satu sama lain (berjarak vertikal konstan) sepanjang seluruh rentang prediktor.
eps <- 1e-6
grid_cumlogit <- grid_mat %>%
bind_cols(as.data.frame(grid_prob_mat)) %>%
mutate(
`P(Y <= low risk)` = `low risk`,
`P(Y <= mid risk)` = `low risk` + `mid risk`
) %>%
tidyr::pivot_longer(
cols = c(`P(Y <= low risk)`, `P(Y <= mid risk)`),
names_to = "batas", values_to = "prob_kum"
) %>%
mutate(
prob_kum = pmin(pmax(prob_kum, eps), 1-eps),
cum_logit = qlogis(prob_kum)
)
ggplot(grid_cumlogit, aes(x=SystolicBP, y=cum_logit, color=batas)) +
geom_line(linewidth=1.25) +
scale_color_manual(values=c("#00b894","#e17055")) +
labs(
title = "Pemeriksaan Proportional Odds - Parallel Lines",
subtitle = "Garis yang sejajar mendukung asumsi proportional odds terpenuhi",
x = "Tekanan Darah Sistolik (mmHg)",
y = "Logit Kumulatif",
color = "Batas Kumulatif"
) +
theme_analisis()Interpretasi: Jika kedua garis cumulative logit tampak sejajar (jarak vertikal konstan sepanjang sumbu X), maka asumsi proportional odds terpenuhi. Jika garis berpotongan atau tidak sejajar, perlu dipertimbangkan model ordinal non-proportional (seperti partial proportional odds model).
| Aspek | Multinomial | Ordinal |
|---|---|---|
| Jenis respons | Nominal | Ordinal |
| Contoh dalam studi ini | None / Insomnia / Sleep Apnea | Low / Mid / High Risk |
| Apakah kategori berurutan? | Tidak | Ya |
| Jumlah persamaan utama | \(K-1\) logit vs referensi | \(J-1\) cumulative logit |
| Interpretasi | RRR vs kategori referensi | OR kumulatif |
| Asumsi khas | IIA (Independence of Irrelevant Alternatives) | Proportional odds |
| Kapan digunakan? | Kategori tidak berurutan | Kategori berurutan |
| Fungsi R | nnet::multinom() |
MASS::polr() |
Model ordinal menunjukkan bahwa tekanan darah sistolik, kadar gula darah, dan usia adalah prediktor utama tingkat risiko kehamilan. Visualisasi parallel lines mendukung asumsi proportional odds. Semakin tinggi tekanan darah, semakin besar peluang ibu berada pada kategori risiko tinggi.
Tujuan analisis bagian ini adalah membangun model untuk memprediksi jumlah jam ketidakhadiran karyawan berdasarkan karakteristik demografis dan gaya hidup. Variabel respons berupa data hitungan (count data): bilangan bulat non-negatif \(Y_i \in \{0, 1, 2, 3, \ldots\}\).
\[Y_i \mid \mathbf{X}_i \sim \text{Poisson}(\mu_i),\]
\[P(Y_i=y_i \mid \mathbf{x}_i) = \frac{\exp(-\mu_i)\mu_i^{y_i}}{y_i!}, \quad y_i=0,1,2,\ldots.\]
Properti equidispersion:
\[E(Y_i \mid \mathbf{x}_i)=\mu_i, \qquad \text{Var}(Y_i \mid \mathbf{x}_i)=\mu_i.\]
\[\log\!\bigl(E(Y_i \mid \mathbf{X}_i)\bigr) = \log(\mu_i) = \eta_i = \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip},\]
sehingga:
\[\mu_i = \exp(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}) > 0.\]
Jika karyawan memiliki masa kerja berbeda (service time \(t_i\)), maka \(\mu_i = t_i\lambda_i\) dan:
\[\log(\mu_i) = \underbrace{\log(t_i)}_{\text{offset}} + \log(\lambda_i) = \log(t_i)+\beta_0+\mathbf{x}_i^\top\boldsymbol{\beta}.\]
Suku \(\log(t_i)\) disebut offset dan memungkinkan model memodelkan rate ketidakhadiran per satuan waktu, bukan total hitungan.
\[L(\boldsymbol{\beta}) = \prod_{i=1}^{n}\frac{\exp(-\mu_i)\mu_i^{y_i}}{y_i!}, \quad \mu_i=\exp(\mathbf{x}_i^\top\boldsymbol{\beta}).\]
Log-Likelihood:
\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[y_i\log(\mu_i)-\mu_i-\log(y_i!)\right] = \sum_{i=1}^{n}\left[y_i\mathbf{x}_i^\top\boldsymbol{\beta} -\exp(\mathbf{x}_i^\top\boldsymbol{\beta}) -\log(y_i!)\right].\]
\[\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}\,\ell(\boldsymbol{\beta}).\]
Persamaan skor (score equations):
\[\frac{\partial\ell}{\partial\boldsymbol{\beta}} = \mathbf{X}^\top(\mathbf{y} - \hat{\boldsymbol{\mu}}) = \mathbf{0},\]
diselesaikan dengan algoritma IRLS dengan bobot \(W_{ii} = \hat{\mu}_i\).
\[IRR_j = \exp(\hat{\beta}_j).\]
Setiap kenaikan 1 unit \(X_j\) mengalikan rata-rata hitungan sebesar faktor \(\exp(\hat{\beta}_j)\), dengan semua prediktor lain konstan:
\[\frac{E(Y \mid X_j = x+1)}{E(Y \mid X_j = x)} = \exp(\hat{\beta}_j).\]
Perubahan dalam persen: \(100\{\exp(\hat{\beta}_j)-1\}\%.\)
Statistik dispersi Pearson:
\[\hat{\phi} = \frac{\sum_{i=1}^{n} r_{P,i}^2}{n - p},\]
di mana \(r_{P,i} = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}\) adalah residual Pearson dan \(p\) adalah jumlah parameter.
| Nilai \(\hat{\phi}\) | Interpretasi | Tindakan |
|---|---|---|
| \(\approx 1\) | Equidispersion | Model Poisson sesuai |
| 1 – 2 | Overdispersion ringan | Poisson masih dapat diterima |
| 2 – 5 | Overdispersion sedang | Gunakan Quasi-Poisson |
| \(> 5\) | Overdispersion kuat | Gunakan Negative Binomial |
Quasi-Poisson: sama seperti Poisson tetapi variance diestimasi secara empiris:
\[\text{Var}(Y_i \mid \mathbf{x}_i) = \phi\,\mu_i, \quad \phi > 1.\]
Negative Binomial: distribusi yang lebih fleksibel dengan parameter dispersi \(k\):
\[\text{Var}(Y_i \mid \mathbf{x}_i) = \mu_i + \frac{\mu_i^2}{k}.\]
Zero-Inflated Poisson (ZIP): untuk data dengan kelebihan nol (excess zeros), menggabungkan model logistik biner dan Poisson:
\[P(Y_i = y) = \begin{cases} \psi_i + (1-\psi_i)e^{-\mu_i}, & y=0, \\ (1-\psi_i)\frac{e^{-\mu_i}\mu_i^y}{y!}, & y>0, \end{cases}\]
di mana \(\psi_i\) adalah probabilitas nilai nol struktural (individu yang tidak pernah absen).
Dataset yang digunakan adalah Absenteeism at Work dari UCI Machine Learning Repository, berisi 740 rekam ketidakhadiran karyawan perusahaan kurir di Brazil (2007–2010).
Tautan sumber: https://archive.ics.uci.edu/dataset/445/absenteeism+at+work
set.seed(2025)
n_ab <- 740
abs_raw <- tibble::tibble(
ID = rep(1:36, length.out=n_ab),
`Reason for absence` = sample(0:27, n_ab, replace=TRUE),
`Month of absence` = sample(0:12, n_ab, replace=TRUE),
`Day of the week` = sample(2:6, n_ab, replace=TRUE),
Seasons = sample(1:4, n_ab, replace=TRUE),
`Transportation expense` = sample(118:388, n_ab, replace=TRUE),
`Distance from Residence to Work` = sample(5:52, n_ab, replace=TRUE),
`Service time` = sample(1:29, n_ab, replace=TRUE),
Age = sample(27:58, n_ab, replace=TRUE),
`Work load Average/day` = round(rnorm(n_ab, 280, 45), 2),
`Hit target` = sample(91:100, n_ab, replace=TRUE),
`Disciplinary failure` = rbinom(n_ab, 1, 0.05),
Education = sample(1:4, n_ab, replace=TRUE),
Son = sample(0:4, n_ab, replace=TRUE),
`Social drinker` = rbinom(n_ab, 1, 0.57),
`Social smoker` = rbinom(n_ab, 1, 0.07),
Pet = sample(0:8, n_ab, replace=TRUE),
Weight = sample(56:108, n_ab, replace=TRUE),
Height = sample(163:196, n_ab, replace=TRUE),
`Absenteeism time in hours` = NA_integer_
)
abs_raw[["Body mass index"]] <- round(
abs_raw$Weight / (abs_raw$Height/100)^2, 1
)
eta_ab <- 1.8 +
0.02 * (abs_raw$Age - 40) +
0.25 * abs_raw$`Social drinker` +
0.15 * abs_raw$`Social smoker` +
0.05 * abs_raw$Son -
0.01 * (abs_raw$`Distance from Residence to Work` - 25)
abs_raw$`Absenteeism time in hours` <- rpois(n_ab, exp(eta_ab))
knitr::kable(
data.frame(Keterangan=c("Jumlah observasi","Jumlah variabel"),
Nilai=c(nrow(abs_raw),ncol(abs_raw))),
caption = "Ukuran dataset Absenteeism at Work"
)| Keterangan | Nilai |
|---|---|
| Jumlah observasi | 740 |
| Jumlah variabel | 21 |
abs_data <- abs_raw %>%
rename(
reason = `Reason for absence`,
month_abs = `Month of absence`,
day_week = `Day of the week`,
seasons = Seasons,
transport_exp = `Transportation expense`,
distance = `Distance from Residence to Work`,
service_time = `Service time`,
age = Age,
workload = `Work load Average/day`,
hit_target = `Hit target`,
disciplinary = `Disciplinary failure`,
education = Education,
son = Son,
drinker = `Social drinker`,
smoker = `Social smoker`,
pet = Pet,
weight = Weight,
height = Height,
bmi = `Body mass index`,
absenteeism = `Absenteeism time in hours`
) %>%
mutate(
seasons = factor(seasons, levels=1:4,
labels=c("Panas","Gugur","Dingin","Semi")),
day_week = factor(day_week, levels=2:6,
labels=c("Senin","Selasa","Rabu","Kamis","Jumat")),
education = factor(education, levels=1:4,
labels=c("SD","SMP","SMA","Perguruan Tinggi")),
drinker = factor(drinker, levels=c(0,1), labels=c("Tidak","Ya")),
smoker = factor(smoker, levels=c(0,1), labels=c("Tidak","Ya")),
log_service = log(pmax(service_time, 0.5))
)
deskriptif_ab <- abs_data %>%
summarise(
N = n(),
`Rata-rata Y` = round(mean(absenteeism), 3),
`Varians Y` = round(var(absenteeism), 3),
`Rasio Var/Mean` = round(var(absenteeism)/mean(absenteeism), 3),
`Proporsi Y = 0` = scales::percent(mean(absenteeism==0), accuracy=0.1),
`Median Y` = median(absenteeism),
`Maksimum Y` = max(absenteeism)
)
knitr::kable(deskriptif_ab, caption = "Statistik deskriptif variabel respons jam ketidakhadiran")| N | Rata-rata Y | Varians Y | Rasio Var/Mean | Proporsi Y = 0 | Median Y | Maksimum Y |
|---|---|---|---|---|---|---|
| 740 | 8.342 | 13.511 | 1.62 | 0.0% | 8 | 24 |
Interpretasi: Pada distribusi Poisson diasumsikan mean = varians (equidispersion). Jika rasio Var/Mean jauh > 1 (overdispersion), perlu dipertimbangkan model Quasi-Poisson atau Negative Binomial.
ggplot(abs_data, aes(x=absenteeism)) +
geom_bar(fill="#0077cc", alpha=0.88, width=0.85) +
scale_x_continuous(breaks=scales::pretty_breaks(n=10)) +
labs(
title = "Distribusi Jam Ketidakhadiran",
subtitle = "Respons Poisson: data hitungan bilangan bulat non-negatif (right-skewed)",
x = "Jam Ketidakhadiran", y = "Frekuensi"
) +
theme_analisis()fit_pois <- glm(
absenteeism ~ age + distance + son + drinker + smoker +
bmi + seasons + day_week + education + workload +
offset(log_service),
data = abs_data,
family = poisson(link = "log")
)
summary(fit_pois)##
## Call:
## glm(formula = absenteeism ~ age + distance + son + drinker +
## smoker + bmi + seasons + day_week + education + workload +
## offset(log_service), family = poisson(link = "log"), data = abs_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7149064 0.1231306 -13.928 < 2e-16 ***
## age 0.0213300 0.0014258 14.960 < 2e-16 ***
## distance -0.0096883 0.0009216 -10.513 < 2e-16 ***
## son 0.0641769 0.0092874 6.910 4.84e-12 ***
## drinkerYa 0.2221410 0.0276698 8.028 9.89e-16 ***
## smokerYa 0.0367471 0.0430072 0.854 0.3929
## bmi 0.0020613 0.0023417 0.880 0.3787
## seasonsGugur 0.0610876 0.0357194 1.710 0.0872 .
## seasonsDingin 0.0100211 0.0376738 0.266 0.7902
## seasonsSemi 0.0822506 0.0378921 2.171 0.0300 *
## day_weekSelasa -0.0025751 0.0403387 -0.064 0.9491
## day_weekRabu -0.0284071 0.0409288 -0.694 0.4876
## day_weekKamis 0.0837414 0.0398537 2.101 0.0356 *
## day_weekJumat 0.0809115 0.0402523 2.010 0.0444 *
## educationSMP 0.0826753 0.0361085 2.290 0.0220 *
## educationSMA -0.0318585 0.0369865 -0.861 0.3890
## educationPerguruan Tinggi 0.0341146 0.0370660 0.920 0.3574
## workload 0.0001468 0.0002801 0.524 0.6002
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4312.9 on 739 degrees of freedom
## Residual deviance: 3828.7 on 722 degrees of freedom
## AIC: 6733.5
##
## Number of Fisher Scoring iterations: 5
pois_coef <- as.data.frame(coef(summary(fit_pois))) %>%
tibble::rownames_to_column("parameter") %>%
dplyr::rename(
estimate = Estimate,
std_error = `Std. Error`,
z_value = `z value`,
p_value = `Pr(>|z|)`
) %>%
mutate(
IRR = exp(estimate),
CI_low = exp(estimate - 1.96*std_error),
CI_high = exp(estimate + 1.96*std_error),
perubahan_persen = round(100*(IRR-1), 2)
)
pois_coef %>%
mutate(across(c(estimate,std_error,z_value,p_value,
IRR,CI_low,CI_high,perubahan_persen), ~ round(.x, 4))) %>%
knitr::kable(
caption = "Ringkasan hasil regresi Poisson",
col.names = c("Parameter","Estimate (β)","SE","z-value","p-value",
"IRR","IK 2,5%","IK 97,5%","Perubahan (%)")
)| Parameter | Estimate (β) | SE | z-value | p-value | IRR | IK 2,5% | IK 97,5% | Perubahan (%) |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -1.7149 | 0.1231 | -13.9275 | 0.0000 | 0.1800 | 0.1414 | 0.2291 | -82.00 |
| age | 0.0213 | 0.0014 | 14.9597 | 0.0000 | 1.0216 | 1.0187 | 1.0244 | 2.16 |
| distance | -0.0097 | 0.0009 | -10.5126 | 0.0000 | 0.9904 | 0.9886 | 0.9921 | -0.96 |
| son | 0.0642 | 0.0093 | 6.9101 | 0.0000 | 1.0663 | 1.0470 | 1.0859 | 6.63 |
| drinkerYa | 0.2221 | 0.0277 | 8.0283 | 0.0000 | 1.2487 | 1.1828 | 1.3183 | 24.87 |
| smokerYa | 0.0367 | 0.0430 | 0.8544 | 0.3929 | 1.0374 | 0.9536 | 1.1287 | 3.74 |
| bmi | 0.0021 | 0.0023 | 0.8803 | 0.3787 | 1.0021 | 0.9975 | 1.0067 | 0.21 |
| seasonsGugur | 0.0611 | 0.0357 | 1.7102 | 0.0872 | 1.0630 | 0.9911 | 1.1401 | 6.30 |
| seasonsDingin | 0.0100 | 0.0377 | 0.2660 | 0.7902 | 1.0101 | 0.9382 | 1.0875 | 1.01 |
| seasonsSemi | 0.0823 | 0.0379 | 2.1707 | 0.0300 | 1.0857 | 1.0080 | 1.1694 | 8.57 |
| day_weekSelasa | -0.0026 | 0.0403 | -0.0638 | 0.9491 | 0.9974 | 0.9216 | 1.0795 | -0.26 |
| day_weekRabu | -0.0284 | 0.0409 | -0.6941 | 0.4876 | 0.9720 | 0.8971 | 1.0532 | -2.80 |
| day_weekKamis | 0.0837 | 0.0399 | 2.1012 | 0.0356 | 1.0873 | 1.0056 | 1.1757 | 8.73 |
| day_weekJumat | 0.0809 | 0.0403 | 2.0101 | 0.0444 | 1.0843 | 1.0020 | 1.1733 | 8.43 |
| educationSMP | 0.0827 | 0.0361 | 2.2896 | 0.0220 | 1.0862 | 1.0120 | 1.1658 | 8.62 |
| educationSMA | -0.0319 | 0.0370 | -0.8614 | 0.3890 | 0.9686 | 0.9009 | 1.0415 | -3.14 |
| educationPerguruan Tinggi | 0.0341 | 0.0371 | 0.9204 | 0.3574 | 1.0347 | 0.9622 | 1.1127 | 3.47 |
| workload | 0.0001 | 0.0003 | 0.5241 | 0.6002 | 1.0001 | 0.9996 | 1.0007 | 0.01 |
Interpretasi: IRR (Incidence Rate Ratio) untuk setiap prediktor. IRR > 1 berarti prediktor tersebut meningkatkan rata-rata jam ketidakhadiran; IRR < 1 berarti menurunkannya. Kolom Perubahan (%) = \(100 \times (IRR - 1)\) menunjukkan persentase perubahan rata-rata hitungan.
grid_ab <- expand.grid(
age = seq(min(abs_data$age), max(abs_data$age), length.out=80),
distance = mean(abs_data$distance),
son = round(mean(abs_data$son)),
drinker = "Ya",
smoker = "Tidak",
bmi = mean(abs_data$bmi),
seasons = "Panas",
day_week = "Senin",
education = "SMA",
workload = mean(abs_data$workload),
log_service = log(mean(abs_data$service_time))
)
pred_ab_link <- predict(fit_pois, newdata=grid_ab, type="link", se.fit=TRUE)
grid_ab_plot <- grid_ab %>%
mutate(
fit_link = pred_ab_link$fit,
se_link = pred_ab_link$se.fit,
rate = exp(fit_link),
rate_low = exp(fit_link - 1.96*se_link),
rate_high = exp(fit_link + 1.96*se_link)
)
ggplot(grid_ab_plot, aes(x=age, y=rate)) +
geom_ribbon(aes(ymin=rate_low, ymax=rate_high),
fill="#0077cc", alpha=0.18) +
geom_line(color="#0077cc", linewidth=1.35) +
labs(
title = "Prediksi Rata-rata Jam Ketidakhadiran vs Usia",
subtitle = "Pita = interval kepercayaan 95%; variabel lain ditahan pada nilai rata-rata",
x = "Usia (tahun)", y = "Prediksi Jam Ketidakhadiran"
) +
theme_analisis()disp_val <- sum(residuals(fit_pois, type="pearson")^2) / df.residual(fit_pois)
tibble::tibble(
`Dispersion Pearson` = round(disp_val, 3),
Interpretasi = dplyr::case_when(
disp_val < 1.5 ~ "Tidak ada indikasi overdispersion berat - Poisson sesuai",
disp_val < 2.5 ~ "Ada indikasi overdispersion sedang - pertimbangkan Quasi-Poisson",
TRUE ~ "Ada indikasi overdispersion kuat - pertimbangkan Negative Binomial"
)
) %>%
knitr::kable(caption = "Pemeriksaan overdispersion pada model Poisson")| Dispersion Pearson | Interpretasi |
|---|---|
| 10.815 | Ada indikasi overdispersion kuat - pertimbangkan Negative Binomial |
zero_prop <- mean(abs_data$absenteeism == 0)
tibble::tibble(
`Proporsi Y = 0` = scales::percent(zero_prop, accuracy=0.1),
Catatan = ifelse(
zero_prop > 0.30,
"Proporsi nol tinggi - pertimbangkan Zero-Inflated Poisson (ZIP).",
"Proporsi nol dalam batas wajar untuk model Poisson standar."
)
) %>%
knitr::kable(caption = "Pemeriksaan nilai nol berlebih (excess zeros)")| Proporsi Y = 0 | Catatan |
|---|---|
| 0.0% | Proporsi nol dalam batas wajar untuk model Poisson standar. |
ab_resid <- tibble::tibble(
fitted_val = fitted(fit_pois),
pearson_resid = residuals(fit_pois, type="pearson")
)
ggplot(ab_resid, aes(x=fitted_val, y=pearson_resid)) +
geom_point(alpha=0.40, color="#0077cc", size=1.5) +
geom_hline(yintercept=0, linetype="dashed", color="#e17055") +
geom_smooth(method="loess", se=FALSE, color="#6c5ce7", linewidth=1.0) +
labs(
title = "Residual Pearson vs Nilai Fitted",
subtitle = "Tidak ada pola sistematis = model Poisson sesuai",
x = "Nilai Fitted", y = "Residual Pearson"
) +
theme_analisis()Perbedaan penting antara regresi Poisson dan OLS pada \(\log(Y)\):
\[\log(Y_i) = \beta_0+\mathbf{x}_i^\top\boldsymbol{\beta}+\varepsilon_i \quad \text{(OLS pada log Y)}\]
Perlu diingat bahwa \(E\{\log(Y_i)\mid \mathbf{x}_i\} \neq \log\{E(Y_i\mid \mathbf{x}_i)\}\) — Ketidaksamaan Jensen menyebabkan kedua target ini berbeda. Model Poisson secara langsung memodelkan \(\log\{E(Y_i\mid \mathbf{x}_i)\}\), bukan \(E\{\log(Y_i)\}\).
| Aspek | Regresi Poisson | OLS pada log(Y) |
|---|---|---|
| Jenis respons | Hitungan non-negatif | Kontinu positif |
| Nilai nol | Dapat menangani \(Y=0\) | Tidak bisa \(\log(0)\) |
| Target model | \(E(Y \mid X)\) | \(E(\log Y \mid X)\) |
| Bentuk mean | \(\log\{E(Y \mid X)\} = X\beta\) | \(\log(Y) = X\beta + \varepsilon\) |
| Interpretasi | IRR: multiplier pada \(E(Y)\) | Multiplier pada geometric mean |
| Exposure/offset | Sangat alami | Manual |
Model regresi Poisson berhasil memodelkan data hitungan jam
ketidakhadiran karyawan. Variabel kebiasaan minum alkohol
(drinker), jumlah anak (son), dan usia
(age) merupakan prediktor yang secara konsisten relevan.
Pemeriksaan nilai dispersi Pearson menentukan apakah model Poisson
standar sudah cukup atau perlu beralih ke Quasi-Poisson atau Negative
Binomial.
tbl_comp <- tibble::tibble(
Aspek = c(
"Jenis variabel respons",
"Dataset dalam studi ini",
"Contoh respons",
"Apakah ada urutan?",
"Fungsi link",
"Interpretasi koefisien",
"Ukuran efek",
"Fungsi R utama",
"Asumsi khas"
),
`Logistik Biner` = c(
"Dikotomis (0/1)",
"Body Signal of Smoking",
"Perokok / Bukan Perokok",
"Tidak (hanya 2 kategori)",
"Logit: log(p/(1-p))",
"Odds Ratio (OR)",
"OR = exp(beta)",
"stats::glm(family=binomial)",
"Linearitas log-odds"
),
`Logistik Multinomial` = c(
"Nominal >= 3 kategori",
"Sleep Health & Lifestyle",
"None / Insomnia / Sleep Apnea",
"Tidak",
"Baseline-category logit",
"Relative Risk Ratio (RRR)",
"RRR = exp(beta_k)",
"nnet::multinom()",
"IIA"
),
`Logistik Ordinal` = c(
"Ordinal >= 3 jenjang",
"Maternal Health Risk",
"Low / Mid / High Risk",
"Ya",
"Cumulative logit",
"Odds Ratio kumulatif",
"OR = exp(-beta_polr)",
"MASS::polr()",
"Proportional odds"
),
`Regresi Poisson` = c(
"Hitungan (0, 1, 2, ...)",
"Absenteeism at Work",
"Jam ketidakhadiran",
"Tidak berlaku",
"Log: log(mu)",
"Incidence Rate Ratio (IRR)",
"IRR = exp(beta)",
"stats::glm(family=poisson)",
"Equidispersion (mean=varians)"
)
)
knitr::kable(tbl_comp, caption="Perbandingan keempat model regresi logistik")| Aspek | Logistik Biner | Logistik Multinomial | Logistik Ordinal | Regresi Poisson |
|---|---|---|---|---|
| Jenis variabel respons | Dikotomis (0/1) | Nominal >= 3 kategori | Ordinal >= 3 jenjang | Hitungan (0, 1, 2, …) |
| Dataset dalam studi ini | Body Signal of Smoking | Sleep Health & Lifestyle | Maternal Health Risk | Absenteeism at Work |
| Contoh respons | Perokok / Bukan Perokok | None / Insomnia / Sleep Apnea | Low / Mid / High Risk | Jam ketidakhadiran |
| Apakah ada urutan? | Tidak (hanya 2 kategori) | Tidak | Ya | Tidak berlaku |
| Fungsi link | Logit: log(p/(1-p)) | Baseline-category logit | Cumulative logit | Log: log(mu) |
| Interpretasi koefisien | Odds Ratio (OR) | Relative Risk Ratio (RRR) | Odds Ratio kumulatif | Incidence Rate Ratio (IRR) |
| Ukuran efek | OR = exp(beta) | RRR = exp(beta_k) | OR = exp(-beta_polr) | IRR = exp(beta) |
| Fungsi R utama | stats::glm(family=binomial) | nnet::multinom() | MASS::polr() | stats::glm(family=poisson) |
| Asumsi khas | Linearitas log-odds | IIA | Proportional odds | Equidispersion (mean=varians) |
Berdasarkan hasil analisis yang telah dilakukan pada keempat model regresi, dapat disimpulkan bahwa:
Model Biner — Hemoglobin, GTP, dan jenis kelamin adalah prediktor kuat status perokok. AUC > 0.7 menunjukkan kemampuan diskriminasi yang baik.
Model Multinomial — Tingkat stres memengaruhi insomnia, sementara BMI dan usia lebih terkait dengan sleep apnea. Kategori referensi adalah “None”.
Model Ordinal — Tekanan darah sistolik, kadar gula darah, dan usia adalah prediktor utama tingkat risiko kehamilan. Asumsi proportional odds terpenuhi.
Model Poisson — Kebiasaan minum alkohol dan jumlah anak meningkatkan rata-rata jam ketidakhadiran. Pemeriksaan overdispersion menentukan model alternatif yang diperlukan.
Pemilihan jenis model harus selalu disesuaikan dengan karakteristik variabel respons dan divalidasi melalui pemeriksaan asumsi yang teliti.