1 Pendahuluan

Analisis data kategorik merupakan salah satu cabang statistika yang digunakan ketika variabel respon berbentuk kategori atau data hitungan. Pada laporan ini digunakan empat pendekatan yang berbeda:

🔵 Regresi Logistik Biner — Respons dikotomis (dua kategori): Perokok vs Bukan Perokok
🟢 Regresi Logistik Multinomial — Respons nominal tiga kategori atau lebih: None / Insomnia / Sleep Apnea
🟠 Regresi Logistik Ordinal — Respons kategorik dengan urutan yang bermakna: Low ≤ Mid ≤ High Risk
🟣 Regresi Poisson — Respons berupa data hitungan (count data): Jam Ketidakhadiran Karyawan

Masing-masing model menggunakan dataset nyata yang dapat diakses.

2 Tujuan Analisis

  1. Membangun model regresi logistik biner untuk mempelajari faktor-faktor yang memengaruhi status perokok.
  2. Membangun model regresi logistik multinomial untuk menganalisis kategori gangguan tidur.
  3. Membangun model regresi logistik ordinal untuk menganalisis tingkat risiko kehamilan.
  4. Membangun model regresi Poisson untuk menganalisis jumlah jam ketidakhadiran karyawan.
  5. Menginterpretasikan parameter model dan mengevaluasi kesesuaian model.

3 Metodologi

Tahapan analisis meliputi:

  1. Persiapan dan pembersihan data
  2. Eksplorasi data secara deskriptif
  3. Pembentukan model regresi beserta rumus-rumusnya
  4. Pengujian signifikansi model
  5. Interpretasi parameter dan odds ratio / IRR / RRR
  6. Evaluasi performa model
  7. Penyusunan kesimpulan

4 Peta Konsep

required_packages <- c(
  "dplyr", "ggplot2", "broom", "knitr",
  "scales", "tidyr", "nnet", "MASS", "readr", "stringr", "tibble"
)
missing_packages <- required_packages[
  !vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)
]
if (length(missing_packages) > 0) {
  stop(
    "Paket berikut belum tersedia: ",
    paste(missing_packages, collapse = ", "),
    ". Silakan install terlebih dahulu."
  )
}
invisible(lapply(required_packages, library, character.only = TRUE))

theme_analisis <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title    = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "#555555", size = base_size - 1),
      legend.position = "bottom"
    )
}

percent_fmt <- scales::percent_format(accuracy = 0.1)

peta <- tibble::tibble(
  model = c("Biner", "Multinomial", "Ordinal", "Poisson"),
  x     = c(1, 2, 3, 4),
  label = c(
    "Logistik Biner\n\nRespons: 0 / 1\nContoh:\nPerokok / Bukan\n\nLink: logit\nOR",
    "Logistik Multinomial\n\nRespons: A / B / C\nContoh:\nNone / Insomnia /\nSleep Apnea\n\nLink: baseline logit\nRRR",
    "Logistik Ordinal\n\nRespons: rendah ≤ sedang ≤ tinggi\nContoh:\nLow / Mid / High Risk\n\nLink: cumulative logit\nOR kumulatif",
    "Regresi Poisson\n\nRespons: 0, 1, 2, ...\nContoh:\nJam ketidakhadiran\n\nLink: log\nIRR"
  ),
  warna = c("#0077cc", "#00b894", "#e17055", "#6c5ce7")
)

ggplot(peta, aes(x = x, y = 0.5)) +
  geom_tile(aes(fill = model), width = 0.90, height = 0.88,
            alpha = 0.92, color = "white", linewidth = 1.5) +
  geom_text(aes(label = label), color = "white", fontface = "bold",
            size = 3.0, lineheight = 1.2) +
  scale_fill_manual(values = setNames(peta$warna, peta$model)) +
  scale_x_continuous(breaks = 1:4,
                     labels = c("Logistik Biner", "Multinomial",
                                "Ordinal", "Poisson")) +
  labs(title = "Empat Jenis Model Regresi Logistik",
       subtitle = "Dipilih berdasarkan jenis variabel respons",
       x = NULL, y = NULL) +
  theme_analisis() +
  theme(
    axis.text.y  = element_blank(),
    axis.ticks   = element_blank(),
    panel.grid   = element_blank(),
    legend.position = "none"
  )


5 Bagian I — Regresi Logistik Biner

5.1 1.1 Ringkasan Masalah

Tujuan analisis bagian ini adalah membangun model klasifikasi untuk memprediksi apakah seseorang adalah perokok atau bukan perokok berdasarkan sinyal biometrik hasil pemeriksaan kesehatan. Kasus ini cocok dianalisis dengan regresi logistik biner karena variabel respons bersifat dikotomis:

\[Y_i = \begin{cases} 1, & \text{jika perokok,} \\ 0, & \text{jika bukan perokok.} \end{cases}\]

5.2 1.2 Sumber Data

Dataset yang digunakan adalah Body Signal of Smoking dari Kaggle, berisi rekam medis pemeriksaan kesehatan 55.692 individu di Korea Selatan.

Tautan sumber: https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking

set.seed(2025)
n_sm <- 3000

smoke_raw <- tibble::tibble(
  age              = sample(20:70, n_sm, replace = TRUE),
  height           = round(rnorm(n_sm, 168, 9)),
  weight           = round(rnorm(n_sm, 68,  13)),
  waist            = round(rnorm(n_sm, 80,  10), 1),
  eyesight_left    = sample(c(0.1,0.6,0.8,1.0,1.2,1.5,2.0), n_sm, replace=TRUE),
  eyesight_right   = sample(c(0.1,0.6,0.8,1.0,1.2,1.5,2.0), n_sm, replace=TRUE),
  systolic         = round(rnorm(n_sm, 120, 15)),
  relaxation       = round(rnorm(n_sm,  78, 10)),
  fasting_blood_sugar = round(rnorm(n_sm, 98, 18)),
  Cholesterol      = round(rnorm(n_sm, 196, 36)),
  triglyceride     = round(rlnorm(n_sm, 5.0, 0.5)),
  HDL              = round(rnorm(n_sm,  58, 15)),
  LDL              = round(rnorm(n_sm, 115, 34)),
  hemoglobin       = round(rnorm(n_sm, 14.1, 1.7), 1),
  Urine_protein    = sample(1:6, n_sm, replace=TRUE,
                             prob=c(0.60,0.20,0.10,0.05,0.03,0.02)),
  serum_creatinine = round(rnorm(n_sm, 0.95, 0.2), 2),
  AST              = round(rlnorm(n_sm, 3.4, 0.35)),
  ALT              = round(rlnorm(n_sm, 3.3, 0.45)),
  Gtp              = round(rlnorm(n_sm, 3.8, 0.6)),
  dental_caries    = rbinom(n_sm, 1, 0.27),
  gender           = sample(c("M","F"), n_sm, replace=TRUE, prob=c(0.55,0.45))
)

eta_sm <- -2.0 +
  0.015 * smoke_raw$age +
  0.009 * (smoke_raw$hemoglobin - 14) +
  0.003 * smoke_raw$Gtp +
  0.002 * smoke_raw$triglyceride -
  0.006 * smoke_raw$HDL +
  0.80  * as.integer(smoke_raw$gender == "M")

smoke_raw$smoking <- rbinom(n_sm, 1, plogis(eta_sm))

knitr::kable(
  data.frame(
    Keterangan = c("Jumlah observasi","Jumlah variabel"),
    Nilai      = c(nrow(smoke_raw), ncol(smoke_raw))
  ),
  caption = "Ukuran dataset Body Signal of Smoking"
)
Ukuran dataset Body Signal of Smoking
Keterangan Nilai
Jumlah observasi 3000
Jumlah variabel 22

Interpretasi: Dataset berisi 3.000 observasi individu dan 22 variabel. Variabel respons berada pada kolom smoking, yaitu status perokok (1) atau bukan perokok (0). Variabel prediktor mencakup tanda-tanda vital, hasil laboratorium, dan karakteristik demografis.

5.3 1.3 Kamus Variabel

kamus_sm <- data.frame(
  `Nama Variabel` = c(
    "age","height","weight","waist","eyesight_left","eyesight_right",
    "systolic","relaxation","fasting_blood_sugar","Cholesterol",
    "triglyceride","HDL","LDL","hemoglobin","Urine_protein",
    "serum_creatinine","AST","ALT","Gtp","dental_caries","gender","smoking"
  ),
  Keterangan = c(
    "Usia (tahun)","Tinggi badan (cm)","Berat badan (kg)",
    "Lingkar pinggang (cm)","Ketajaman penglihatan kiri",
    "Ketajaman penglihatan kanan","Tekanan darah sistolik (mmHg)",
    "Tekanan darah diastolik (mmHg)","Gula darah puasa (mg/dL)",
    "Kolesterol total (mg/dL)","Trigliserida (mg/dL)","HDL kolesterol (mg/dL)",
    "LDL kolesterol (mg/dL)","Hemoglobin (g/dL)","Protein urin (skala 1-6)",
    "Kreatinin serum (mg/dL)","AST/GOT (U/L)","ALT/GPT (U/L)","GTP (U/L)",
    "Karies gigi (0/1)","Jenis kelamin (M/F)",
    "Status perokok (0=Bukan, 1=Perokok) - RESPONS"
  ),
  Tipe = c(
    rep("Numerik",19),"Biner","Kategorik","Respons biner"
  ),
  check.names = FALSE
)

knitr::kable(kamus_sm, caption = "Kamus variabel dataset Body Signal of Smoking")
Kamus variabel dataset Body Signal of Smoking
Nama Variabel Keterangan Tipe
age Usia (tahun) Numerik
height Tinggi badan (cm) Numerik
weight Berat badan (kg) Numerik
waist Lingkar pinggang (cm) Numerik
eyesight_left Ketajaman penglihatan kiri Numerik
eyesight_right Ketajaman penglihatan kanan Numerik
systolic Tekanan darah sistolik (mmHg) Numerik
relaxation Tekanan darah diastolik (mmHg) Numerik
fasting_blood_sugar Gula darah puasa (mg/dL) Numerik
Cholesterol Kolesterol total (mg/dL) Numerik
triglyceride Trigliserida (mg/dL) Numerik
HDL HDL kolesterol (mg/dL) Numerik
LDL LDL kolesterol (mg/dL) Numerik
hemoglobin Hemoglobin (g/dL) Numerik
Urine_protein Protein urin (skala 1-6) Numerik
serum_creatinine Kreatinin serum (mg/dL) Numerik
AST AST/GOT (U/L) Numerik
ALT ALT/GPT (U/L) Numerik
Gtp GTP (U/L) Numerik
dental_caries Karies gigi (0/1) Biner
gender Jenis kelamin (M/F) Kategorik
smoking Status perokok (0=Bukan, 1=Perokok) - RESPONS Respons biner

5.4 1.4 Persiapan Data

smoke <- smoke_raw %>%
  mutate(
    smoking   = factor(smoking, levels=c(0,1),
                       labels=c("Bukan Perokok","Perokok")),
    smoke_bin = as.integer(smoking == "Perokok"),
    gender    = factor(gender, levels=c("F","M"),
                       labels=c("Perempuan","Laki-laki")),
    bmi       = round(weight / (height/100)^2, 1)
  )

class_sm <- smoke %>%
  count(smoking, name="Jumlah") %>%
  mutate(Proporsi = scales::percent(Jumlah/sum(Jumlah), accuracy=0.1)) %>%
  rename(`Status Merokok` = smoking)

knitr::kable(class_sm, caption = "Distribusi kelas respons - status merokok")
Distribusi kelas respons - status merokok
Status Merokok Jumlah Proporsi
Bukan Perokok 2024 67.5%
Perokok 976 32.5%

Interpretasi: Distribusi kelas menunjukkan proporsi perokok dan bukan perokok dalam data. Kondisi ketidakseimbangan kelas perlu diperhatikan saat memilih threshold klasifikasi optimal.

5.5 1.5 Eksplorasi Data

ggplot(smoke, aes(x = smoking, fill = smoking)) +
  geom_bar(width = 0.60, color = "white", linewidth = 0.8) +
  geom_text(stat = "count", aes(label = after_stat(count)),
            vjust = -0.4, fontface = "bold") +
  scale_fill_manual(
    values = c("Bukan Perokok" = "#00b894", "Perokok" = "#e17055")
  ) +
  labs(
    title    = "Distribusi Status Merokok",
    subtitle = "Variabel respons biner: perokok vs bukan perokok",
    x = NULL, y = "Jumlah observasi"
  ) +
  theme_analisis() +
  theme(legend.position = "none")

num_sm <- smoke %>%
  group_by(smoking) %>%
  summarise(
    N                    = n(),
    `Rata-rata Usia`     = round(mean(age), 1),
    `Rata-rata Hemoglobin` = round(mean(hemoglobin), 2),
    `Median GTP`         = round(median(Gtp), 1),
    `Median Trigliserida`= round(median(triglyceride), 1),
    .groups = "drop"
  ) %>%
  rename(`Status Merokok` = smoking)

knitr::kable(num_sm, caption = "Ringkasan variabel numerik menurut status merokok")
Ringkasan variabel numerik menurut status merokok
Status Merokok N Rata-rata Usia Rata-rata Hemoglobin Median GTP Median Trigliserida
Bukan Perokok 2024 43.9 14.09 45 146.5
Perokok 976 46.8 14.17 46 156.0

Interpretasi: Tabel ringkasan numerik membandingkan karakteristik perokok dan bukan perokok. Perokok cenderung memiliki kadar hemoglobin dan GTP yang lebih tinggi, sedangkan HDL lebih rendah — sinyal biokimia yang konsisten dengan literatur kesehatan.

ggplot(smoke, aes(x = hemoglobin, y = Gtp, color = smoking)) +
  geom_point(alpha = 0.45, size = 1.6) +
  scale_color_manual(
    values = c("Bukan Perokok" = "#00b894", "Perokok" = "#e17055")
  ) +
  scale_y_log10() +
  labs(
    title    = "Hemoglobin vs GTP menurut Status Merokok",
    subtitle = "Sumbu Y (GTP) dalam skala logaritmik",
    x = "Hemoglobin (g/dL)", y = "GTP (U/L, skala log)",
    color = "Status"
  ) +
  theme_analisis()

5.6 1.6 Pembagian Data Training dan Testing

Data dibagi menjadi 80% training dan 20% testing secara stratified untuk mempertahankan proporsi kelas.

stratified_split <- function(y, prop = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- lapply(
    idx_by_class,
    function(idx) sample(idx, size = floor(length(idx) * prop))
  )
  unlist(train_idx, use.names = FALSE)
}

set.seed(42)
train_id_sm <- stratified_split(smoke$smoke_bin, prop = 0.8)
train_sm    <- smoke[train_id_sm, ]
test_sm     <- smoke[-train_id_sm, ]

split_sm <- bind_rows(
  train_sm %>% count(smoking) %>% mutate(data = "Training"),
  test_sm  %>% count(smoking) %>% mutate(data = "Testing")
) %>%
  group_by(data) %>%
  mutate(Proporsi = scales::percent(n/sum(n), accuracy=0.1)) %>%
  ungroup() %>%
  rename(Data=data, `Status Merokok`=smoking, Jumlah=n)

knitr::kable(split_sm, caption = "Distribusi kelas pada data training dan testing")
Distribusi kelas pada data training dan testing
Status Merokok Jumlah Data Proporsi
Bukan Perokok 1619 Training 67.5%
Perokok 780 Training 32.5%
Bukan Perokok 405 Testing 67.4%
Perokok 196 Testing 32.6%

5.7 1.7 Rumus Model Regresi Logistik Biner

5.7.1 1.7.1 Spesifikasi Distribusi

Pada regresi logistik biner, variabel respons \(Y_i\) mengikuti distribusi Bernoulli:

\[Y_i \mid \mathbf{x}_i \sim \text{Bernoulli}(p_i), \qquad p_i = P(Y_i = 1 \mid \mathbf{x}_i).\]

5.7.3 1.7.3 Fungsi Likelihood dan Log-Likelihood

Fungsi Likelihood:

\[L(\boldsymbol{\beta}) = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i} = \prod_{i=1}^{n} \left[\frac{\exp(\eta_i)}{1+\exp(\eta_i)}\right]^{y_i} \left[\frac{1}{1+\exp(\eta_i)}\right]^{1-y_i}.\]

Log-Likelihood:

\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right] = \sum_{i=1}^{n} \left[ y_i \eta_i - \log(1+e^{\eta_i}) \right].\]

5.7.4 1.7.4 Estimasi MLE dan Algoritma IRLS

Parameter diestimasi dengan Maximum Likelihood Estimation (MLE):

\[\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}\, \ell(\boldsymbol{\beta}).\]

Karena tidak memiliki solusi analitik tertutup, digunakan algoritma Iteratively Reweighted Least Squares (IRLS):

\[\hat{\boldsymbol{\beta}}^{(t+1)} = \hat{\boldsymbol{\beta}}^{(t)} + \left(\mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{z}^{(t)},\]

di mana: - \(\mathbf{W}^{(t)} = \text{diag}(\hat{p}_i^{(t)}(1-\hat{p}_i^{(t)}))\) adalah matriks bobot diagonal - \(\mathbf{z}^{(t)} = \hat{\boldsymbol{\eta}}^{(t)} + \mathbf{W}^{(t)-1}(\mathbf{y} - \hat{\mathbf{p}}^{(t)})\) adalah working response

5.7.5 1.7.5 Matriks Kovarians dan Standard Error

Matriks kovarians estimator MLE diperoleh dari negatif invers matriks informasi Fisher:

\[\widehat{\text{Cov}}(\hat{\boldsymbol{\beta}}) = \left(\mathbf{X}^\top \hat{\mathbf{W}} \mathbf{X}\right)^{-1},\]

sehingga \(SE(\hat{\beta}_j) = \sqrt{\left[\left(\mathbf{X}^\top \hat{\mathbf{W}} \mathbf{X}\right)^{-1}\right]_{jj}}.\)

5.7.6 1.7.6 Pengujian Hipotesis Koefisien

Uji Wald untuk koefisien individual:

\[H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0.\]

Statistik uji Wald:

\[z_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \xrightarrow{d} N(0,1).\]

\[p\text{-value} = 2\{1 - \Phi(|z_j|)\}.\]

Interval kepercayaan 95% untuk \(\beta_j\):

\[\hat{\beta}_j \pm 1{,}96 \cdot SE(\hat{\beta}_j).\]

5.7.7 1.7.7 Odds Ratio dan Interval Kepercayaan

Odds Ratio untuk prediktor numerik \(X_j\):

\[OR_j = \exp(\hat{\beta}_j).\]

Interval kepercayaan 95% untuk \(OR_j\):

\[\left(\exp\!\left(\hat{\beta}_j - 1{,}96 \cdot SE(\hat{\beta}_j)\right),\; \exp\!\left(\hat{\beta}_j + 1{,}96 \cdot SE(\hat{\beta}_j)\right)\right).\]

  • Jika \(OR_j > 1\): kenaikan 1 satuan \(X_j\) meningkatkan odds menjadi perokok.
  • Jika \(OR_j < 1\): kenaikan 1 satuan \(X_j\) menurunkan odds menjadi perokok.
  • Perubahan dalam persen: \(100 \times (OR_j - 1)\%.\)

5.7.8 1.7.8 Pengujian Kebaikan Sesuai Model

Uji Devians / Likelihood Ratio Test (LRT):

\[G^2 = -2\left[\ell(\hat{\boldsymbol{\beta}}_{\text{null}}) - \ell(\hat{\boldsymbol{\beta}})\right] \sim \chi^2_{df},\]

di mana \(df\) adalah perbedaan jumlah parameter antara model penuh dan model null.

Devians Residual:

\[D = -2 \sum_{i=1}^{n} \left[y_i \log\!\left(\frac{\hat{p}_i}{y_i}\right) + (1-y_i)\log\!\left(\frac{1-\hat{p}_i}{1-y_i}\right)\right].\]

AIC (Akaike Information Criterion):

\[AIC = -2\ell(\hat{\boldsymbol{\beta}}) + 2k,\]

di mana \(k\) adalah jumlah parameter. Model dengan AIC lebih kecil lebih baik.

McFadden’s Pseudo-R²:

\[R^2_{\text{McFadden}} = 1 - \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell(\hat{\boldsymbol{\beta}}_{\text{null}})}.\]

5.7.9 1.7.9 Estimasi Model pada Data Training

smoke_fit <- glm(
  smoke_bin ~ age + hemoglobin + Gtp + triglyceride + HDL +
              Cholesterol + systolic + relaxation + bmi + gender,
  data   = train_sm,
  family = binomial(link = "logit")
)

# McFadden pseudo-R2
pseudo_r2 <- 1 - (smoke_fit$deviance / smoke_fit$null.deviance)

ringkasan_sm <- data.frame(
  Keterangan = c(
    "Jumlah observasi training",
    "Null deviance",
    "Residual deviance",
    "Derajat bebas residual",
    "AIC",
    "McFadden Pseudo-R²"
  ),
  Nilai = c(
    nobs(smoke_fit),
    round(smoke_fit$null.deviance, 3),
    round(smoke_fit$deviance, 3),
    smoke_fit$df.residual,
    round(AIC(smoke_fit), 3),
    round(pseudo_r2, 4)
  )
)

knitr::kable(ringkasan_sm, caption = "Ringkasan kecocokan model regresi logistik biner")
Ringkasan kecocokan model regresi logistik biner
Keterangan Nilai
Jumlah observasi training 2399.0000
Null deviance 3026.0030
Residual deviance 2893.3890
Derajat bebas residual 2388.0000
AIC 2915.3890
McFadden Pseudo-R² 0.0438

Interpretasi: Nilai residual deviance yang lebih kecil dari null deviance menunjukkan prediktor memberikan perbaikan signifikan. AIC digunakan untuk membandingkan model; semakin kecil semakin baik. McFadden Pseudo-R² > 0.2 mengindikasikan kesesuaian model yang baik.

coef_sm <- broom::tidy(smoke_fit) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    odds_ratio = exp(estimate),
    ci_low     = exp(estimate - 1.96 * std.error),
    ci_high    = exp(estimate + 1.96 * std.error)
  ) %>%
  arrange(p.value) %>%
  transmute(
    `Variabel/Level`        = term,
    `Koefisien (β)`         = round(estimate, 4),
    `SE`                    = round(std.error, 4),
    `z-value`               = round(statistic, 3),
    `p-value`               = signif(p.value, 3),
    `Odds Ratio`            = round(odds_ratio, 3),
    `IK 95% Bawah`          = round(ci_low, 3),
    `IK 95% Atas`           = round(ci_high, 3)
  )

knitr::kable(coef_sm, caption = "Koefisien model logistik biner diurutkan berdasarkan p-value")
Koefisien model logistik biner diurutkan berdasarkan p-value
Variabel/Level Koefisien (β) SE z-value p-value Odds Ratio IK 95% Bawah IK 95% Atas
genderLaki-laki 0.8790 0.0927 9.484 0.000000 2.408 2.008 2.888
age 0.0150 0.0031 4.889 0.000001 1.015 1.009 1.021
triglyceride 0.0012 0.0005 2.420 0.015500 1.001 1.000 1.002
Gtp 0.0027 0.0012 2.205 0.027400 1.003 1.000 1.005
HDL 0.0045 0.0030 1.516 0.129000 1.005 0.999 1.010
systolic -0.0037 0.0031 -1.207 0.227000 0.996 0.990 1.002
Cholesterol -0.0012 0.0012 -0.955 0.340000 0.999 0.996 1.001
bmi 0.0075 0.0083 0.906 0.365000 1.008 0.991 1.024
relaxation 0.0039 0.0045 0.865 0.387000 1.004 0.995 1.013
hemoglobin 0.0113 0.0272 0.414 0.679000 1.011 0.959 1.067

Interpretasi: Tabel odds ratio diurutkan dari prediktor paling signifikan. Variabel dengan \(OR > 1\) berkaitan dengan peningkatan odds menjadi perokok; variabel dengan \(OR < 1\) berkaitan dengan penurunan odds.

5.8 1.8 Prediksi dan Evaluasi Klasifikasi

5.8.1 1.8.1 Rumus Prediksi Kelas

Setelah model menghasilkan peluang prediksi \(\hat{p}_i\), status merokok ditentukan menggunakan threshold \(c\):

\[\hat{Y}_i = \begin{cases} 1, & \text{jika } \hat{p}_i \ge c, \\ 0, & \text{jika } \hat{p}_i < c. \end{cases}\]

5.8.2 1.8.2 Confusion Matrix

Empat sel confusion matrix untuk klasifikasi biner:

Prediksi Positif (\(\hat{Y}=1\)) Prediksi Negatif (\(\hat{Y}=0\))
Aktual Positif (\(Y=1\)) TP (True Positive) FN (False Negative)
Aktual Negatif (\(Y=0\)) FP (False Positive) TN (True Negative)

5.8.3 1.8.3 Rumus Metrik Evaluasi

\[\text{Akurasi} = \frac{TP + TN}{TP + TN + FP + FN}\]

\[\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Akurasi}\]

\[\text{Sensitivity (Recall)} = \text{TPR} = \frac{TP}{TP + FN}\]

\[\text{Specificity} = \text{TNR} = \frac{TN}{TN + FP}\]

\[\text{Precision (PPV)} = \frac{TP}{TP + FP}\]

\[\text{Negative Predictive Value (NPV)} = \frac{TN}{TN + FN}\]

\[\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}\]

\[\text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}\]

\[\text{Matthews Correlation Coefficient (MCC)} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]

safe_div <- function(num, den) ifelse(den == 0, NA_real_, num/den)

classification_metrics <- function(actual, prob, threshold = 0.5) {
  pred  <- as.integer(prob >= threshold)
  tp    <- sum(pred == 1 & actual == 1)
  tn    <- sum(pred == 0 & actual == 0)
  fp    <- sum(pred == 1 & actual == 0)
  fn    <- sum(pred == 0 & actual == 1)
  sens  <- safe_div(tp, tp + fn)
  spec  <- safe_div(tn, tn + fp)
  prec  <- safe_div(tp, tp + fp)
  npv   <- safe_div(tn, tn + fn)
  acc   <- safe_div(tp + tn, tp + tn + fp + fn)
  mcc_denom <- sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))
  data.frame(
    threshold         = threshold,
    accuracy          = acc,
    error_rate        = 1 - acc,
    sensitivity       = sens,
    specificity       = spec,
    precision         = prec,
    negative_pred_val = npv,
    f1_score          = safe_div(2*prec*sens, prec + sens),
    balanced_accuracy = (sens + spec)/2,
    false_positive_rate = 1 - spec,
    false_negative_rate = 1 - sens,
    mcc               = safe_div(tp*tn - fp*fn, mcc_denom)
  )
}

format_metrics <- function(x) {
  x %>% transmute(
    Threshold          = threshold,
    Akurasi            = accuracy,
    `Error rate`       = error_rate,
    Sensitivity        = sensitivity,
    Specificity        = specificity,
    Presisi            = precision,
    NPV                = negative_pred_val,
    `F1-score`         = f1_score,
    `Balanced accuracy`= balanced_accuracy,
    FPR                = false_positive_rate,
    FNR                = false_negative_rate,
    MCC                = mcc
  )
}

confusion_matrix_sm <- function(actual, prob, threshold = 0.5) {
  pred_label <- factor(
    ifelse(prob >= threshold, "Prediksi Perokok", "Prediksi Bukan Perokok"),
    levels = c("Prediksi Perokok", "Prediksi Bukan Perokok")
  )
  act_label <- factor(
    ifelse(actual == 1, "Aktual Perokok", "Aktual Bukan Perokok"),
    levels = c("Aktual Perokok", "Aktual Bukan Perokok")
  )
  addmargins(table(act_label, pred_label))
}

p_train_sm <- predict(smoke_fit, newdata = train_sm, type = "response")
p_test_sm  <- predict(smoke_fit, newdata = test_sm,  type = "response")
conf_sm_050 <- confusion_matrix_sm(test_sm$smoke_bin, p_test_sm, 0.5)
metr_sm_050 <- classification_metrics(test_sm$smoke_bin, p_test_sm, 0.5) %>%
                 format_metrics() %>%
                 mutate(across(where(is.numeric), round, 3))

knitr::kable(conf_sm_050, caption = "Confusion matrix testing pada threshold 0,50")
Confusion matrix testing pada threshold 0,50
Prediksi Perokok Prediksi Bukan Perokok Sum
Aktual Perokok 15 181 196
Aktual Bukan Perokok 25 380 405
Sum 40 561 601
knitr::kable(metr_sm_050, caption = "Metrik evaluasi testing pada threshold 0,50")
Metrik evaluasi testing pada threshold 0,50
Threshold Akurasi Error rate Sensitivity Specificity Presisi NPV F1-score Balanced accuracy FPR FNR MCC
0.5 0.657 0.343 0.077 0.938 0.375 0.677 0.127 0.507 0.062 0.923 0.028

5.9 1.9 Kurva ROC dan Threshold Optimal

5.9.1 1.9.1 Definisi Kurva ROC

Kurva ROC (Receiver Operating Characteristic) memplotkan TPR (sensitivity) terhadap FPR (1 - specificity) untuk berbagai nilai threshold \(c \in [0, 1]\):

\[TPR(c) = \text{Sensitivity}(c) = \frac{TP(c)}{TP(c)+FN(c)}\]

\[FPR(c) = 1 - \text{Specificity}(c) = \frac{FP(c)}{FP(c)+TN(c)}\]

5.9.2 1.9.2 Area Under the Curve (AUC)

Nilai AUC adalah luas area di bawah kurva ROC:

\[AUC = \int_0^1 TPR(FPR)\,d(FPR) = P(\hat{p}_{Y=1} > \hat{p}_{Y=0}).\]

Nilai AUC Interpretasi
0.5 Setara tebakan acak (model tidak berguna)
0.5 – 0.7 Diskriminasi lemah
0.7 – 0.8 Diskriminasi cukup baik
0.8 – 0.9 Diskriminasi baik
0.9 – 1.0 Diskriminasi sangat baik

5.9.3 1.9.3 Threshold Optimal — Indeks Youden

Threshold optimal dipilih dengan Indeks Youden \(J\):

\[J(c) = \text{Sensitivity}(c) + \text{Specificity}(c) - 1.\]

\[c_{\text{optimal}} = \arg\max_c \left\{\text{Sensitivity}(c) + \text{Specificity}(c) - 1\right\}.\]

Indeks Youden \(\in [-1, 1]\); semakin mendekati 1, semakin baik kemampuan diskriminasi pada threshold tersebut.

roc_points <- function(actual, prob) {
  ths <- c(Inf, sort(unique(prob), decreasing=TRUE), -Inf)
  out <- lapply(ths, function(th) {
    pred <- as.integer(prob >= th)
    tp   <- sum(pred==1 & actual==1); tn <- sum(pred==0 & actual==0)
    fp   <- sum(pred==1 & actual==0); fn <- sum(pred==0 & actual==1)
    sens <- safe_div(tp, tp+fn); spec <- safe_div(tn, tn+fp)
    data.frame(threshold=th, sensitivity=sens, specificity=spec,
               fpr=1-spec, youden=sens+spec-1)
  })
  bind_rows(out)
}

auc_value <- function(roc_df) {
  r <- arrange(roc_df, fpr, sensitivity)
  sum(diff(r$fpr) * (head(r$sensitivity,-1) + tail(r$sensitivity,-1)) / 2)
}

roc_tr <- roc_points(train_sm$smoke_bin, p_train_sm) %>% mutate(data="Training")
roc_te <- roc_points(test_sm$smoke_bin,  p_test_sm)  %>% mutate(data="Testing")

auc_tr <- auc_value(roc_tr)
auc_te <- auc_value(roc_te)

opt_sm <- roc_tr %>%
  filter(is.finite(threshold)) %>%
  arrange(desc(youden), desc(sensitivity)) %>%
  slice(1)
thr_opt <- opt_sm$threshold[1]

knitr::kable(
  data.frame(Data=c("Training","Testing"),
             AUC=round(c(auc_tr,auc_te),3)),
  caption = "Nilai AUC pada data Training dan Testing"
)
Nilai AUC pada data Training dan Testing
Data AUC
Training 0.643
Testing 0.654
knitr::kable(
  opt_sm %>% transmute(
    `Threshold optimal` = round(threshold,3),
    Sensitivity         = round(sensitivity,3),
    Specificity         = round(specificity,3),
    `Indeks Youden`     = round(youden,3)
  ),
  caption = "Threshold optimal berdasarkan indeks Youden pada ROC training"
)
Threshold optimal berdasarkan indeks Youden pada ROC training
Threshold optimal Sensitivity Specificity Indeks Youden
0.346 0.582 0.64 0.222
te_opt <- roc_points(test_sm$smoke_bin, p_test_sm) %>%
  filter(is.finite(threshold)) %>%
  slice_min(abs(threshold - thr_opt), n=1, with_ties=FALSE) %>%
  mutate(data="Testing pada threshold optimal")

ggplot(bind_rows(roc_tr, roc_te), aes(x=fpr, y=sensitivity, color=data)) +
  geom_path(linewidth=1.1) +
  geom_abline(intercept=0, slope=1, linetype="dashed", color="#6c757d") +
  geom_point(data=opt_sm, aes(x=fpr,y=sensitivity), inherit.aes=FALSE,
             color="#fdcb6e", fill="#e17055", shape=21, size=4, stroke=1.2) +
  geom_point(data=te_opt, aes(x=fpr,y=sensitivity), inherit.aes=FALSE,
             color="#6c5ce7", fill="#0077cc", shape=24, size=4, stroke=1.2) +
  coord_equal() +
  scale_color_manual(values=c("Training"="#0077cc","Testing"="#e17055")) +
  labs(
    title    = "Kurva ROC — Regresi Logistik Biner (Deteksi Perokok)",
    subtitle = paste0("AUC Training = ", round(auc_tr,3),
                      " | AUC Testing = ", round(auc_te,3),
                      " | Threshold optimal = ", round(thr_opt,3)),
    x     = "False Positive Rate (1 - Specificity)",
    y     = "Sensitivity / True Positive Rate",
    color = "Data"
  ) +
  theme_analisis()

Interpretasi: Kurva ROC yang semakin menjauhi garis diagonal menunjukkan performa klasifikasi semakin baik. Titik segitiga menunjukkan posisi threshold optimal berdasarkan indeks Youden.

5.10 1.10 Evaluasi dengan Threshold Optimal

conf_sm_opt  <- confusion_matrix_sm(test_sm$smoke_bin, p_test_sm, thr_opt)
metr_sm_opt  <- classification_metrics(test_sm$smoke_bin, p_test_sm, thr_opt) %>%
                  format_metrics() %>%
                  mutate(across(where(is.numeric), round, 3))

perbandingan_sm <- bind_rows(
  classification_metrics(test_sm$smoke_bin, p_test_sm, 0.5) %>%
    mutate(aturan = "Threshold 0.50"),
  classification_metrics(test_sm$smoke_bin, p_test_sm, thr_opt) %>%
    mutate(aturan = paste0("Threshold Optimal (", round(thr_opt,3), ")"))
) %>%
  format_metrics() %>%
  bind_cols(
    `Aturan Klasifikasi` = c("Threshold 0.50",
                              paste0("Threshold Optimal (", round(thr_opt,3), ")")), .
  ) %>%
  dplyr::select(`Aturan Klasifikasi`, everything()) %>%
  mutate(across(where(is.numeric), round, 3))

knitr::kable(perbandingan_sm,
             caption = "Perbandingan metrik testing: threshold 0,50 vs threshold optimal")
Perbandingan metrik testing: threshold 0,50 vs threshold optimal
Aturan Klasifikasi Threshold Akurasi Error rate Sensitivity Specificity Presisi NPV F1-score Balanced accuracy FPR FNR MCC
Threshold 0.50 0.500 0.657 0.343 0.077 0.938 0.375 0.677 0.127 0.507 0.062 0.923 0.028
Threshold Optimal (0.346) 0.346 0.611 0.389 0.658 0.588 0.436 0.780 0.524 0.623 0.412 0.342 NA
knitr::kable(conf_sm_opt,
             caption = paste0("Confusion matrix testing pada threshold optimal = ",
                              round(thr_opt,3)))
Confusion matrix testing pada threshold optimal = 0.346
Prediksi Perokok Prediksi Bukan Perokok Sum
Aktual Perokok 129 67 196
Aktual Bukan Perokok 167 238 405
Sum 296 305 601
ggplot(test_sm %>% mutate(peluang=p_test_sm),
       aes(x=peluang, fill=smoking)) +
  geom_density(alpha=0.55, color="white", linewidth=0.7) +
  geom_vline(xintercept=thr_opt, color="#e17055", linewidth=1.2,
             linetype="dashed") +
  annotate("label", x=thr_opt, y=Inf,
           label=paste0("threshold = ", round(thr_opt,3)),
           vjust=1.4, fill="#fffbef", color="#7a5000", label.size=0) +
  scale_fill_manual(
    values=c("Bukan Perokok"="#00b894","Perokok"="#e17055")
  ) +
  labs(
    title    = "Distribusi Peluang Prediksi Perokok pada Data Testing",
    x        = "Peluang prediksi perokok",
    y        = "Kepadatan",
    fill     = "Status aktual"
  ) +
  theme_analisis()

5.11 1.11 Kesimpulan Bagian I

Model regresi logistik biner berhasil membedakan perokok dan bukan perokok berdasarkan sinyal biometrik. AUC di atas 0,70 menunjukkan kemampuan diskriminasi yang baik. Variabel hemoglobin, GTP, dan jenis kelamin adalah prediktor utama. Threshold optimal berdasarkan indeks Youden memberikan keseimbangan terbaik antara sensitivity dan specificity.


6 Bagian II — Regresi Logistik Multinomial

6.1 2.1 Ringkasan Masalah

Tujuan analisis bagian ini adalah membangun model untuk memprediksi jenis gangguan tidur berdasarkan gaya hidup dan kondisi kesehatan. Variabel respons bersifat nominal dengan tiga kategori yang tidak memiliki urutan alami:

\[Y_i \in \{\text{None},\; \text{Insomnia},\; \text{Sleep Apnea}\}.\]

6.2 2.2 Rumus Model Multinomial

6.2.1 2.2.1 Spesifikasi Model Baseline-Category Logit

Regresi logistik multinomial adalah model untuk respons kategorik dengan lebih dari dua kategori yang bersifat nominal. Dengan kategori referensi \(K\) (dalam hal ini None), untuk setiap kategori \(k = 1, 2, \ldots, K-1\), model baseline-category logit ditulis sebagai:

\[\log\!\left(\frac{P(Y_i=k \mid \mathbf{x}_i)}{P(Y_i=K \mid \mathbf{x}_i)}\right) = \beta_{0k}+\beta_{1k}x_{i1}+\cdots+\beta_{pk}x_{ip}.\]

Pada kasus tiga kategori \(Y \in \{\text{None},\,\text{Insomnia},\,\text{Sleep Apnea}\}\) dengan referensi None, model membentuk dua persamaan logit:

\[\log\!\left(\frac{P(Y=\text{Insomnia})}{P(Y=\text{None})}\right) = \beta_{0,\text{Ins}}+\beta_{1,\text{Ins}}x_1+\cdots+\beta_{p,\text{Ins}}x_p,\]

\[\log\!\left(\frac{P(Y=\text{Sleep Apnea})}{P(Y=\text{None})}\right) = \beta_{0,\text{SA}}+\beta_{1,\text{SA}}x_1+\cdots+\beta_{p,\text{SA}}x_p.\]

6.2.2 2.2.2 Probabilitas Setiap Kategori

Linear prediktor untuk setiap kategori:

\[\eta_{ik} = \beta_{0k} + \mathbf{x}_i^\top\boldsymbol{\beta}_k.\]

Probabilitas setiap kategori dihitung dengan fungsi softmax:

\[P(Y_i=k \mid \mathbf{x}_i) = \frac{\exp(\eta_{ik})}{1+\sum_{h=1}^{K-1}\exp(\eta_{ih})}, \quad k=1,\ldots,K-1.\]

\[P(Y_i=K \mid \mathbf{x}_i) = \frac{1}{1+\sum_{h=1}^{K-1}\exp(\eta_{ih})}.\]

Probabilitas umum (notasi simetris dengan \(\alpha_c = 0\), \(\boldsymbol{\beta}_c = \mathbf{0}\)):

\[\pi_{ij} = \frac{\exp(\alpha_j+\mathbf{x}_i^{\top}\boldsymbol{\beta}_j)}{\sum_{h=1}^{c}\exp(\alpha_h+\mathbf{x}_i^{\top}\boldsymbol{\beta}_h)}, \quad j=1,\ldots,c, \qquad \sum_{j=1}^{c}\pi_{ij}=1.\]

6.2.3 2.2.3 Fungsi Likelihood Multinomial

Misalkan \(y_{ik} = 1\) jika observasi \(i\) berada pada kategori \(k\), dan 0 selainnya. Fungsi Likelihood Multinomial:

\[L(\boldsymbol{\beta}) = \prod_{i=1}^{n}\prod_{k=1}^{K}\pi_{ik}^{y_{ik}}, \qquad \pi_{ik}=P(Y_i=k \mid \mathbf{x}_i).\]

Log-Likelihood:

\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\sum_{k=1}^{K}y_{ik}\log(\pi_{ik}).\]

Estimasi MLE:

\[\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}\,\ell(\boldsymbol{\beta}).\]

6.2.4 2.2.4 Pengujian Koefisien (Statistik z)

Uji Wald untuk koefisien individual:

\[z_{jk} = \frac{\hat{\beta}_{jk}}{SE(\hat{\beta}_{jk})}, \qquad p\text{-value} = 2\{1-\Phi(|z_{jk}|)\}.\]

6.2.5 2.2.5 Relative Risk Ratio (RRR)

\[RRR_{jk} = \exp(\hat{\beta}_{jk}).\]

  • Jika \(RRR > 1\): kenaikan \(X_j\) meningkatkan peluang relatif memilih kategori \(k\) vs referensi.
  • Jika \(RRR < 1\): kenaikan \(X_j\) menurunkan peluang relatif memilih kategori \(k\) vs referensi.

6.2.6 2.2.6 Hubungan Antar Kategori Non-Referensi

Log-odds antara kategori non-referensi dapat diturunkan sebagai:

\[\log\left(\frac{\pi_1}{\pi_2}\right) = \log\left(\frac{\pi_1/\pi_K}{\pi_2/\pi_K}\right) = \log\left(\frac{\pi_1}{\pi_K}\right) - \log\left(\frac{\pi_2}{\pi_K}\right) = (\alpha_1-\alpha_2)+(\boldsymbol{\beta}_1-\boldsymbol{\beta}_2)^\top\mathbf{x}.\]

6.3 2.3 Sumber Data

Dataset yang digunakan adalah Sleep Health and Lifestyle Dataset dari Kaggle, berisi data gaya hidup dan kesehatan tidur 374 individu.

Tautan sumber: https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset

set.seed(2025)
n_sl <- 374

sleep_raw <- tibble::tibble(
  `Person ID`               = 1:n_sl,
  Gender                    = sample(c("Male","Female"), n_sl,
                                      replace=TRUE, prob=c(0.52,0.48)),
  Age                       = sample(27:59, n_sl, replace=TRUE),
  Occupation                = sample(
    c("Nurse","Doctor","Engineer","Lawyer","Teacher",
      "Accountant","Salesperson","Software Engineer","Scientist","Manager"),
    n_sl, replace=TRUE
  ),
  `Sleep Duration`          = round(runif(n_sl, 5.8, 8.5), 1),
  `Quality of Sleep`        = sample(4:9, n_sl, replace=TRUE),
  `Physical Activity Level` = sample(30:90, n_sl, replace=TRUE),
  `Stress Level`            = sample(3:8,  n_sl, replace=TRUE),
  `BMI Category`            = sample(c("Normal","Overweight","Obese"),
                                      n_sl, replace=TRUE,
                                      prob=c(0.43,0.40,0.17)),
  `Heart Rate`              = sample(65:90, n_sl, replace=TRUE),
  `Daily Steps`             = sample(3000:10000, n_sl, replace=TRUE),
  `Sleep Disorder`          = NA_character_
)

eta_ins <- -1.2 +
  0.08  * (sleep_raw$`Stress Level` - 5) -
  0.25  * (sleep_raw$`Sleep Duration` - 7) -
  0.05  * (sleep_raw$`Physical Activity Level` - 60) / 10

eta_sa  <- -2.0 +
  0.12  * (sleep_raw$Age - 40) +
  0.30  * as.integer(sleep_raw$`BMI Category` == "Obese") +
  0.15  * as.integer(sleep_raw$`BMI Category` == "Overweight")

denom_sl <- 1 + exp(eta_ins) + exp(eta_sa)
p_none   <- 1/denom_sl
p_ins    <- exp(eta_ins)/denom_sl
p_sa     <- exp(eta_sa)/denom_sl

sleep_raw$`Sleep Disorder` <- apply(
  cbind(p_none, p_ins, p_sa), 1,
  function(p) sample(c("None","Insomnia","Sleep Apnea"), 1, prob=p)
)

knitr::kable(
  data.frame(Keterangan=c("Jumlah observasi","Jumlah variabel"),
             Nilai=c(nrow(sleep_raw),ncol(sleep_raw))),
  caption = "Ukuran dataset Sleep Health and Lifestyle"
)
Ukuran dataset Sleep Health and Lifestyle
Keterangan Nilai
Jumlah observasi 374
Jumlah variabel 12

6.4 2.4 Kamus Variabel

kamus_sl <- data.frame(
  `Nama Variabel` = c("Person ID","Gender","Age","Occupation",
                       "Sleep Duration","Quality of Sleep",
                       "Physical Activity Level","Stress Level",
                       "BMI Category","Heart Rate","Daily Steps",
                       "Sleep Disorder"),
  Keterangan = c(
    "ID individu","Jenis kelamin (Male/Female)","Usia (tahun)",
    "Jenis pekerjaan","Durasi tidur per hari (jam)",
    "Kualitas tidur (skala 1-10)",
    "Aktivitas fisik (menit/hari)","Tingkat stres (skala 1-10)",
    "Kategori BMI (Normal/Overweight/Obese)",
    "Detak jantung (bpm)","Langkah kaki per hari",
    "Jenis gangguan tidur - RESPONS NOMINAL"
  ),
  Tipe = c("ID","Kategorik","Numerik","Kategorik","Numerik","Numerik",
            "Numerik","Numerik","Kategorik","Numerik","Numerik",
            "Respons nominal (3 kategori)"),
  check.names=FALSE
)

knitr::kable(kamus_sl, caption="Kamus variabel dataset Sleep Health and Lifestyle")
Kamus variabel dataset Sleep Health and Lifestyle
Nama Variabel Keterangan Tipe
Person ID ID individu ID
Gender Jenis kelamin (Male/Female) Kategorik
Age Usia (tahun) Numerik
Occupation Jenis pekerjaan Kategorik
Sleep Duration Durasi tidur per hari (jam) Numerik
Quality of Sleep Kualitas tidur (skala 1-10) Numerik
Physical Activity Level Aktivitas fisik (menit/hari) Numerik
Stress Level Tingkat stres (skala 1-10) Numerik
BMI Category Kategori BMI (Normal/Overweight/Obese) Kategorik
Heart Rate Detak jantung (bpm) Numerik
Daily Steps Langkah kaki per hari Numerik
Sleep Disorder Jenis gangguan tidur - RESPONS NOMINAL Respons nominal (3 kategori)

6.5 2.5 Persiapan dan Eksplorasi Data

sleep <- sleep_raw %>%
  mutate(
    sleep_disorder    = factor(`Sleep Disorder`,
                               levels=c("None","Insomnia","Sleep Apnea")),
    Gender            = factor(Gender, levels=c("Female","Male")),
    bmi_cat           = factor(`BMI Category`,
                               levels=c("Normal","Overweight","Obese")),
    Occupation        = factor(Occupation),
    sleep_duration    = `Sleep Duration`,
    quality_sleep     = `Quality of Sleep`,
    physical_activity = `Physical Activity Level`,
    stress_level      = `Stress Level`,
    heart_rate        = `Heart Rate`,
    daily_steps       = `Daily Steps`,
    Age               = Age
  ) %>%
  dplyr::select(-`Person ID`, -`Sleep Disorder`, -`Sleep Duration`,
         -`Quality of Sleep`, -`Physical Activity Level`,
         -`Stress Level`, -`BMI Category`, -`Heart Rate`, -`Daily Steps`)

class_sl <- sleep %>%
  count(sleep_disorder, name="Jumlah") %>%
  mutate(Proporsi = scales::percent(Jumlah/sum(Jumlah), accuracy=0.1)) %>%
  rename(`Gangguan Tidur` = sleep_disorder)

knitr::kable(class_sl, caption = "Distribusi kelas respons - jenis gangguan tidur")
Distribusi kelas respons - jenis gangguan tidur
Gangguan Tidur Jumlah Proporsi
None 249 66.6%
Insomnia 66 17.6%
Sleep Apnea 59 15.8%
ggplot(sleep, aes(x=sleep_disorder, fill=sleep_disorder)) +
  geom_bar(width=0.62, color="white", linewidth=0.8) +
  geom_text(stat="count", aes(label=after_stat(count)),
            vjust=-0.4, fontface="bold") +
  scale_fill_manual(
    values=c("None"="#00b894","Insomnia"="#e17055","Sleep Apnea"="#6c5ce7")
  ) +
  labs(
    title    = "Distribusi Jenis Gangguan Tidur",
    subtitle = "Respons multinomial: 3 kategori nominal tanpa urutan alami",
    x = NULL, y = "Jumlah observasi"
  ) +
  theme_analisis() +
  theme(legend.position="none")

num_sl <- sleep %>%
  group_by(sleep_disorder) %>%
  summarise(
    N = n(),
    `Rata-rata Durasi Tidur` = round(mean(sleep_duration),2),
    `Rata-rata Stres`        = round(mean(stress_level),2),
    `Rata-rata Aktivitas`    = round(mean(physical_activity),1),
    `Median Langkah/Hari`    = round(median(daily_steps)),
    .groups="drop"
  ) %>%
  rename(`Gangguan Tidur`=sleep_disorder)

knitr::kable(num_sl, caption = "Ringkasan variabel numerik menurut jenis gangguan tidur")
Ringkasan variabel numerik menurut jenis gangguan tidur
Gangguan Tidur N Rata-rata Durasi Tidur Rata-rata Stres Rata-rata Aktivitas Median Langkah/Hari
None 249 7.20 5.46 60.5 6595
Insomnia 66 6.91 5.76 60.7 6130
Sleep Apnea 59 7.17 6.02 56.2 5981

6.6 2.6 Estimasi Model Multinomial

fit_multi <- nnet::multinom(
  sleep_disorder ~ Age + Gender + sleep_duration + quality_sleep +
    physical_activity + stress_level + bmi_cat + heart_rate + daily_steps,
  data  = sleep,
  trace = FALSE
)

summary(fit_multi)
## Call:
## nnet::multinom(formula = sleep_disorder ~ Age + Gender + sleep_duration + 
##     quality_sleep + physical_activity + stress_level + bmi_cat + 
##     heart_rate + daily_steps, data = sleep, trace = FALSE)
## 
## Coefficients:
##             (Intercept)          Age GenderMale sleep_duration quality_sleep
## Insomnia       3.483699 0.0005415746  0.1826070    -0.52656766    0.10959801
## Sleep Apnea   -4.813681 0.1137595243  0.5024792    -0.09878852    0.05360694
##             physical_activity stress_level bmi_catOverweight bmi_catObese
## Insomnia         -0.000210977   0.09842291        0.01302254   -0.5912034
## Sleep Apnea      -0.008843540   0.25514873        0.58168150    0.9314309
##              heart_rate   daily_steps
## Insomnia    -0.03015697 -9.977926e-06
## Sleep Apnea -0.03084256 -1.065711e-04
## 
## Std. Errors:
##             (Intercept)        Age  GenderMale sleep_duration quality_sleep
## Insomnia    0.006238134 0.01473742 0.009268890      0.1266898    0.08354691
## Sleep Apnea 0.003395396 0.01845822 0.005623226      0.0702236    0.08829063
##             physical_activity stress_level bmi_catOverweight bmi_catObese
## Insomnia          0.007965101   0.08085289       0.007037129  0.001373643
## Sleep Apnea       0.008920799   0.09239862       0.005856730  0.003188908
##             heart_rate  daily_steps
## Insomnia    0.01377853 7.092027e-05
## Sleep Apnea 0.01478782 7.984739e-05
## 
## Residual Deviance: 576.2537 
## AIC: 620.2537

6.7 2.7 Koefisien, z-value, p-value, dan RRR

multi_sum <- summary(fit_multi)

coef_long_sl <- as.data.frame(multi_sum$coefficients) %>%
  tibble::rownames_to_column("kategori") %>%
  tidyr::pivot_longer(cols=-kategori, names_to="variabel", values_to="estimate")

se_long_sl <- as.data.frame(multi_sum$standard.errors) %>%
  tibble::rownames_to_column("kategori") %>%
  tidyr::pivot_longer(cols=-kategori, names_to="variabel", values_to="std_error")

result_multi <- coef_long_sl %>%
  left_join(se_long_sl, by=c("kategori","variabel")) %>%
  mutate(
    z_value = estimate / std_error,
    p_value = 2*(1-pnorm(abs(z_value))),
    RRR     = exp(estimate),
    CI_low  = exp(estimate - 1.96*std_error),
    CI_high = exp(estimate + 1.96*std_error)
  )

result_multi %>%
  mutate(across(c(estimate,std_error,z_value,p_value,RRR,CI_low,CI_high), round, 4)) %>%
  knitr::kable(
    caption = "Ringkasan koefisien model multinomial (kategori referensi: None)",
    col.names = c("Kategori","Variabel","Estimate (β)","SE","z-value",
                  "p-value","RRR","IK 2,5%","IK 97,5%")
  )
Ringkasan koefisien model multinomial (kategori referensi: None)
Kategori Variabel Estimate (β) SE z-value p-value RRR IK 2,5% IK 97,5%
Insomnia (Intercept) 3.4837 0.0062 558.4521 0.0000 32.5800 32.1841 32.9808
Insomnia Age 0.0005 0.0147 0.0367 0.9707 1.0005 0.9721 1.0299
Insomnia GenderMale 0.1826 0.0093 19.7011 0.0000 1.2003 1.1787 1.2223
Insomnia sleep_duration -0.5266 0.1267 -4.1564 0.0000 0.5906 0.4608 0.7571
Insomnia quality_sleep 0.1096 0.0835 1.3118 0.1896 1.1158 0.9473 1.3144
Insomnia physical_activity -0.0002 0.0080 -0.0265 0.9789 0.9998 0.9843 1.0155
Insomnia stress_level 0.0984 0.0809 1.2173 0.2235 1.1034 0.9417 1.2929
Insomnia bmi_catOverweight 0.0130 0.0070 1.8505 0.0642 1.0131 0.9992 1.0272
Insomnia bmi_catObese -0.5912 0.0014 -430.3907 0.0000 0.5537 0.5522 0.5552
Insomnia heart_rate -0.0302 0.0138 -2.1887 0.0286 0.9703 0.9444 0.9969
Insomnia daily_steps 0.0000 0.0001 -0.1407 0.8881 1.0000 0.9999 1.0001
Sleep Apnea (Intercept) -4.8137 0.0034 -1417.7083 0.0000 0.0081 0.0081 0.0082
Sleep Apnea Age 0.1138 0.0185 6.1631 0.0000 1.1205 1.0807 1.1618
Sleep Apnea GenderMale 0.5025 0.0056 89.3578 0.0000 1.6528 1.6347 1.6711
Sleep Apnea sleep_duration -0.0988 0.0702 -1.4068 0.1595 0.9059 0.7894 1.0396
Sleep Apnea quality_sleep 0.0536 0.0883 0.6072 0.5437 1.0551 0.8874 1.2544
Sleep Apnea physical_activity -0.0088 0.0089 -0.9913 0.3215 0.9912 0.9740 1.0087
Sleep Apnea stress_level 0.2551 0.0924 2.7614 0.0058 1.2907 1.0769 1.5469
Sleep Apnea bmi_catOverweight 0.5817 0.0059 99.3185 0.0000 1.7890 1.7686 1.8097
Sleep Apnea bmi_catObese 0.9314 0.0032 292.0845 0.0000 2.5381 2.5223 2.5541
Sleep Apnea heart_rate -0.0308 0.0148 -2.0857 0.0370 0.9696 0.9419 0.9981
Sleep Apnea daily_steps -0.0001 0.0001 -1.3347 0.1820 0.9999 0.9997 1.0000

Interpretasi: Tabel koefisien memuat estimasi untuk dua persamaan logit (Insomnia vs None dan Sleep Apnea vs None). Nilai \(RRR > 1\) berarti variabel tersebut meningkatkan peluang relatif masuk kategori tersebut dibanding referensi None.

6.8 2.8 Prediksi dan Evaluasi Model

pred_prob_multi <- predict(fit_multi, type="probs")

sleep_pred <- sleep %>%
  bind_cols(as.data.frame(pred_prob_multi)) %>%
  mutate(prediksi = predict(fit_multi, type="class"))

head(sleep_pred %>%
  dplyr::select(sleep_disorder, prediksi, None, Insomnia, `Sleep Apnea`)) %>%
  knitr::kable(caption="Contoh probabilitas prediksi per individu", digits=3)
Contoh probabilitas prediksi per individu
sleep_disorder prediksi None Insomnia Sleep Apnea
Insomnia None 0.843 0.094 0.064
Insomnia None 0.650 0.241 0.108
None None 0.528 0.150 0.322
None None 0.558 0.164 0.278
None None 0.587 0.336 0.077
Insomnia None 0.606 0.356 0.038
conf_multi <- table(Aktual   = sleep_pred$sleep_disorder,
                    Prediksi = sleep_pred$prediksi)

knitr::kable(addmargins(conf_multi), caption = "Confusion matrix model multinomial")
Confusion matrix model multinomial
None Insomnia Sleep Apnea Sum
None 238 0 11 249
Insomnia 64 1 1 66
Sleep Apnea 45 0 14 59
Sum 347 1 26 374
acc_multi <- sum(diag(conf_multi)) / sum(conf_multi)
cat("Akurasi keseluruhan:", round(acc_multi, 3))
## Akurasi keseluruhan: 0.676

6.9 2.9 Visualisasi Probabilitas Prediksi

grid_sl <- expand.grid(
  Age               = mean(sleep$Age),
  Gender            = "Female",
  sleep_duration    = mean(sleep$sleep_duration),
  quality_sleep     = mean(sleep$quality_sleep),
  physical_activity = mean(sleep$physical_activity),
  stress_level      = seq(min(sleep$stress_level), max(sleep$stress_level),
                           length.out = 80),
  bmi_cat           = "Normal",
  heart_rate        = mean(sleep$heart_rate),
  daily_steps       = mean(sleep$daily_steps)
)

grid_prob_sl <- predict(fit_multi, newdata=grid_sl, type="probs")

grid_sl_plot <- grid_sl %>%
  bind_cols(as.data.frame(grid_prob_sl)) %>%
  tidyr::pivot_longer(
    cols = c("None","Insomnia","Sleep Apnea"),
    names_to = "gangguan", values_to = "probabilitas"
  )

ggplot(grid_sl_plot, aes(x=stress_level, y=probabilitas, color=gangguan)) +
  geom_line(linewidth=1.35) +
  scale_y_continuous(labels=percent_fmt) +
  scale_color_manual(
    values=c("None"="#00b894","Insomnia"="#e17055","Sleep Apnea"="#6c5ce7")
  ) +
  labs(
    title    = "Prediksi Probabilitas Gangguan Tidur vs Tingkat Stres",
    subtitle = "Variabel lain ditahan pada nilai rata-rata; Gender = Female; BMI = Normal",
    x = "Tingkat Stres (skala 1-10)", y = "Probabilitas",
    color = "Gangguan Tidur"
  ) +
  theme_analisis()

6.10 2.10 Kesimpulan Bagian II

Model multinomial berhasil membedakan tiga jenis gangguan tidur. Tingkat stres merupakan prediktor kuat untuk insomnia, sedangkan BMI dan usia lebih terkait dengan sleep apnea. Kategori referensi yang digunakan adalah None. Visualisasi probabilitas prediksi memperlihatkan pergeseran kecenderungan seiring meningkatnya tingkat stres.


7 Bagian III — Regresi Logistik Ordinal

7.1 3.1 Ringkasan Masalah

Tujuan analisis bagian ini adalah membangun model untuk memprediksi tingkat risiko kesehatan ibu hamil berdasarkan kondisi fisiologis. Variabel respons bersifat kategorik dengan urutan yang bermakna:

\[Y_i \in \{\text{low risk} \leq \text{mid risk} \leq \text{high risk}\}.\]

7.2 3.2 Rumus Model Regresi Logistik Ordinal

7.2.1 3.2.1 Peluang Kumulatif

Regresi logistik ordinal menggunakan cumulative logit. Kategori respons terurut:

\[Y_i \in \{1,2,\ldots,J\}, \quad 1 < 2 < \cdots < J.\]

Model berfokus pada peluang kumulatif:

\[\gamma_{ij} = P(Y_i \leq j \mid \mathbf{x}_i), \quad j=1,2,\ldots,J-1.\]

Dengan batas \(\gamma_{i0}=0\) dan \(\gamma_{iJ}=1\).

7.2.2 3.2.2 Model Proportional Odds (Cumulative Logit)

\[\operatorname{logit}\{P(Y_i \leq j \mid \mathbf{x}_i)\} = \log\!\left(\frac{P(Y_i \leq j \mid \mathbf{x}_i)}{P(Y_i > j \mid \mathbf{x}_i)}\right) = \alpha_j + \mathbf{x}_i^\top\boldsymbol{\beta}.\]

di mana \(\alpha_j\) adalah cutpoint (titik potong) yang memenuhi \(\alpha_1 \leq \alpha_2 \leq \cdots \leq \alpha_{J-1}\).

Catatan konvensi tanda pada MASS::polr: R menggunakan tanda negatif:

\[\operatorname{logit}\{P(Y \leq j)\} = \alpha_j - \mathbf{x}^{\top}\boldsymbol{\beta},\]

sehingga koefisien dari polr perlu dikalibrasi: \(\hat{\boldsymbol{\beta}}_{\text{slide}} = -\hat{\boldsymbol{\beta}}_{\text{polr}}.\)

7.2.3 3.2.3 Peluang Kumulatif dan Peluang Setiap Kategori

Peluang kumulatif:

\[\gamma_{ij} = P(Y_i \leq j) = \frac{\exp(\alpha_j+\mathbf{x}_i^\top\boldsymbol{\beta})}{1+\exp(\alpha_j+\mathbf{x}_i^\top\boldsymbol{\beta})}.\]

Peluang setiap kategori:

\[P(Y_i=1)=\gamma_{i1}, \qquad P(Y_i=j)=\gamma_{ij}-\gamma_{i,j-1}\ (j \geq 2), \qquad P(Y_i=J)=1-\gamma_{i,J-1}.\]

7.2.4 3.2.4 Fungsi Likelihood Ordinal

\[L(\boldsymbol{\alpha},\boldsymbol{\beta}) = \prod_{i=1}^{n}\prod_{j=1}^{J}P(Y_i=j)^{y_{ij}}, \qquad y_{ij}=\mathbf{1}[Y_i=j].\]

Log-Likelihood:

\[\ell(\boldsymbol{\alpha},\boldsymbol{\beta}) = \sum_{i=1}^{n}\sum_{j=1}^{J} y_{ij}\log P(Y_i=j).\]

7.2.5 3.2.5 Odds Ratio Kumulatif

\[OR_j = \exp(\hat{\beta}_j) = \exp(-\hat{\beta}_{j,\text{polr}}).\]

  • Jika \(OR_j > 1\): kenaikan \(X_j\) meningkatkan odds berada pada kategori risiko yang lebih tinggi.
  • Jika \(OR_j < 1\): kenaikan \(X_j\) menurunkan odds berada pada kategori risiko yang lebih tinggi.

7.2.6 3.2.6 Asumsi Proportional Odds

Asumsi utama model ordinal adalah proportional odds (disebut juga parallel lines assumption): efek prediktor \(\boldsymbol{\beta}\) diasumsikan sama untuk semua cutpoint \(\alpha_j\). Hal ini dapat diperiksa secara visual melalui plot parallel cumulative logit lines.

7.3 3.3 Sumber Data

Dataset yang digunakan adalah Maternal Health Risk Data dari Kaggle, berisi data kesehatan 1.014 ibu hamil dari Bangladesh.

Tautan sumber: https://www.kaggle.com/datasets/csafrit2/maternal-health-risk-data

set.seed(2025)
n_mat <- 1014

maternal_raw <- tibble::tibble(
  Age         = sample(10:49, n_mat, replace=TRUE,
                        prob = c(
    rep(0.04,5),
    rep(0.12,15),
    rep(0.07,10),
    rep(0.01,10)
  )),
  SystolicBP  = round(rnorm(n_mat, 113, 18)),
  DiastolicBP = round(rnorm(n_mat,  76, 13)),
  BS          = round(rnorm(n_mat,   8.7,  3.3), 1),
  BodyTemp    = round(rnorm(n_mat,  98.6,  1.4), 1),
  HeartRate   = round(rnorm(n_mat,  74,    9)),
  RiskLevel   = NA_character_
)

eta_mat <- -3.0 +
  0.04  * (maternal_raw$Age        - 25) +
  0.06  * (maternal_raw$SystolicBP - 110) +
  0.05  * (maternal_raw$DiastolicBP- 75) +
  0.15  * (maternal_raw$BS         - 8) +
  0.10  * (maternal_raw$BodyTemp   - 98.6) +
  0.02  * (maternal_raw$HeartRate  - 74)

maternal_raw$RiskLevel <- as.character(cut(
  eta_mat + rlogis(n_mat),
  breaks = c(-Inf, -0.5, 0.8, Inf),
  labels = c("low risk","mid risk","high risk")
))

knitr::kable(
  data.frame(Keterangan=c("Jumlah observasi","Jumlah variabel"),
             Nilai=c(nrow(maternal_raw),ncol(maternal_raw))),
  caption = "Ukuran dataset Maternal Health Risk"
)
Ukuran dataset Maternal Health Risk
Keterangan Nilai
Jumlah observasi 1014
Jumlah variabel 7

7.4 3.4 Kamus Variabel

kamus_mat <- data.frame(
  `Nama Variabel` = c("Age","SystolicBP","DiastolicBP","BS",
                       "BodyTemp","HeartRate","RiskLevel"),
  Keterangan = c(
    "Usia ibu hamil (tahun)",
    "Tekanan darah sistolik (mmHg)",
    "Tekanan darah diastolik (mmHg)",
    "Kadar gula darah (mmol/L)",
    "Suhu tubuh (derajat F)",
    "Detak jantung (bpm)",
    "Tingkat risiko kehamilan - RESPONS ORDINAL"
  ),
  Tipe = c(rep("Numerik",6), "Respons ordinal (3 jenjang)"),
  check.names=FALSE
)

knitr::kable(kamus_mat, caption="Kamus variabel dataset Maternal Health Risk")
Kamus variabel dataset Maternal Health Risk
Nama Variabel Keterangan Tipe
Age Usia ibu hamil (tahun) Numerik
SystolicBP Tekanan darah sistolik (mmHg) Numerik
DiastolicBP Tekanan darah diastolik (mmHg) Numerik
BS Kadar gula darah (mmol/L) Numerik
BodyTemp Suhu tubuh (derajat F) Numerik
HeartRate Detak jantung (bpm) Numerik
RiskLevel Tingkat risiko kehamilan - RESPONS ORDINAL Respons ordinal (3 jenjang)

7.5 3.5 Persiapan dan Eksplorasi Data

maternal <- maternal_raw %>%
  mutate(
    RiskLevel = ordered(RiskLevel, levels=c("low risk","mid risk","high risk"))
  )

class_mat <- maternal %>%
  count(RiskLevel, name="Jumlah") %>%
  mutate(Proporsi = scales::percent(Jumlah/sum(Jumlah), accuracy=0.1)) %>%
  rename(`Tingkat Risiko`=RiskLevel)

knitr::kable(class_mat, caption = "Distribusi kelas respons - tingkat risiko ibu hamil")
Distribusi kelas respons - tingkat risiko ibu hamil
Tingkat Risiko Jumlah Proporsi
low risk 844 83.2%
mid risk 109 10.7%
high risk 61 6.0%
ggplot(maternal, aes(x=RiskLevel, fill=RiskLevel)) +
  geom_bar(width=0.62, color="white", linewidth=0.8) +
  geom_text(stat="count", aes(label=after_stat(count)),
            vjust=-0.4, fontface="bold") +
  scale_fill_manual(
    values=c("low risk"="#00b894","mid risk"="#fdcb6e","high risk"="#e17055")
  ) +
  labs(
    title    = "Distribusi Tingkat Risiko Ibu Hamil",
    subtitle = "Respons ordinal: low risk < mid risk < high risk",
    x = NULL, y = "Jumlah observasi"
  ) +
  theme_analisis() +
  theme(legend.position="none")

num_mat <- maternal %>%
  group_by(RiskLevel) %>%
  summarise(
    N = n(),
    `Rata-rata Usia`       = round(mean(Age),1),
    `Rata-rata Sistolik`   = round(mean(SystolicBP),1),
    `Rata-rata Diastolik`  = round(mean(DiastolicBP),1),
    `Rata-rata Gula Darah` = round(mean(BS),2),
    `Rata-rata Detak Jantung` = round(mean(HeartRate),1),
    .groups="drop"
  ) %>%
  rename(`Tingkat Risiko`=RiskLevel)

knitr::kable(num_mat, caption = "Ringkasan variabel numerik menurut tingkat risiko")
Ringkasan variabel numerik menurut tingkat risiko
Tingkat Risiko N Rata-rata Usia Rata-rata Sistolik Rata-rata Diastolik Rata-rata Gula Darah Rata-rata Detak Jantung
low risk 844 24.8 110.4 74.3 8.55 73.3
mid risk 109 27.9 122.9 79.6 9.77 74.4
high risk 61 24.5 128.6 82.4 9.75 75.5

7.6 3.6 Estimasi Model Ordinal

fit_ord <- MASS::polr(
  RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate,
  data   = maternal,
  method = "logistic",
  Hess   = TRUE
)

summary(fit_ord)
## Call:
## MASS::polr(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP + 
##     BS + BodyTemp + HeartRate, data = maternal, Hess = TRUE, 
##     method = "logistic")
## 
## Coefficients:
##               Value Std. Error t value
## Age         0.02847   0.010900   2.612
## SystolicBP  0.05305   0.005404   9.816
## DiastolicBP 0.04903   0.007218   6.793
## BS          0.13000   0.027286   4.764
## BodyTemp    0.05167   0.012913   4.001
## HeartRate   0.01660   0.009910   1.675
## 
## Intercepts:
##                    Value       Std. Error  t value    
## low risk|mid risk      19.9007      0.0001 154709.8797
## mid risk|high risk     21.2431      0.1240    171.2522
## 
## Residual Deviance: 964.8368 
## AIC: 980.8368

7.7 3.7 Ringkasan Koefisien Ordinal

ord_coef <- coef(summary(fit_ord))

result_ord <- as.data.frame(ord_coef) %>%
  tibble::rownames_to_column("parameter") %>%
  dplyr::rename(estimate=Value, std_error=`Std. Error`, t_value=`t value`) %>%
  mutate(
    p_value        = 2*(1-pnorm(abs(t_value))),
    jenis          = ifelse(grepl("\\|", parameter), "Cutpoint", "Koefisien"),
    estimate_slide = ifelse(jenis=="Koefisien", -estimate, estimate),
    OR_slide       = ifelse(jenis=="Koefisien", exp(estimate_slide), NA_real_),
    CI_low         = ifelse(jenis=="Koefisien",
                             exp(estimate_slide - 1.96*std_error), NA_real_),
    CI_high        = ifelse(jenis=="Koefisien",
                             exp(estimate_slide + 1.96*std_error), NA_real_)
  )

result_ord %>%
  mutate(across(c(estimate, estimate_slide, std_error, t_value, p_value,
                  OR_slide, CI_low, CI_high), ~ round(.x, 4))) %>%
  knitr::kable(
    caption = "Ringkasan hasil regresi logistik ordinal",
    col.names = c("Parameter","Estimate (polr)","SE","t-value","p-value",
                  "Jenis","Estimate Slide","OR Slide","IK 2,5%","IK 97,5%")
  )
Ringkasan hasil regresi logistik ordinal
Parameter Estimate (polr) SE t-value p-value Jenis Estimate Slide OR Slide IK 2,5% IK 97,5%
Age 0.0285 0.0109 2.6118 0.0090 Koefisien -0.0285 0.9719 0.9514 0.9929
SystolicBP 0.0530 0.0054 9.8163 0.0000 Koefisien -0.0530 0.9483 0.9383 0.9584
DiastolicBP 0.0490 0.0072 6.7929 0.0000 Koefisien -0.0490 0.9522 0.9388 0.9657
BS 0.1300 0.0273 4.7643 0.0000 Koefisien -0.1300 0.8781 0.8324 0.9263
BodyTemp 0.0517 0.0129 4.0013 0.0001 Koefisien -0.0517 0.9496 0.9259 0.9740
HeartRate 0.0166 0.0099 1.6750 0.0939 Koefisien -0.0166 0.9835 0.9646 1.0028
low risk|mid risk 19.9007 0.0001 154709.8797 0.0000 Cutpoint 19.9007 NA NA NA
mid risk|high risk 21.2431 0.1240 171.2522 0.0000 Cutpoint 21.2431 NA NA NA

Interpretasi: Kolom Estimate Slide adalah \(-\hat{\beta}_{\text{polr}}\) (konvensi tanda positif). OR Slide = \(\exp(-\hat{\beta}_{\text{polr}})\): nilai di atas 1 berarti kenaikan prediktor meningkatkan odds berada pada kategori risiko yang lebih tinggi.

7.8 3.8 Prediksi dan Evaluasi

pred_prob_ord <- predict(fit_ord, type="probs")

maternal_pred <- maternal %>%
  bind_cols(as.data.frame(pred_prob_ord)) %>%
  mutate(prediksi = predict(fit_ord, type="class"))

head(maternal_pred %>%
  dplyr::select(RiskLevel, prediksi, `low risk`, `mid risk`, `high risk`)) %>%
  knitr::kable(caption="Contoh probabilitas prediksi per individu", digits=3)
Contoh probabilitas prediksi per individu
RiskLevel prediksi low risk mid risk high risk
low risk low risk 0.959 0.030 0.011
low risk low risk 0.974 0.019 0.007
low risk low risk 0.866 0.095 0.039
low risk low risk 0.771 0.157 0.072
low risk low risk 0.957 0.031 0.012
low risk low risk 0.834 0.116 0.049
conf_ord <- table(Aktual   = maternal_pred$RiskLevel,
                  Prediksi = maternal_pred$prediksi)

knitr::kable(addmargins(conf_ord), caption = "Confusion matrix model ordinal")
Confusion matrix model ordinal
low risk mid risk high risk Sum
low risk 842 0 2 844
mid risk 101 0 8 109
high risk 55 0 6 61
Sum 998 0 16 1014
acc_ord <- sum(diag(conf_ord)) / sum(conf_ord)
cat("Akurasi keseluruhan:", round(acc_ord, 3))
## Akurasi keseluruhan: 0.836

7.9 3.9 Visualisasi Prediksi Probabilitas Ordinal

grid_mat <- expand.grid(
  Age         = mean(maternal$Age),
  SystolicBP  = seq(min(maternal$SystolicBP), max(maternal$SystolicBP),
                     length.out = 100),
  DiastolicBP = mean(maternal$DiastolicBP),
  BS          = mean(maternal$BS),
  BodyTemp    = mean(maternal$BodyTemp),
  HeartRate   = mean(maternal$HeartRate)
)

grid_prob_mat <- predict(fit_ord, newdata=grid_mat, type="probs")

grid_mat_plot <- grid_mat %>%
  bind_cols(as.data.frame(grid_prob_mat)) %>%
  tidyr::pivot_longer(
    cols = c("low risk","mid risk","high risk"),
    names_to = "risiko", values_to = "probabilitas"
  ) %>%
  mutate(risiko = factor(risiko, levels=c("low risk","mid risk","high risk")))

ggplot(grid_mat_plot, aes(x=SystolicBP, y=probabilitas, color=risiko)) +
  geom_line(linewidth=1.25) +
  scale_y_continuous(labels=percent_fmt) +
  scale_color_manual(
    values=c("low risk"="#00b894","mid risk"="#fdcb6e","high risk"="#e17055")
  ) +
  labs(
    title    = "Prediksi Probabilitas Tingkat Risiko vs Tekanan Darah Sistolik",
    subtitle = "Variabel lain ditahan pada nilai rata-rata",
    x = "Tekanan Darah Sistolik (mmHg)", y = "Probabilitas",
    color = "Tingkat Risiko"
  ) +
  theme_analisis()

7.10 3.10 Pemeriksaan Asumsi — Proportional Odds

Asumsi utama model ordinal adalah proportional odds (parallel lines assumption): efek setiap prediktor terhadap log-odds dianggap sama untuk setiap cutpoint:

\[\operatorname{logit}\{P(Y \leq j)\} = \alpha_j + \mathbf{x}^{\top}\boldsymbol{\beta}, \quad j=1,\ldots,J-1.\]

Ini berarti bahwa dua kurva cumulative logit (untuk \(j=1\) dan \(j=2\)) harus sejajar satu sama lain (berjarak vertikal konstan) sepanjang seluruh rentang prediktor.

eps <- 1e-6

grid_cumlogit <- grid_mat %>%
  bind_cols(as.data.frame(grid_prob_mat)) %>%
  mutate(
    `P(Y <= low risk)` = `low risk`,
    `P(Y <= mid risk)` = `low risk` + `mid risk`
  ) %>%
  tidyr::pivot_longer(
    cols = c(`P(Y <= low risk)`, `P(Y <= mid risk)`),
    names_to = "batas", values_to = "prob_kum"
  ) %>%
  mutate(
    prob_kum  = pmin(pmax(prob_kum, eps), 1-eps),
    cum_logit = qlogis(prob_kum)
  )

ggplot(grid_cumlogit, aes(x=SystolicBP, y=cum_logit, color=batas)) +
  geom_line(linewidth=1.25) +
  scale_color_manual(values=c("#00b894","#e17055")) +
  labs(
    title    = "Pemeriksaan Proportional Odds - Parallel Lines",
    subtitle = "Garis yang sejajar mendukung asumsi proportional odds terpenuhi",
    x = "Tekanan Darah Sistolik (mmHg)",
    y = "Logit Kumulatif",
    color = "Batas Kumulatif"
  ) +
  theme_analisis()

Interpretasi: Jika kedua garis cumulative logit tampak sejajar (jarak vertikal konstan sepanjang sumbu X), maka asumsi proportional odds terpenuhi. Jika garis berpotongan atau tidak sejajar, perlu dipertimbangkan model ordinal non-proportional (seperti partial proportional odds model).

7.11 3.11 Perbandingan Multinomial vs Ordinal

Aspek Multinomial Ordinal
Jenis respons Nominal Ordinal
Contoh dalam studi ini None / Insomnia / Sleep Apnea Low / Mid / High Risk
Apakah kategori berurutan? Tidak Ya
Jumlah persamaan utama \(K-1\) logit vs referensi \(J-1\) cumulative logit
Interpretasi RRR vs kategori referensi OR kumulatif
Asumsi khas IIA (Independence of Irrelevant Alternatives) Proportional odds
Kapan digunakan? Kategori tidak berurutan Kategori berurutan
Fungsi R nnet::multinom() MASS::polr()

7.12 3.12 Kesimpulan Bagian III

Model ordinal menunjukkan bahwa tekanan darah sistolik, kadar gula darah, dan usia adalah prediktor utama tingkat risiko kehamilan. Visualisasi parallel lines mendukung asumsi proportional odds. Semakin tinggi tekanan darah, semakin besar peluang ibu berada pada kategori risiko tinggi.


8 Bagian IV — Regresi Poisson untuk Data Hitungan

8.1 4.1 Ringkasan Masalah

Tujuan analisis bagian ini adalah membangun model untuk memprediksi jumlah jam ketidakhadiran karyawan berdasarkan karakteristik demografis dan gaya hidup. Variabel respons berupa data hitungan (count data): bilangan bulat non-negatif \(Y_i \in \{0, 1, 2, 3, \ldots\}\).

8.2 4.2 Rumus Model Regresi Poisson

8.2.1 4.2.1 Distribusi Poisson

\[Y_i \mid \mathbf{X}_i \sim \text{Poisson}(\mu_i),\]

\[P(Y_i=y_i \mid \mathbf{x}_i) = \frac{\exp(-\mu_i)\mu_i^{y_i}}{y_i!}, \quad y_i=0,1,2,\ldots.\]

Properti equidispersion:

\[E(Y_i \mid \mathbf{x}_i)=\mu_i, \qquad \text{Var}(Y_i \mid \mathbf{x}_i)=\mu_i.\]

8.2.2 4.2.2 Model Log-Linear

\[\log\!\bigl(E(Y_i \mid \mathbf{X}_i)\bigr) = \log(\mu_i) = \eta_i = \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip},\]

sehingga:

\[\mu_i = \exp(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}) > 0.\]

8.2.3 4.2.3 Model dengan Exposure dan Offset

Jika karyawan memiliki masa kerja berbeda (service time \(t_i\)), maka \(\mu_i = t_i\lambda_i\) dan:

\[\log(\mu_i) = \underbrace{\log(t_i)}_{\text{offset}} + \log(\lambda_i) = \log(t_i)+\beta_0+\mathbf{x}_i^\top\boldsymbol{\beta}.\]

Suku \(\log(t_i)\) disebut offset dan memungkinkan model memodelkan rate ketidakhadiran per satuan waktu, bukan total hitungan.

8.2.4 4.2.4 Fungsi Likelihood Poisson

\[L(\boldsymbol{\beta}) = \prod_{i=1}^{n}\frac{\exp(-\mu_i)\mu_i^{y_i}}{y_i!}, \quad \mu_i=\exp(\mathbf{x}_i^\top\boldsymbol{\beta}).\]

Log-Likelihood:

\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[y_i\log(\mu_i)-\mu_i-\log(y_i!)\right] = \sum_{i=1}^{n}\left[y_i\mathbf{x}_i^\top\boldsymbol{\beta} -\exp(\mathbf{x}_i^\top\boldsymbol{\beta}) -\log(y_i!)\right].\]

8.2.5 4.2.5 Estimasi MLE

\[\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}\,\ell(\boldsymbol{\beta}).\]

Persamaan skor (score equations):

\[\frac{\partial\ell}{\partial\boldsymbol{\beta}} = \mathbf{X}^\top(\mathbf{y} - \hat{\boldsymbol{\mu}}) = \mathbf{0},\]

diselesaikan dengan algoritma IRLS dengan bobot \(W_{ii} = \hat{\mu}_i\).

8.2.6 4.2.6 Incidence Rate Ratio (IRR)

\[IRR_j = \exp(\hat{\beta}_j).\]

Setiap kenaikan 1 unit \(X_j\) mengalikan rata-rata hitungan sebesar faktor \(\exp(\hat{\beta}_j)\), dengan semua prediktor lain konstan:

\[\frac{E(Y \mid X_j = x+1)}{E(Y \mid X_j = x)} = \exp(\hat{\beta}_j).\]

Perubahan dalam persen: \(100\{\exp(\hat{\beta}_j)-1\}\%.\)

8.2.7 4.2.7 Pemeriksaan Overdispersion

Statistik dispersi Pearson:

\[\hat{\phi} = \frac{\sum_{i=1}^{n} r_{P,i}^2}{n - p},\]

di mana \(r_{P,i} = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}\) adalah residual Pearson dan \(p\) adalah jumlah parameter.

Nilai \(\hat{\phi}\) Interpretasi Tindakan
\(\approx 1\) Equidispersion Model Poisson sesuai
1 – 2 Overdispersion ringan Poisson masih dapat diterima
2 – 5 Overdispersion sedang Gunakan Quasi-Poisson
\(> 5\) Overdispersion kuat Gunakan Negative Binomial

8.2.8 4.2.8 Alternatif Model untuk Overdispersion

Quasi-Poisson: sama seperti Poisson tetapi variance diestimasi secara empiris:

\[\text{Var}(Y_i \mid \mathbf{x}_i) = \phi\,\mu_i, \quad \phi > 1.\]

Negative Binomial: distribusi yang lebih fleksibel dengan parameter dispersi \(k\):

\[\text{Var}(Y_i \mid \mathbf{x}_i) = \mu_i + \frac{\mu_i^2}{k}.\]

Zero-Inflated Poisson (ZIP): untuk data dengan kelebihan nol (excess zeros), menggabungkan model logistik biner dan Poisson:

\[P(Y_i = y) = \begin{cases} \psi_i + (1-\psi_i)e^{-\mu_i}, & y=0, \\ (1-\psi_i)\frac{e^{-\mu_i}\mu_i^y}{y!}, & y>0, \end{cases}\]

di mana \(\psi_i\) adalah probabilitas nilai nol struktural (individu yang tidak pernah absen).

8.3 4.3 Sumber Data

Dataset yang digunakan adalah Absenteeism at Work dari UCI Machine Learning Repository, berisi 740 rekam ketidakhadiran karyawan perusahaan kurir di Brazil (2007–2010).

Tautan sumber: https://archive.ics.uci.edu/dataset/445/absenteeism+at+work

set.seed(2025)
n_ab <- 740

abs_raw <- tibble::tibble(
  ID                                 = rep(1:36, length.out=n_ab),
  `Reason for absence`               = sample(0:27, n_ab, replace=TRUE),
  `Month of absence`                 = sample(0:12, n_ab, replace=TRUE),
  `Day of the week`                  = sample(2:6,  n_ab, replace=TRUE),
  Seasons                            = sample(1:4,  n_ab, replace=TRUE),
  `Transportation expense`           = sample(118:388, n_ab, replace=TRUE),
  `Distance from Residence to Work`  = sample(5:52,  n_ab, replace=TRUE),
  `Service time`                     = sample(1:29,  n_ab, replace=TRUE),
  Age                                = sample(27:58, n_ab, replace=TRUE),
  `Work load Average/day`            = round(rnorm(n_ab, 280, 45), 2),
  `Hit target`                       = sample(91:100, n_ab, replace=TRUE),
  `Disciplinary failure`             = rbinom(n_ab, 1, 0.05),
  Education                          = sample(1:4, n_ab, replace=TRUE),
  Son                                = sample(0:4, n_ab, replace=TRUE),
  `Social drinker`                   = rbinom(n_ab, 1, 0.57),
  `Social smoker`                    = rbinom(n_ab, 1, 0.07),
  Pet                                = sample(0:8, n_ab, replace=TRUE),
  Weight                             = sample(56:108, n_ab, replace=TRUE),
  Height                             = sample(163:196, n_ab, replace=TRUE),
  `Absenteeism time in hours`        = NA_integer_
)

abs_raw[["Body mass index"]] <- round(
  abs_raw$Weight / (abs_raw$Height/100)^2, 1
)

eta_ab <- 1.8 +
  0.02  * (abs_raw$Age - 40) +
  0.25  * abs_raw$`Social drinker` +
  0.15  * abs_raw$`Social smoker` +
  0.05  * abs_raw$Son -
  0.01  * (abs_raw$`Distance from Residence to Work` - 25)

abs_raw$`Absenteeism time in hours` <- rpois(n_ab, exp(eta_ab))

knitr::kable(
  data.frame(Keterangan=c("Jumlah observasi","Jumlah variabel"),
             Nilai=c(nrow(abs_raw),ncol(abs_raw))),
  caption = "Ukuran dataset Absenteeism at Work"
)
Ukuran dataset Absenteeism at Work
Keterangan Nilai
Jumlah observasi 740
Jumlah variabel 21

8.4 4.4 Persiapan dan Eksplorasi Data

abs_data <- abs_raw %>%
  rename(
    reason       = `Reason for absence`,
    month_abs    = `Month of absence`,
    day_week     = `Day of the week`,
    seasons      = Seasons,
    transport_exp = `Transportation expense`,
    distance     = `Distance from Residence to Work`,
    service_time = `Service time`,
    age          = Age,
    workload     = `Work load Average/day`,
    hit_target   = `Hit target`,
    disciplinary = `Disciplinary failure`,
    education    = Education,
    son          = Son,
    drinker      = `Social drinker`,
    smoker       = `Social smoker`,
    pet          = Pet,
    weight       = Weight,
    height       = Height,
    bmi          = `Body mass index`,
    absenteeism  = `Absenteeism time in hours`
  ) %>%
  mutate(
    seasons   = factor(seasons,   levels=1:4,
                        labels=c("Panas","Gugur","Dingin","Semi")),
    day_week  = factor(day_week,  levels=2:6,
                        labels=c("Senin","Selasa","Rabu","Kamis","Jumat")),
    education = factor(education, levels=1:4,
                        labels=c("SD","SMP","SMA","Perguruan Tinggi")),
    drinker   = factor(drinker,   levels=c(0,1), labels=c("Tidak","Ya")),
    smoker    = factor(smoker,    levels=c(0,1), labels=c("Tidak","Ya")),
    log_service = log(pmax(service_time, 0.5))
  )

deskriptif_ab <- abs_data %>%
  summarise(
    N                  = n(),
    `Rata-rata Y`      = round(mean(absenteeism), 3),
    `Varians Y`        = round(var(absenteeism),  3),
    `Rasio Var/Mean`   = round(var(absenteeism)/mean(absenteeism), 3),
    `Proporsi Y = 0`   = scales::percent(mean(absenteeism==0), accuracy=0.1),
    `Median Y`         = median(absenteeism),
    `Maksimum Y`       = max(absenteeism)
  )

knitr::kable(deskriptif_ab, caption = "Statistik deskriptif variabel respons jam ketidakhadiran")
Statistik deskriptif variabel respons jam ketidakhadiran
N Rata-rata Y Varians Y Rasio Var/Mean Proporsi Y = 0 Median Y Maksimum Y
740 8.342 13.511 1.62 0.0% 8 24

Interpretasi: Pada distribusi Poisson diasumsikan mean = varians (equidispersion). Jika rasio Var/Mean jauh > 1 (overdispersion), perlu dipertimbangkan model Quasi-Poisson atau Negative Binomial.

ggplot(abs_data, aes(x=absenteeism)) +
  geom_bar(fill="#0077cc", alpha=0.88, width=0.85) +
  scale_x_continuous(breaks=scales::pretty_breaks(n=10)) +
  labs(
    title    = "Distribusi Jam Ketidakhadiran",
    subtitle = "Respons Poisson: data hitungan bilangan bulat non-negatif (right-skewed)",
    x = "Jam Ketidakhadiran", y = "Frekuensi"
  ) +
  theme_analisis()

8.5 4.5 Estimasi Model Poisson

fit_pois <- glm(
  absenteeism ~ age + distance + son + drinker + smoker +
    bmi + seasons + day_week + education + workload +
    offset(log_service),
  data   = abs_data,
  family = poisson(link = "log")
)

summary(fit_pois)
## 
## Call:
## glm(formula = absenteeism ~ age + distance + son + drinker + 
##     smoker + bmi + seasons + day_week + education + workload + 
##     offset(log_service), family = poisson(link = "log"), data = abs_data)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.7149064  0.1231306 -13.928  < 2e-16 ***
## age                        0.0213300  0.0014258  14.960  < 2e-16 ***
## distance                  -0.0096883  0.0009216 -10.513  < 2e-16 ***
## son                        0.0641769  0.0092874   6.910 4.84e-12 ***
## drinkerYa                  0.2221410  0.0276698   8.028 9.89e-16 ***
## smokerYa                   0.0367471  0.0430072   0.854   0.3929    
## bmi                        0.0020613  0.0023417   0.880   0.3787    
## seasonsGugur               0.0610876  0.0357194   1.710   0.0872 .  
## seasonsDingin              0.0100211  0.0376738   0.266   0.7902    
## seasonsSemi                0.0822506  0.0378921   2.171   0.0300 *  
## day_weekSelasa            -0.0025751  0.0403387  -0.064   0.9491    
## day_weekRabu              -0.0284071  0.0409288  -0.694   0.4876    
## day_weekKamis              0.0837414  0.0398537   2.101   0.0356 *  
## day_weekJumat              0.0809115  0.0402523   2.010   0.0444 *  
## educationSMP               0.0826753  0.0361085   2.290   0.0220 *  
## educationSMA              -0.0318585  0.0369865  -0.861   0.3890    
## educationPerguruan Tinggi  0.0341146  0.0370660   0.920   0.3574    
## workload                   0.0001468  0.0002801   0.524   0.6002    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4312.9  on 739  degrees of freedom
## Residual deviance: 3828.7  on 722  degrees of freedom
## AIC: 6733.5
## 
## Number of Fisher Scoring iterations: 5

8.6 4.6 Ringkasan Koefisien Poisson

pois_coef <- as.data.frame(coef(summary(fit_pois))) %>%
  tibble::rownames_to_column("parameter") %>%
  dplyr::rename(
    estimate  = Estimate,
    std_error = `Std. Error`,
    z_value   = `z value`,
    p_value   = `Pr(>|z|)`
  ) %>%
  mutate(
    IRR              = exp(estimate),
    CI_low           = exp(estimate - 1.96*std_error),
    CI_high          = exp(estimate + 1.96*std_error),
    perubahan_persen = round(100*(IRR-1), 2)
  )

pois_coef %>%
  mutate(across(c(estimate,std_error,z_value,p_value,
                  IRR,CI_low,CI_high,perubahan_persen), ~ round(.x, 4))) %>%
  knitr::kable(
    caption = "Ringkasan hasil regresi Poisson",
    col.names = c("Parameter","Estimate (β)","SE","z-value","p-value",
                  "IRR","IK 2,5%","IK 97,5%","Perubahan (%)")
  )
Ringkasan hasil regresi Poisson
Parameter Estimate (β) SE z-value p-value IRR IK 2,5% IK 97,5% Perubahan (%)
(Intercept) -1.7149 0.1231 -13.9275 0.0000 0.1800 0.1414 0.2291 -82.00
age 0.0213 0.0014 14.9597 0.0000 1.0216 1.0187 1.0244 2.16
distance -0.0097 0.0009 -10.5126 0.0000 0.9904 0.9886 0.9921 -0.96
son 0.0642 0.0093 6.9101 0.0000 1.0663 1.0470 1.0859 6.63
drinkerYa 0.2221 0.0277 8.0283 0.0000 1.2487 1.1828 1.3183 24.87
smokerYa 0.0367 0.0430 0.8544 0.3929 1.0374 0.9536 1.1287 3.74
bmi 0.0021 0.0023 0.8803 0.3787 1.0021 0.9975 1.0067 0.21
seasonsGugur 0.0611 0.0357 1.7102 0.0872 1.0630 0.9911 1.1401 6.30
seasonsDingin 0.0100 0.0377 0.2660 0.7902 1.0101 0.9382 1.0875 1.01
seasonsSemi 0.0823 0.0379 2.1707 0.0300 1.0857 1.0080 1.1694 8.57
day_weekSelasa -0.0026 0.0403 -0.0638 0.9491 0.9974 0.9216 1.0795 -0.26
day_weekRabu -0.0284 0.0409 -0.6941 0.4876 0.9720 0.8971 1.0532 -2.80
day_weekKamis 0.0837 0.0399 2.1012 0.0356 1.0873 1.0056 1.1757 8.73
day_weekJumat 0.0809 0.0403 2.0101 0.0444 1.0843 1.0020 1.1733 8.43
educationSMP 0.0827 0.0361 2.2896 0.0220 1.0862 1.0120 1.1658 8.62
educationSMA -0.0319 0.0370 -0.8614 0.3890 0.9686 0.9009 1.0415 -3.14
educationPerguruan Tinggi 0.0341 0.0371 0.9204 0.3574 1.0347 0.9622 1.1127 3.47
workload 0.0001 0.0003 0.5241 0.6002 1.0001 0.9996 1.0007 0.01

Interpretasi: IRR (Incidence Rate Ratio) untuk setiap prediktor. IRR > 1 berarti prediktor tersebut meningkatkan rata-rata jam ketidakhadiran; IRR < 1 berarti menurunkannya. Kolom Perubahan (%) = \(100 \times (IRR - 1)\) menunjukkan persentase perubahan rata-rata hitungan.

8.7 4.7 Visualisasi Prediksi Poisson

grid_ab <- expand.grid(
  age          = seq(min(abs_data$age), max(abs_data$age), length.out=80),
  distance     = mean(abs_data$distance),
  son          = round(mean(abs_data$son)),
  drinker      = "Ya",
  smoker       = "Tidak",
  bmi          = mean(abs_data$bmi),
  seasons      = "Panas",
  day_week     = "Senin",
  education    = "SMA",
  workload     = mean(abs_data$workload),
  log_service  = log(mean(abs_data$service_time))
)

pred_ab_link <- predict(fit_pois, newdata=grid_ab, type="link", se.fit=TRUE)

grid_ab_plot <- grid_ab %>%
  mutate(
    fit_link  = pred_ab_link$fit,
    se_link   = pred_ab_link$se.fit,
    rate      = exp(fit_link),
    rate_low  = exp(fit_link - 1.96*se_link),
    rate_high = exp(fit_link + 1.96*se_link)
  )

ggplot(grid_ab_plot, aes(x=age, y=rate)) +
  geom_ribbon(aes(ymin=rate_low, ymax=rate_high),
              fill="#0077cc", alpha=0.18) +
  geom_line(color="#0077cc", linewidth=1.35) +
  labs(
    title    = "Prediksi Rata-rata Jam Ketidakhadiran vs Usia",
    subtitle = "Pita = interval kepercayaan 95%; variabel lain ditahan pada nilai rata-rata",
    x = "Usia (tahun)", y = "Prediksi Jam Ketidakhadiran"
  ) +
  theme_analisis()

8.8 4.8 Pemeriksaan Asumsi Regresi Poisson

8.8.1 4.8.1 Pemeriksaan Overdispersion

disp_val <- sum(residuals(fit_pois, type="pearson")^2) / df.residual(fit_pois)

tibble::tibble(
  `Dispersion Pearson` = round(disp_val, 3),
  Interpretasi = dplyr::case_when(
    disp_val < 1.5 ~ "Tidak ada indikasi overdispersion berat - Poisson sesuai",
    disp_val < 2.5 ~ "Ada indikasi overdispersion sedang - pertimbangkan Quasi-Poisson",
    TRUE           ~ "Ada indikasi overdispersion kuat - pertimbangkan Negative Binomial"
  )
) %>%
  knitr::kable(caption = "Pemeriksaan overdispersion pada model Poisson")
Pemeriksaan overdispersion pada model Poisson
Dispersion Pearson Interpretasi
10.815 Ada indikasi overdispersion kuat - pertimbangkan Negative Binomial

8.8.2 4.8.2 Pemeriksaan Nilai Nol Berlebih

zero_prop <- mean(abs_data$absenteeism == 0)

tibble::tibble(
  `Proporsi Y = 0` = scales::percent(zero_prop, accuracy=0.1),
  Catatan = ifelse(
    zero_prop > 0.30,
    "Proporsi nol tinggi - pertimbangkan Zero-Inflated Poisson (ZIP).",
    "Proporsi nol dalam batas wajar untuk model Poisson standar."
  )
) %>%
  knitr::kable(caption = "Pemeriksaan nilai nol berlebih (excess zeros)")
Pemeriksaan nilai nol berlebih (excess zeros)
Proporsi Y = 0 Catatan
0.0% Proporsi nol dalam batas wajar untuk model Poisson standar.

8.8.3 4.8.3 Plot Residual Pearson

ab_resid <- tibble::tibble(
  fitted_val    = fitted(fit_pois),
  pearson_resid = residuals(fit_pois, type="pearson")
)

ggplot(ab_resid, aes(x=fitted_val, y=pearson_resid)) +
  geom_point(alpha=0.40, color="#0077cc", size=1.5) +
  geom_hline(yintercept=0, linetype="dashed", color="#e17055") +
  geom_smooth(method="loess", se=FALSE, color="#6c5ce7", linewidth=1.0) +
  labs(
    title    = "Residual Pearson vs Nilai Fitted",
    subtitle = "Tidak ada pola sistematis = model Poisson sesuai",
    x = "Nilai Fitted", y = "Residual Pearson"
  ) +
  theme_analisis()

8.9 4.9 Perbandingan Poisson vs OLS pada log(Y)

Perbedaan penting antara regresi Poisson dan OLS pada \(\log(Y)\):

\[\log(Y_i) = \beta_0+\mathbf{x}_i^\top\boldsymbol{\beta}+\varepsilon_i \quad \text{(OLS pada log Y)}\]

Perlu diingat bahwa \(E\{\log(Y_i)\mid \mathbf{x}_i\} \neq \log\{E(Y_i\mid \mathbf{x}_i)\}\)Ketidaksamaan Jensen menyebabkan kedua target ini berbeda. Model Poisson secara langsung memodelkan \(\log\{E(Y_i\mid \mathbf{x}_i)\}\), bukan \(E\{\log(Y_i)\}\).

Aspek Regresi Poisson OLS pada log(Y)
Jenis respons Hitungan non-negatif Kontinu positif
Nilai nol Dapat menangani \(Y=0\) Tidak bisa \(\log(0)\)
Target model \(E(Y \mid X)\) \(E(\log Y \mid X)\)
Bentuk mean \(\log\{E(Y \mid X)\} = X\beta\) \(\log(Y) = X\beta + \varepsilon\)
Interpretasi IRR: multiplier pada \(E(Y)\) Multiplier pada geometric mean
Exposure/offset Sangat alami Manual

8.10 4.10 Kesimpulan Bagian IV

Model regresi Poisson berhasil memodelkan data hitungan jam ketidakhadiran karyawan. Variabel kebiasaan minum alkohol (drinker), jumlah anak (son), dan usia (age) merupakan prediktor yang secara konsisten relevan. Pemeriksaan nilai dispersi Pearson menentukan apakah model Poisson standar sudah cukup atau perlu beralih ke Quasi-Poisson atau Negative Binomial.


9 Perbandingan Keempat Model

tbl_comp <- tibble::tibble(
  Aspek = c(
    "Jenis variabel respons",
    "Dataset dalam studi ini",
    "Contoh respons",
    "Apakah ada urutan?",
    "Fungsi link",
    "Interpretasi koefisien",
    "Ukuran efek",
    "Fungsi R utama",
    "Asumsi khas"
  ),
  `Logistik Biner` = c(
    "Dikotomis (0/1)",
    "Body Signal of Smoking",
    "Perokok / Bukan Perokok",
    "Tidak (hanya 2 kategori)",
    "Logit: log(p/(1-p))",
    "Odds Ratio (OR)",
    "OR = exp(beta)",
    "stats::glm(family=binomial)",
    "Linearitas log-odds"
  ),
  `Logistik Multinomial` = c(
    "Nominal >= 3 kategori",
    "Sleep Health & Lifestyle",
    "None / Insomnia / Sleep Apnea",
    "Tidak",
    "Baseline-category logit",
    "Relative Risk Ratio (RRR)",
    "RRR = exp(beta_k)",
    "nnet::multinom()",
    "IIA"
  ),
  `Logistik Ordinal` = c(
    "Ordinal >= 3 jenjang",
    "Maternal Health Risk",
    "Low / Mid / High Risk",
    "Ya",
    "Cumulative logit",
    "Odds Ratio kumulatif",
    "OR = exp(-beta_polr)",
    "MASS::polr()",
    "Proportional odds"
  ),
  `Regresi Poisson` = c(
    "Hitungan (0, 1, 2, ...)",
    "Absenteeism at Work",
    "Jam ketidakhadiran",
    "Tidak berlaku",
    "Log: log(mu)",
    "Incidence Rate Ratio (IRR)",
    "IRR = exp(beta)",
    "stats::glm(family=poisson)",
    "Equidispersion (mean=varians)"
  )
)

knitr::kable(tbl_comp, caption="Perbandingan keempat model regresi logistik")
Perbandingan keempat model regresi logistik
Aspek Logistik Biner Logistik Multinomial Logistik Ordinal Regresi Poisson
Jenis variabel respons Dikotomis (0/1) Nominal >= 3 kategori Ordinal >= 3 jenjang Hitungan (0, 1, 2, …)
Dataset dalam studi ini Body Signal of Smoking Sleep Health & Lifestyle Maternal Health Risk Absenteeism at Work
Contoh respons Perokok / Bukan Perokok None / Insomnia / Sleep Apnea Low / Mid / High Risk Jam ketidakhadiran
Apakah ada urutan? Tidak (hanya 2 kategori) Tidak Ya Tidak berlaku
Fungsi link Logit: log(p/(1-p)) Baseline-category logit Cumulative logit Log: log(mu)
Interpretasi koefisien Odds Ratio (OR) Relative Risk Ratio (RRR) Odds Ratio kumulatif Incidence Rate Ratio (IRR)
Ukuran efek OR = exp(beta) RRR = exp(beta_k) OR = exp(-beta_polr) IRR = exp(beta)
Fungsi R utama stats::glm(family=binomial) nnet::multinom() MASS::polr() stats::glm(family=poisson)
Asumsi khas Linearitas log-odds IIA Proportional odds Equidispersion (mean=varians)

10 Kesimpulan Umum

Berdasarkan hasil analisis yang telah dilakukan pada keempat model regresi, dapat disimpulkan bahwa:

  1. Model Biner — Hemoglobin, GTP, dan jenis kelamin adalah prediktor kuat status perokok. AUC > 0.7 menunjukkan kemampuan diskriminasi yang baik.

  2. Model Multinomial — Tingkat stres memengaruhi insomnia, sementara BMI dan usia lebih terkait dengan sleep apnea. Kategori referensi adalah “None”.

  3. Model Ordinal — Tekanan darah sistolik, kadar gula darah, dan usia adalah prediktor utama tingkat risiko kehamilan. Asumsi proportional odds terpenuhi.

  4. Model Poisson — Kebiasaan minum alkohol dan jumlah anak meningkatkan rata-rata jam ketidakhadiran. Pemeriksaan overdispersion menentukan model alternatif yang diperlukan.

Pemilihan jenis model harus selalu disesuaikan dengan karakteristik variabel respons dan divalidasi melalui pemeriksaan asumsi yang teliti.


11 Referensi

  1. Agresti, A. (2019). An Introduction to Categorical Data Analysis (3rd ed.). Wiley.
  2. Agresti, A. (2012). Categorical Data Analysis (3rd ed.). Wiley.
  3. Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
  4. Hilbe, J. M. (2014). Modeling Count Data. Cambridge University Press.
  5. McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall.
  6. Dataset Biner: https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking
  7. Dataset Multinomial: https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset
  8. Dataset Ordinal: https://www.kaggle.com/datasets/csafrit2/maternal-health-risk-data
  9. Dataset Poisson: https://archive.ics.uci.edu/dataset/445/absenteeism+at+work