Tujuan analisis
Laporan ini
menganalisis empat model regresi penting untuk respons non-kontinu:
| Jenis Respons | Contoh | Model |
|---|---|---|
| Biner (0/1) | Kunjungan tinggi vs rendah | Regresi Logistik Biner |
| Nominal / tidak berurutan | Pilihan moda transportasi | Regresi Multinomial |
| Ordinal / berurutan | Tingkat kepuasan pasien | Regresi Ordinal |
| Hitungan / count | Jumlah kunjungan dokter | Regresi Poisson |
Regresi logistik biner bertanya: > Bagaimana prediktor memengaruhi peluang terjadinya suatu kejadian (Ya/Tidak)?
Regresi logistik multinomial bertanya: > Dibandingkan kategori referensi, bagaimana peluang seseorang masuk kategori A, B, atau C berubah ketika prediktor meningkat?
Regresi logistik ordinal bertanya: > Bagaimana prediktor menggeser kecenderungan seseorang berada pada tingkat respons yang lebih tinggi atau lebih rendah?
Regresi Poisson bertanya: > Bagaimana prediktor memengaruhi rata-rata jumlah kejadian yang diamati?
| Aspek | Biner | Multinomial | Ordinal | Poisson |
|---|---|---|---|---|
| Skala respons | Biner (0/1) | Nominal | Ordinal | Count (≥ 0) |
| Fungsi penghubung | Logit | Logit (baseline-cat.) | Cumulative logit | Log |
| Interpretasi exp(β) | Odds Ratio (OR) | Relative Risk Ratio (RRR) | Odds Ratio (OR) | Incidence Rate Ratio (IRR) |
| Paket R | stats::glm |
nnet::multinom |
MASS::polr |
stats::glm |
Sumber data: Adaptasi dari Deb & Trivedi (1997), Demand for Medical Care by the Elderly, Journal of Applied Econometrics. Variabel dependen Poisson (Y): Jumlah kunjungan ke dokter dalam setahun. Variabel dependen Biner (Y): Kunjungan tinggi (≥ 5 kali) vs kunjungan rendah (< 5 kali). Variabel independen: Usia, Pendapatan (juta Rp/bulan), Skor kesehatan (1–10), Asuransi (1 = Ya, 0 = Tidak). n = 20 observasi
# FIX: asuransi dibiarkan numerik (0/1) agar glm Poisson tidak error
# FIX: data ini dipakai bersama untuk Bagian I (Biner) dan Bagian IV (Poisson)
data_pois <- tibble(
usia = c(25, 30, 35, 40, 45, 50, 55, 60, 28, 33,
38, 43, 48, 53, 58, 27, 32, 37, 42, 47),
pendapatan = c( 3, 4, 5, 6, 7, 8, 9, 10,3.5,4.5,
5.5,6.5,7.5,8.5,9.5,3.2,4.2,5.2,6.2,7.2),
skor_kes = c( 8, 7, 6, 5, 4, 3, 2, 1, 8, 7,
6, 5, 4, 3, 2, 8, 7, 6, 5, 4),
asuransi = c( 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 1),
kunjungan = c( 1, 2, 2, 3, 4, 5, 6, 7, 1, 2,
3, 3, 4, 5, 6, 1, 2, 3, 4, 5)
)
knitr::kable(data_pois, caption = "Data Kunjungan ke Dokter (n = 20)")| usia | pendapatan | skor_kes | asuransi | kunjungan |
|---|---|---|---|---|
| 25 | 3.0 | 8 | 1 | 1 |
| 30 | 4.0 | 7 | 1 | 2 |
| 35 | 5.0 | 6 | 1 | 2 |
| 40 | 6.0 | 5 | 1 | 3 |
| 45 | 7.0 | 4 | 1 | 4 |
| 50 | 8.0 | 3 | 1 | 5 |
| 55 | 9.0 | 2 | 1 | 6 |
| 60 | 10.0 | 1 | 1 | 7 |
| 28 | 3.5 | 8 | 0 | 1 |
| 33 | 4.5 | 7 | 0 | 2 |
| 38 | 5.5 | 6 | 0 | 3 |
| 43 | 6.5 | 5 | 0 | 3 |
| 48 | 7.5 | 4 | 0 | 4 |
| 53 | 8.5 | 3 | 0 | 5 |
| 58 | 9.5 | 2 | 0 | 6 |
| 27 | 3.2 | 8 | 1 | 1 |
| 32 | 4.2 | 7 | 1 | 2 |
| 37 | 5.2 | 6 | 0 | 3 |
| 42 | 6.2 | 5 | 0 | 4 |
| 47 | 7.2 | 4 | 1 | 5 |
Sumber data: Adaptasi dari Greene (2012), Econometric Analysis, 7th ed. Variabel dependen (Y): Moda transportasi yang dipilih — Bus, Kereta, atau Mobil Pribadi. n = 20 observasi
data_multi <- tibble(
pendapatan = c(3.2, 7.5, 4.8, 2.1, 9.3, 5.6, 1.8,11.2, 6.4, 3.9,
8.7, 5.1, 2.7,13.5, 4.3, 6.9, 2.4, 7.8,10.1, 3.5),
jarak = c( 12, 35, 22, 8, 45, 28, 6, 52, 30, 15,
40, 25, 10, 60, 20, 33, 9, 38, 50, 14),
usia = c( 28, 42, 35, 23, 50, 38, 20, 55, 45, 31,
48, 36, 25, 58, 33, 41, 22, 46, 53, 29),
gender = c("L","L","P","P","L","P","L","L","P","L",
"P","L","P","L","P","L","L","P","L","P"),
moda = c("Bus","Mobil","Kereta","Bus","Mobil","Kereta",
"Bus","Mobil","Kereta","Bus","Mobil","Kereta",
"Bus","Mobil","Kereta","Mobil","Bus","Kereta",
"Mobil","Bus")
)
data_multi$moda <- relevel(factor(data_multi$moda), ref = "Bus")
knitr::kable(data_multi, caption = "Data Pemilihan Moda Transportasi (n = 20)")| pendapatan | jarak | usia | gender | moda |
|---|---|---|---|---|
| 3.2 | 12 | 28 | L | Bus |
| 7.5 | 35 | 42 | L | Mobil |
| 4.8 | 22 | 35 | P | Kereta |
| 2.1 | 8 | 23 | P | Bus |
| 9.3 | 45 | 50 | L | Mobil |
| 5.6 | 28 | 38 | P | Kereta |
| 1.8 | 6 | 20 | L | Bus |
| 11.2 | 52 | 55 | L | Mobil |
| 6.4 | 30 | 45 | P | Kereta |
| 3.9 | 15 | 31 | L | Bus |
| 8.7 | 40 | 48 | P | Mobil |
| 5.1 | 25 | 36 | L | Kereta |
| 2.7 | 10 | 25 | P | Bus |
| 13.5 | 60 | 58 | L | Mobil |
| 4.3 | 20 | 33 | P | Kereta |
| 6.9 | 33 | 41 | L | Mobil |
| 2.4 | 9 | 22 | L | Bus |
| 7.8 | 38 | 46 | P | Kereta |
| 10.1 | 50 | 53 | L | Mobil |
| 3.5 | 14 | 29 | P | Bus |
Sumber data: Dataset Patient Satisfaction dari
Kaggle (Dimitrievska, 2020). Dataset berisi 453 responden yang menilai
kualitas pelayanan kesehatan di North Macedonia. Variabel
dependen (Y): satisfaction in RM (1 = Tidak Puas,
2 = Sebagian Puas, 3 = Puas). Variabel independen:
Check up appointment, Time waiting, Admin procedures, Hygiene and
cleaning, Communication with dr, Exact diagnosis, Modern equipment.
n = 453 observasi
# Sesuaikan path file CSV di bawah ini
data_ord_raw <- read_csv("C:/Users/Lenovo/OneDrive/Desktop/SEMESTER 4/ADK/adk tiba tiba reglog lengkap/datasetsatisfaction.csv")
data_ord <- data_ord_raw %>%
rename(
kepuasan = `satisfaction in RM`,
checkup = `Check up appointment`,
waiting = `Time waiting`,
admin = `Admin procedures`,
hygiene = `Hygiene and cleaning`,
communication = `Communication with dr`,
diagnosis = `Exact diagnosis`,
equipment = `Modern equipment`
) %>%
# FIX: pastikan semua prediktor numerik (bukan karakter)
mutate(across(c(checkup, waiting, admin, hygiene,
communication, diagnosis, equipment),
as.numeric)) %>%
na.omit() %>%
mutate(
kepuasan = ordered(kepuasan,
levels = 1:3,
labels = c("Tidak Puas", "Sebagian Puas", "Puas"))
)
cat("Dimensi data ordinal:", nrow(data_ord), "x", ncol(data_ord), "\n")## Dimensi data ordinal: 452 x 17
## # A tibble: 6 × 17
## kepuasan checkup waiting admin hygiene `Time of appointment`
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tidak Puas 1 1 1 1 1
## 2 Sebagian Puas 2 1 2 1 2
## 3 Sebagian Puas 1 1 2 2 2
## 4 Sebagian Puas 2 4 1 1 2
## 5 Puas 4 1 1 2 1
## 6 Sebagian Puas 2 3 4 1 1
## # ℹ 11 more variables: `Quality/experience dr.` <dbl>,
## # `Specialists avaliable` <dbl>, communication <dbl>, diagnosis <dbl>,
## # equipment <dbl>, `friendly health care workers` <dbl>,
## # `lab services` <dbl>, `avaliablity of drugs` <dbl>, `waiting rooms` <dbl>,
## # `hospital rooms quality` <dbl>, `parking, playing rooms, caffes` <dbl>
Regresi logistik biner digunakan ketika variabel respons hanya memiliki dua kategori:
\[Y_i \in \{0, 1\}\]
Model memodelkan probabilitas kejadian \(Y_i = 1\) melalui fungsi logit:
\[\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\]
sehingga peluang prediksi:
\[\hat{\pi}_i = \frac{1}{1 + \exp(-\hat{\eta}_i)}\]
Data menggunakan dataset yang sama dengan regresi Poisson. Variabel respons dibentuk menjadi:
data_bin <- data_pois %>%
mutate(kunjungan_biner = as.integer(kunjungan >= 5))
knitr::kable(data_bin, caption = "Data Regresi Logistik Biner")| usia | pendapatan | skor_kes | asuransi | kunjungan | kunjungan_biner |
|---|---|---|---|---|---|
| 25 | 3.0 | 8 | 1 | 1 | 0 |
| 30 | 4.0 | 7 | 1 | 2 | 0 |
| 35 | 5.0 | 6 | 1 | 2 | 0 |
| 40 | 6.0 | 5 | 1 | 3 | 0 |
| 45 | 7.0 | 4 | 1 | 4 | 0 |
| 50 | 8.0 | 3 | 1 | 5 | 1 |
| 55 | 9.0 | 2 | 1 | 6 | 1 |
| 60 | 10.0 | 1 | 1 | 7 | 1 |
| 28 | 3.5 | 8 | 0 | 1 | 0 |
| 33 | 4.5 | 7 | 0 | 2 | 0 |
| 38 | 5.5 | 6 | 0 | 3 | 0 |
| 43 | 6.5 | 5 | 0 | 3 | 0 |
| 48 | 7.5 | 4 | 0 | 4 | 0 |
| 53 | 8.5 | 3 | 0 | 5 | 1 |
| 58 | 9.5 | 2 | 0 | 6 | 1 |
| 27 | 3.2 | 8 | 1 | 1 | 0 |
| 32 | 4.2 | 7 | 1 | 2 | 0 |
| 37 | 5.2 | 6 | 0 | 3 | 0 |
| 42 | 6.2 | 5 | 0 | 4 | 0 |
| 47 | 7.2 | 4 | 1 | 5 | 1 |
data_bin %>%
count(kunjungan_biner) %>%
mutate(
Label = ifelse(kunjungan_biner == 1, "Tinggi (≥5)", "Rendah (<5)"),
Persen = round(n / sum(n) * 100, 1)
) %>%
knitr::kable(
col.names = c("Kategori (0/1)", "Label", "Frekuensi", "Persentase (%)"),
caption = "Distribusi Respons Biner"
)| Kategori (0/1) | Label | Frekuensi | Persentase (%) |
|---|---|---|---|
| 0 | 14 | Rendah (<5) | 70 |
| 1 | 6 | Tinggi (≥5) | 30 |
\[\log\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1(\text{usia}) + \beta_2(\text{pendapatan}) + \beta_3(\text{skor\_kes}) + \beta_4(\text{asuransi})\]
dengan \(Y = 1\) = kunjungan tinggi, \(Y = 0\) = kunjungan rendah.
Model logistik biner diestimasi menggunakan Maximum Likelihood Estimation (MLE):
\[L(\beta) = \prod_{i=1}^{n} \pi_i^{y_i}(1-\pi_i)^{1-y_i}\]
\[\ell(\beta) = \sum_{i=1}^{n}\left[y_i\log(\pi_i) + (1-y_i)\log(1-\pi_i)\right]\]
safe_div <- function(num, den) ifelse(den == 0, NA_real_, num / den)
stratified_split <- function(y, prop = 0.8) {
idx_by_class <- split(seq_along(y), y)
train_idx <- lapply(
idx_by_class,
function(idx) sample(idx, size = floor(length(idx) * prop))
)
unlist(train_idx, use.names = FALSE)
}
set.seed(42)
train_id <- stratified_split(data_bin$kunjungan_biner, prop = 0.8)
train_bin <- data_bin[train_id, ]
test_bin <- data_bin[-train_id, ]
bind_rows(
train_bin %>% count(kunjungan_biner) %>% mutate(Set = "Training"),
test_bin %>% count(kunjungan_biner) %>% mutate(Set = "Testing")
) %>%
group_by(Set) %>%
mutate(Proporsi = scales::percent(n / sum(n), accuracy = 0.1)) %>%
ungroup() %>%
rename(`Kelas Y` = kunjungan_biner, Jumlah = n) %>%
dplyr::select(Set, `Kelas Y`, Jumlah, Proporsi) %>%
knitr::kable(caption = "Distribusi Kelas pada Training dan Testing")| Set | Kelas Y | Jumlah | Proporsi |
|---|---|---|---|
| Training | 0 | 11 | 73.3% |
| Training | 1 | 4 | 26.7% |
| Testing | 0 | 3 | 60.0% |
| Testing | 1 | 2 | 40.0% |
model_bin <- glm(
kunjungan_biner ~ usia + pendapatan + skor_kes + asuransi,
data = train_bin,
family = binomial(link = "logit")
)
summary(model_bin)##
## Call:
## glm(formula = kunjungan_biner ~ usia + pendapatan + skor_kes +
## asuransi, family = binomial(link = "logit"), data = train_bin)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.823e+03 1.393e+07 0 1
## usia 1.187e+02 4.476e+05 0 1
## pendapatan -5.298e+02 2.344e+06 0 1
## skor_kes 1.913e+01 1.087e+06 0 1
## asuransi 8.857e+01 6.009e+05 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.7397e+01 on 14 degrees of freedom
## Residual deviance: 8.8423e-10 on 10 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
or_bin <- exp(coef(model_bin))
ci_bin <- exp(confint.default(model_bin))
cbind(
OR = round(or_bin, 3),
`CI 2.5%` = round(ci_bin[, 1], 3),
`CI 97.5%` = round(ci_bin[, 2], 3)
) %>%
knitr::kable(caption = "Odds Ratio (OR) dan Confidence Interval 95%")| OR | CI 2.5% | CI 97.5% | |
|---|---|---|---|
| (Intercept) | 0.000000e+00 | 0 | Inf |
| usia | 3.573659e+51 | 0 | Inf |
| pendapatan | 0.000000e+00 | 0 | Inf |
| skor_kes | 2.027883e+08 | 0 | Inf |
| asuransi | 2.909858e+38 | 0 | Inf |
\[OR = e^{\hat{\beta}}\]
| Nilai OR | Interpretasi |
|---|---|
| OR > 1 | Meningkatkan peluang kunjungan tinggi |
| OR < 1 | Menurunkan peluang kunjungan tinggi |
| OR = 1 | Tidak berpengaruh |
Contoh:
\[\text{Akurasi} = \frac{TP+TN}{TP+TN+FP+FN}, \quad \text{Sensitivity} = \frac{TP}{TP+FN}, \quad \text{Specificity} = \frac{TN}{TN+FP}\]
# Fungsi metrik — mengikuti gaya dosen (referensi kredit macet)
classification_metrics <- function(actual, prob, threshold = 0.5) {
pred <- as.integer(prob >= threshold)
tp <- sum(pred == 1 & actual == 1)
tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)
data.frame(
Threshold = threshold,
Akurasi = safe_div(tp + tn, tp + tn + fp + fn),
`Error rate` = 1 - safe_div(tp + tn, tp + tn + fp + fn),
Sensitivity = safe_div(tp, tp + fn),
Specificity = safe_div(tn, tn + fp),
Presisi = safe_div(tp, tp + fp),
NPV = safe_div(tn, tn + fn),
`F1-score` = safe_div(2 * tp, 2 * tp + fp + fn),
`Balanced accuracy`= (safe_div(tp, tp + fn) + safe_div(tn, tn + fp)) / 2,
FPR = 1 - safe_div(tn, tn + fp),
FNR = 1 - safe_div(tp, tp + fn),
check.names = FALSE
)
}
confusion_matrix_tbl <- function(actual, prob, threshold = 0.5) {
pred <- factor(ifelse(prob >= threshold, "Tinggi", "Rendah"),
levels = c("Rendah", "Tinggi"))
act_f <- factor(ifelse(actual == 1, "Tinggi", "Rendah"),
levels = c("Rendah", "Tinggi"))
as.data.frame.matrix(table(Prediksi = pred, Aktual = act_f))
}p_train_b <- predict(model_bin, newdata = train_bin, type = "response")
p_test_b <- predict(model_bin, newdata = test_bin, type = "response")
# Contoh prediksi pada data testing
head(data.frame(
`Y aktual` = test_bin$kunjungan_biner,
`Peluang (tinggi)` = round(p_test_b, 4),
`Prediksi c=0.50` = ifelse(p_test_b >= 0.5, 1, 0),
check.names = FALSE
), 8) %>%
knitr::kable(caption = "Contoh Prediksi Probabilitas pada Data Testing")| Y aktual | Peluang (tinggi) | Prediksi c=0.50 |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
cm_50 <- confusion_matrix_tbl(test_bin$kunjungan_biner, p_test_b, 0.5)
knitr::kable(cm_50, caption = "Confusion Matrix Testing — Threshold 0.50")| Rendah | Tinggi | |
|---|---|---|
| Rendah | 3 | 0 |
| Tinggi | 0 | 2 |
met_50 <- classification_metrics(test_bin$kunjungan_biner, p_test_b, 0.5) %>%
mutate(across(where(is.numeric), ~ round(.x, 3)))
knitr::kable(met_50, caption = "Metrik Evaluasi Testing — Threshold 0.50")| Threshold | Akurasi | Error rate | Sensitivity | Specificity | Presisi | NPV | F1-score | Balanced accuracy | FPR | FNR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
Indeks Youden: \(J = \text{Sensitivity} + \text{Specificity} - 1\)
Threshold optimal \(c^*\) adalah nilai yang memaksimalkan \(J\).
roc_points <- function(actual, prob) {
thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE), -Inf)
out <- lapply(thresholds, function(th) {
pred <- as.integer(prob >= th)
tp <- sum(pred == 1 & actual == 1)
tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)
sens <- safe_div(tp, tp + fn)
spec <- safe_div(tn, tn + fp)
data.frame(threshold = th,
sensitivity = sens,
specificity = spec,
fpr = 1 - spec,
youden = sens + spec - 1)
})
bind_rows(out)
}
auc_value <- function(roc_df) {
roc_sorted <- roc_df %>% arrange(fpr, sensitivity)
sum(diff(roc_sorted$fpr) *
(head(roc_sorted$sensitivity, -1) + tail(roc_sorted$sensitivity, -1)) / 2)
}
roc_train_b <- roc_points(train_bin$kunjungan_biner, p_train_b) %>%
mutate(data = "Training")
roc_test_b <- roc_points(test_bin$kunjungan_biner, p_test_b) %>%
mutate(data = "Testing")
auc_train_b <- auc_value(roc_train_b)
auc_test_b <- auc_value(roc_test_b)
optimal_train_b <- roc_train_b %>%
filter(is.finite(threshold)) %>%
arrange(desc(youden), desc(sensitivity)) %>%
dplyr::slice(1)
threshold_opt_b <- optimal_train_b$threshold[1]
test_at_opt_b <- roc_test_b %>%
filter(is.finite(threshold)) %>%
dplyr::slice_min(abs(threshold - threshold_opt_b), n = 1, with_ties = FALSE) %>%
mutate(data = "Testing pada threshold optimal")
data.frame(
Keterangan = c("Threshold optimal (Youden)", "Sensitivity (training)",
"Specificity (training)", "Indeks Youden (training)",
"AUC training", "AUC testing"),
Nilai = round(c(threshold_opt_b,
optimal_train_b$sensitivity,
optimal_train_b$specificity,
optimal_train_b$youden,
auc_train_b, auc_test_b), 4)
) %>%
knitr::kable(caption = "Threshold Optimal Berdasarkan ROC Training")| Keterangan | Nilai |
|---|---|
| Threshold optimal (Youden) | 1 |
| Sensitivity (training) | 1 |
| Specificity (training) | 1 |
| Indeks Youden (training) | 1 |
| AUC training | 1 |
| AUC testing | 1 |
ggplot(bind_rows(roc_train_b, roc_test_b),
aes(x = fpr, y = sensitivity, color = data)) +
geom_path(linewidth = 1.1) +
geom_abline(intercept = 0, slope = 1,
linetype = "dashed", color = "#6c757d") +
geom_point(data = optimal_train_b,
aes(x = fpr, y = sensitivity),
inherit.aes = FALSE,
color = "#ffb703", fill = "#fb8500",
shape = 21, size = 4, stroke = 1.2) +
geom_point(data = test_at_opt_b,
aes(x = fpr, y = sensitivity),
inherit.aes = FALSE,
color = "#8338ec", fill = "#3a86ff",
shape = 24, size = 4, stroke = 1.2) +
coord_equal() +
scale_color_manual(values = c("Training" = "#0077b6", "Testing" = "#e76f51")) +
labs(
title = "Kurva ROC Model Regresi Logistik Biner",
subtitle = paste0("AUC training = ", round(auc_train_b, 3),
"; AUC testing = ", round(auc_test_b, 3),
"; threshold optimal = ", round(threshold_opt_b, 3)),
x = "False positive rate (1 - specificity)",
y = "Sensitivity / true positive rate",
color = "Data"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")cm_opt_b <- confusion_matrix_tbl(test_bin$kunjungan_biner, p_test_b, threshold_opt_b)
knitr::kable(cm_opt_b,
caption = paste0("Confusion Matrix Testing — Threshold Optimal (",
round(threshold_opt_b, 3), ")"))| Rendah | Tinggi | |
|---|---|---|
| Rendah | 3 | 1 |
| Tinggi | 0 | 1 |
bind_rows(
classification_metrics(test_bin$kunjungan_biner, p_test_b, 0.5) %>%
mutate(Aturan = "Threshold 0.50"),
classification_metrics(test_bin$kunjungan_biner, p_test_b, threshold_opt_b) %>%
mutate(Aturan = paste0("Threshold optimal (", round(threshold_opt_b, 3), ")"))
) %>%
mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
dplyr::select(Aturan, everything()) %>%
knitr::kable(caption = "Perbandingan Metrik Testing: Threshold Default vs Threshold Optimal")| Aturan | Threshold | Akurasi | Error rate | Sensitivity | Specificity | Presisi | NPV | F1-score | Balanced accuracy | FPR | FNR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Threshold 0.50 | 0.5 | 1.0 | 0.0 | 1.0 | 1 | 1 | 1.00 | 1.000 | 1.00 | 0 | 0.0 |
| Threshold optimal (1) | 1.0 | 0.8 | 0.2 | 0.5 | 1 | 1 | 0.75 | 0.667 | 0.75 | 0 | 0.5 |
test_bin %>%
mutate(
peluang = p_test_b,
Status = ifelse(kunjungan_biner == 1, "Tinggi (≥5)", "Rendah (<5)")
) %>%
ggplot(aes(x = peluang, fill = Status)) +
geom_density(alpha = 0.55, color = "white", linewidth = 0.7) +
geom_vline(xintercept = threshold_opt_b,
color = "#fb8500", linewidth = 1.2, linetype = "dashed") +
annotate("label",
x = threshold_opt_b, y = Inf,
label = paste0("threshold = ", round(threshold_opt_b, 3)),
vjust = 1.4, fill = "#fff3b0", color = "#5f370e", label.size = 0) +
scale_fill_manual(values = c("Rendah (<5)" = "#e76f51",
"Tinggi (≥5)" = "#2f7f73")) +
labs(
title = "Distribusi Peluang Prediksi pada Data Testing",
subtitle = "Garis putus-putus oranye = threshold optimal",
x = "Peluang prediksi kunjungan tinggi",
y = "Kepadatan", fill = "Status aktual"
) +
theme_minimal(base_size = 12)# FIX: gunakan data training (bukan seluruh data) agar konsisten dengan model
hl_result <- hoslem.test(
x = train_bin$kunjungan_biner,
y = fitted(model_bin),
g = min(10, floor(nrow(train_bin) / 2))
)
print(hl_result)##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: train_bin$kunjungan_biner, fitted(model_bin)
## X-squared = 2.018e-11, df = 1, p-value = 1
Interpretasi: p-value > 0,05 → model sesuai dengan data (tidak ada lack of fit). p-value ≤ 0,05 → terdapat indikasi lack of fit, model perlu dievaluasi kembali.
data_bin %>%
mutate(prob_pred = predict(model_bin, newdata = data_bin, type = "response")) %>%
ggplot(aes(x = usia, y = prob_pred)) +
geom_point(aes(color = factor(kunjungan_biner)), size = 3) +
geom_smooth(method = "loess", se = FALSE, color = "#243142", linewidth = 1) +
scale_color_manual(values = c("0" = "#e76f51", "1" = "#2f7f73"),
labels = c("0" = "Rendah", "1" = "Tinggi")) +
labs(
title = "Probabilitas Prediksi Kunjungan Tinggi vs Usia",
subtitle = "Model Regresi Logistik Biner",
x = "Usia", y = "Probabilitas prediksi", color = "Aktual"
) +
theme_minimal(base_size = 12)Asumsi utama:
Contoh interpretasi:
Regresi logistik multinomial untuk respons kategorik nominal dengan \(K > 2\) kategori. Dengan kategori ke-\(K\) sebagai referensi, untuk \(k = 1, \ldots, K-1\):
\[\log\left(\frac{P(Y_i=k \mid \mathbf{x}_i)}{P(Y_i=K \mid \mathbf{x}_i)}\right) = \beta_{0k}+\beta_{1k}x_{i1}+\cdots+\beta_{pk}x_{ip}\]
Dalam kasus data transportasi (\(K = 3\), referensi = Bus):
\[\log\left(\frac{P(Y=\text{Kereta})}{P(Y=\text{Bus})}\right) = \beta_{0,K}+\beta_{1,K}\,\text{pendapatan}+\cdots\]
\[\log\left(\frac{P(Y=\text{Mobil})}{P(Y=\text{Bus})}\right) = \beta_{0,M}+\beta_{1,M}\,\text{pendapatan}+\cdots\]
Probabilitas untuk kategori \(k\):
\[P(Y_i=k \mid \mathbf{x}_i) = \frac{\exp(\eta_{ik})}{1+\displaystyle\sum_{h=1}^{K-1}\exp(\eta_{ih})}\]
Log-likelihood:
\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \mathbf{1}(y_i=k)\,\log\,P(Y_i=k \mid \mathbf{x}_i)\]
Relative Risk Ratio (RRR) = exp(β): seberapa besar peluang relatif memilih kategori target dibanding referensi untuk setiap kenaikan satu unit prediktor.
data_multi %>%
count(moda) %>%
mutate(persen = round(n / sum(n) * 100, 1)) %>%
knitr::kable(col.names = c("Moda", "n", "%"),
caption = "Distribusi Frekuensi Moda Transportasi")| Moda | n | % |
|---|---|---|
| Bus | 7 | 35 |
| Kereta | 6 | 30 |
| Mobil | 7 | 35 |
model_multi <- multinom(
moda ~ pendapatan + jarak + usia + gender,
data = data_multi,
trace = FALSE
)
summary(model_multi)## Call:
## multinom(formula = moda ~ pendapatan + jarak + usia + gender,
## data = data_multi, trace = FALSE)
##
## Coefficients:
## (Intercept) pendapatan jarak usia genderP
## Kereta -7.352533 -34.283072 10.03278 -1.164749 9.410764
## Mobil 0.165724 4.178115 10.82538 -8.177491 -15.982850
##
## Std. Errors:
## (Intercept) pendapatan jarak usia genderP
## Kereta 32.00714 357.9181 680.9530 309.4647 280.3828
## Mobil 48.66424 407.0119 291.4715 146.4277 233.0251
##
## Residual Deviance: 0.0001247089
## AIC: 20.00012
z_stat <- summary(model_multi)$coefficients /
summary(model_multi)$standard.errors
p_val <- 2 * (1 - pnorm(abs(z_stat)))
cat("=== z-statistik ===\n"); print(round(z_stat, 3))## === z-statistik ===
## (Intercept) pendapatan jarak usia genderP
## Kereta -0.230 -0.096 0.015 -0.004 0.034
## Mobil 0.003 0.010 0.037 -0.056 -0.069
##
## === p-value ===
## (Intercept) pendapatan jarak usia genderP
## Kereta 0.8183 0.9237 0.9882 0.9970 0.9732
## Mobil 0.9973 0.9918 0.9704 0.9555 0.9453
exp(coef(model_multi)) %>%
round(3) %>%
knitr::kable(caption = "Relative Risk Ratio (RRR = exp(β)) Model Multinomial")| (Intercept) | pendapatan | jarak | usia | genderP | |
|---|---|---|---|---|---|
| Kereta | 0.001 | 0.000 | 22760.54 | 0.312 | 12219.2 |
| Mobil | 1.180 | 65.243 | 50280.68 | 0.000 | 0.0 |
Cara membaca RRR (referensi = Bus):
fitted(model_multi) %>%
round(3) %>%
knitr::kable(caption = "Prediksi Probabilitas per Observasi (Model Multinomial)")| Bus | Kereta | Mobil |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
Asumsi utama:
Proportional odds model untuk respons kategorik ordinal:
\[\text{logit}[P(Y_i \leq j \mid \mathbf{x}_i)] = \alpha_j - \mathbf{x}_i^\top\boldsymbol{\beta}, \quad j = 1, \ldots, J-1\]
Konvensi tanda MASS::polr: koefisien
positif berarti prediktor meningkatkan \(P(Y \leq j)\) → respons cenderung ke
kategori lebih rendah.
\[\ell(\boldsymbol{\alpha},\boldsymbol{\beta}) = \sum_{i=1}^{n}\sum_{j=1}^{J}\mathbf{1}(y_i=j)\log P(Y_i=j \mid \mathbf{x}_i)\]
data_ord %>%
count(kepuasan) %>%
mutate(persen = round(n / sum(n) * 100, 1)) %>%
knitr::kable(col.names = c("Kepuasan", "n", "%"),
caption = "Distribusi Tingkat Kepuasan Pasien")| Kepuasan | n | % |
|---|---|---|
| Tidak Puas | 7 | 1.5 |
| Sebagian Puas | 241 | 53.3 |
| Puas | 204 | 45.1 |
model_ord <- MASS::polr(
kepuasan ~ checkup + waiting + admin + hygiene +
communication + diagnosis + equipment,
data = data_ord,
method = "logistic",
Hess = TRUE
)
summary(model_ord)## Call:
## MASS::polr(formula = kepuasan ~ checkup + waiting + admin + hygiene +
## communication + diagnosis + equipment, data = data_ord, Hess = TRUE,
## method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## checkup 0.03528 0.08588 0.4108
## waiting -0.06990 0.09735 -0.7181
## admin -0.04947 0.09695 -0.5102
## hygiene -0.09708 0.10338 -0.9391
## communication -0.04176 0.09945 -0.4199
## diagnosis 0.33403 0.09106 3.6684
## equipment -0.20987 0.09564 -2.1944
##
## Intercepts:
## Value Std. Error t value
## Tidak Puas|Sebagian Puas -4.4305 0.4603 -9.6245
## Sebagian Puas|Puas 0.0186 0.2690 0.0692
##
## Residual Deviance: 662.8506
## AIC: 680.8506
ctable <- coef(summary(model_ord))
p <- 2 * pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)
cbind(round(ctable, 4), `p-value` = round(p, 4)) %>%
knitr::kable(caption = "Koefisien dan p-value Model Ordinal")| Value | Std. Error | t value | p-value | |
|---|---|---|---|---|
| checkup | 0.0353 | 0.0859 | 0.4108 | 0.6812 |
| waiting | -0.0699 | 0.0974 | -0.7181 | 0.4727 |
| admin | -0.0495 | 0.0970 | -0.5102 | 0.6099 |
| hygiene | -0.0971 | 0.1034 | -0.9391 | 0.3477 |
| communication | -0.0418 | 0.0994 | -0.4199 | 0.6745 |
| diagnosis | 0.3340 | 0.0911 | 3.6684 | 0.0002 |
| equipment | -0.2099 | 0.0956 | -2.1944 | 0.0282 |
| Tidak Puas|Sebagian Puas | -4.4305 | 0.4603 | -9.6245 | 0.0000 |
| Sebagian Puas|Puas | 0.0186 | 0.2690 | 0.0692 | 0.9448 |
exp(coef(model_ord)) %>%
round(3) %>%
knitr::kable(col.names = "Odds Ratio",
caption = "Odds Ratio Model Regresi Logistik Ordinal")| Odds Ratio | |
|---|---|
| checkup | 1.036 |
| waiting | 0.932 |
| admin | 0.952 |
| hygiene | 0.907 |
| communication | 0.959 |
| diagnosis | 1.397 |
| equipment | 0.811 |
Berdasarkan hasil estimasi, variabel Exact Diagnosis dan Modern Equipment berpengaruh signifikan terhadap tingkat kepuasan pasien pada taraf signifikansi 5%.
predict(model_ord, type = "probs") %>%
round(3) %>%
head(10) %>%
knitr::kable(caption = "Prediksi Probabilitas 10 Observasi Pertama (Model Ordinal)")| Tidak Puas | Sebagian Puas | Puas |
|---|---|---|
| 0.013 | 0.516 | 0.471 |
| 0.017 | 0.578 | 0.405 |
| 0.006 | 0.345 | 0.649 |
| 0.006 | 0.324 | 0.670 |
| 0.005 | 0.294 | 0.701 |
| 0.014 | 0.537 | 0.449 |
| 0.017 | 0.581 | 0.402 |
| 0.005 | 0.313 | 0.681 |
| 0.032 | 0.709 | 0.259 |
| 0.016 | 0.562 | 0.422 |
model_ord_clm <- ordinal::clm(
kepuasan ~ checkup + waiting + admin + hygiene +
communication + diagnosis + equipment,
data = data_ord
)
nominal_test(model_ord_clm)## Tests of nominal effects
##
## formula: kepuasan ~ checkup + waiting + admin + hygiene + communication + diagnosis + equipment
## Df logLik AIC LRT Pr(>Chi)
## <none> -331.43 680.85
## checkup 1 -331.07 682.15 0.7029 0.40182
## waiting 1 -330.56 681.12 1.7351 0.18776
## admin 1 -331.41 682.83 0.0220 0.88208
## hygiene 1 -329.72 679.45 3.4040 0.06504 .
## communication 1 -330.90 681.79 1.0566 0.30400
## diagnosis 1 -330.75 681.50 1.3490 0.24546
## equipment 1 -331.35 682.69 0.1578 0.69120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretasi: p-value > 0,05 → asumsi proportional odds tidak ditolak, model valid. Jika dilanggar → pertimbangkan partial proportional odds atau multinomial.
Regresi Poisson untuk data hitung \(Y_i \in \{0,1,2,\ldots\}\). Fungsi penghubung log:
\[\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\]
Asumsi: \(E(Y_i) = \text{Var}(Y_i) = \mu_i\) (equidispersion).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 3.45 5.00 7.00
cat(sprintf("\nMean : %.2f\nVarians : %.2f\n",
mean(data_pois$kunjungan),
var(data_pois$kunjungan)))##
## Mean : 3.45
## Varians : 3.21
model_pois <- glm(
kunjungan ~ usia + pendapatan + skor_kes + asuransi,
data = data_pois,
family = poisson(link = "log")
)
summary(model_pois)##
## Call:
## glm(formula = kunjungan ~ usia + pendapatan + skor_kes + asuransi,
## family = poisson(link = "log"), data = data_pois)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.3713 13.3880 0.327 0.744
## usia 0.2747 0.4000 0.687 0.492
## pendapatan -1.7980 2.5077 -0.717 0.473
## skor_kes -0.6780 1.1882 -0.571 0.568
## asuransi -0.1500 0.5176 -0.290 0.772
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18.36058 on 19 degrees of freedom
## Residual deviance: 0.99382 on 15 degrees of freedom
## AIC: 70.777
##
## Number of Fisher Scoring iterations: 4
cbind(
IRR = round(exp(coef(model_pois)), 3),
`CI 2.5%` = round(exp(confint.default(model_pois)[, 1]), 3),
`CI 97.5%` = round(exp(confint.default(model_pois)[, 2]), 3)
) %>%
knitr::kable(caption = "Incidence Rate Ratio (IRR) dan Confidence Interval 95%")| IRR | CI 2.5% | CI 97.5% | |
|---|---|---|---|
| (Intercept) | 79.143 | 0.000 | 1.969128e+13 |
| usia | 1.316 | 0.601 | 2.883000e+00 |
| pendapatan | 0.166 | 0.001 | 2.257900e+01 |
| skor_kes | 0.508 | 0.049 | 5.211000e+00 |
| asuransi | 0.861 | 0.312 | 2.374000e+00 |
data_pois %>%
mutate(fitted = fitted(model_pois)) %>%
ggplot(aes(x = fitted, y = kunjungan)) +
geom_point(color = "#2f7f73", size = 3, alpha = 0.8) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "#d18b2f", linewidth = 1) +
labs(
title = "Nilai Aktual vs Terprediksi",
subtitle = "Model Regresi Poisson — Kunjungan ke Dokter",
x = "Rata-rata Terprediksi (μ̂)", y = "Kunjungan Aktual (Y)"
) +
theme_minimal(base_size = 13)## Residual deviance : 0.994
## Degrees of freedom: 15
## Rasio dev/df : 0.066
##
## Overdispersion test
##
## data: model_pois
## z = -23.004, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 0.05855324
Rasio dev/df ≈ 1 dan p-value dispersion test > 0,05 → tidak ada overdispersion → model Poisson sesuai. Jika overdispersion → gunakan Negative Binomial.
# Alternatif jika overdispersion terdeteksi
model_nb <- MASS::glm.nb(
kunjungan ~ usia + pendapatan + skor_kes + asuransi,
data = data_pois
)
data.frame(
Model = c("Poisson", "Negative Binomial"),
AIC = round(c(AIC(model_pois), AIC(model_nb)), 3),
BIC = round(c(BIC(model_pois), BIC(model_nb)), 3),
Deviance = round(c(deviance(model_pois), deviance(model_nb)), 3)
) %>%
knitr::kable(caption = "Perbandingan: Poisson vs Negative Binomial")| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Poisson | 70.777 | 75.756 | 0.994 |
| Negative Binomial | 72.778 | 78.752 | 0.994 |
| Aspek | Biner | Multinomial | Ordinal | Poisson |
|---|---|---|---|---|
| Tipe respons | Biner (0/1) | Nominal | Ordinal | Count |
| Distribusi | Binomial | Multinomial | Logistik kumulatif | Poisson |
| Fungsi link | Logit | Logit (baseline) | Cumulative logit | Log |
| Ukuran efek | OR | RRR | OR | IRR |
| Asumsi kritis | Linearitas logit | IIA | Proportional odds | Equidispersion |
| Alternatif | Probit | Nested logit | Partial PO | Negative Binomial |
| Paket R | glm |
multinom |
polr |
glm |
Checklist verifikasi:
polr): tanda koefisien
terbalik — koefisien positif → respons ke kategori
lebih rendah.Koefisien variabel [usia] sebesar [β] dengan OR = exp(β) = [nilai]. Setiap kenaikan satu tahun usia meningkatkan odds seseorang masuk kelompok kunjungan tinggi sebesar [(OR−1)×100%], dengan variabel lain konstan.
Dengan referensi Bus, RRR pendapatan untuk kategori Mobil = [nilai]. Setiap kenaikan Rp 1 juta pendapatan meningkatkan odds relatif memilih Mobil dibanding Bus sebesar [(RRR−1)×100%], variabel lain konstan.
OR variabel [waktu tunggu] = [nilai] > 1. Kenaikan satu menit waktu tunggu meningkatkan cumulative odds \(P(Y \leq j)/P(Y > j)\), sehingga respons cenderung ke kepuasan lebih rendah.
IRR variabel [skor kesehatan] = [nilai] < 1. Setiap kenaikan satu poin skor kesehatan menurunkan rata-rata kunjungan sebesar [(1−IRR)×100%], variabel lain konstan.
Ringkasan panduan: