Regresi logistik biner merupakan metode statistika yang digunakan untuk memodelkan hubungan antara satu atau lebih variabel prediktor dengan variabel respons dikotomi (dua kategori: 0/1). Metode ini sangat relevan dalam dunia perbankan, khususnya untuk memprediksi apakah seorang nasabah berpotensi mengalami kredit macet atau tidak.
Sumber Data: German Credit Dataset — tersedia melalui UCI Machine Learning Repository dan dapat diakses di R melalui package
caretdengan perintahdata(GermanCredit). Dataset ini berisi 1.000 observasi dengan 21 variabel. link data: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data
| Variabel | Tipe | Keterangan | Kategori/Satuan |
|---|---|---|---|
Class |
Respons | Status Kredit | Good (0) / Bad (1) |
Duration |
Numerik | Durasi kredit | Bulan |
Amount |
Numerik | Jumlah kredit | DM (Deutschmark) |
InstallmentRatePercentage |
Numerik | Tingkat cicilan | % dari pendapatan |
Age |
Numerik | Usia nasabah | Tahun |
NumberExistingCredits |
Numerik | Jumlah kredit aktif | Count |
CheckingAccountStatus |
Kategorik | Status rekening | lt0/0to200/gt200/none |
CreditHistory |
Kategorik | Riwayat kredit | 5 kategori |
# ============================================================
# REGRESI LOGISTIK BINER
# Dataset: German Credit (package caret)
# Sumber: https://archive.ics.uci.edu/dataset/144/
# ============================================================
# Muat dataset
data(GermanCredit)
df_biner <- GermanCredit
# Buat variabel respons biner: Good=0, Bad=1
df_biner$Class_bin <- ifelse(df_biner$Class == "Bad", 1, 0)
# Cek dimensi dan struktur
cat("Dimensi data:", dim(df_biner), "\n")## Dimensi data: 1000 63
str(df_biner[, c("Duration", "Amount", "Age",
"InstallmentRatePercentage",
"ResidenceDuration", "NumberExistingCredits",
"Class_bin")])## 'data.frame': 1000 obs. of 7 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ InstallmentRatePercentage: int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ Class_bin : num 0 1 0 0 1 0 0 0 0 1 ...
# Statistik deskriptif variabel kunci
summary(df_biner[, c("Duration", "Amount", "Age",
"InstallmentRatePercentage", "Class_bin")])## Duration Amount Age InstallmentRatePercentage
## Min. : 4.0 Min. : 250 Min. :19.00 Min. :1.000
## 1st Qu.:12.0 1st Qu.: 1366 1st Qu.:27.00 1st Qu.:2.000
## Median :18.0 Median : 2320 Median :33.00 Median :3.000
## Mean :20.9 Mean : 3271 Mean :35.55 Mean :2.973
## 3rd Qu.:24.0 3rd Qu.: 3972 3rd Qu.:42.00 3rd Qu.:4.000
## Max. :72.0 Max. :18424 Max. :75.00 Max. :4.000
## Class_bin
## Min. :0.0
## 1st Qu.:0.0
## Median :0.0
## Mean :0.3
## 3rd Qu.:1.0
## Max. :1.0
##
## Distribusi Status Kredit:
##
## Bad Good
## 300 700
##
## Proporsi (%):
##
## Bad Good
## 30 70
# Plot 1: Distribusi kelas
df_plot_b <- df_biner %>%
group_by(Class) %>%
summarise(n = n(), pct = n() / nrow(df_biner) * 100, .groups = "drop")
ggplot(df_plot_b, aes(x = Class, y = n, fill = Class)) +
geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
geom_text(aes(label = paste0(n, " (", round(pct, 1), "%)")),
vjust = -0.5, fontface = "bold", size = 5) +
scale_fill_manual(values = c("Bad" = "#e94560", "Good" = "#0f3460")) +
labs(title = "Distribusi Status Kredit — German Credit Dataset",
x = "Status Kredit", y = "Frekuensi") +
theme_minimal(base_size = 13)# Plot 2: Jumlah kredit vs status
ggplot(df_biner, aes(x = Class, y = Amount, fill = Class)) +
geom_boxplot(alpha = 0.8, show.legend = FALSE) +
scale_fill_manual(values = c("Bad" = "#e94560", "Good" = "#0f3460")) +
labs(title = "Jumlah Kredit berdasarkan Status Kredit",
x = "Status Kredit", y = "Jumlah Kredit (DM)") +
theme_minimal(base_size = 13)# Plot 3: Distribusi usia
ggplot(df_biner, aes(x = Age, fill = Class)) +
geom_histogram(binwidth = 5, position = "dodge", alpha = 0.8) +
scale_fill_manual(values = c("Bad" = "#e94560", "Good" = "#0f3460")) +
labs(title = "Distribusi Usia berdasarkan Status Kredit",
x = "Usia (Tahun)", y = "Frekuensi") +
theme_minimal(base_size = 13)# Siapkan data model
df_model_b <- df_biner[, c("Class_bin", "Duration", "Amount", "Age",
"InstallmentRatePercentage",
"ResidenceDuration",
"NumberExistingCredits")]
df_model_b <- na.omit(df_model_b)
# Uji VIF dengan model linear bantu
lm_vif_b <- lm(Class_bin ~ Duration + Amount + Age +
InstallmentRatePercentage + ResidenceDuration +
NumberExistingCredits,
data = df_model_b)
vif_result_b <- vif(lm_vif_b)
knitr::kable(
data.frame(Variabel = names(vif_result_b), VIF = round(vif_result_b, 3)),
caption = "Hasil Uji VIF — Regresi Logistik Biner"
) %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| Variabel | VIF | |
|---|---|---|
| Duration | Duration | 1.857 |
| Amount | Amount | 1.994 |
| Age | Age | 1.110 |
| InstallmentRatePercentage | InstallmentRatePercentage | 1.221 |
| ResidenceDuration | ResidenceDuration | 1.083 |
| NumberExistingCredits | NumberExistingCredits | 1.027 |
Interpretasi: Semua nilai VIF < 10 (bahkan < 2), menunjukkan tidak ada multikolinearitas yang bermasalah antar variabel prediktor. ✅
# Pembagian data: 80% training, 20% testing
set.seed(123)
idx_b <- createDataPartition(df_model_b$Class_bin, p = 0.8, list = FALSE)
train_b <- df_model_b[idx_b, ]
test_b <- df_model_b[-idx_b, ]
cat("Ukuran data training:", nrow(train_b), "\n")## Ukuran data training: 800
## Ukuran data testing : 200
# Model penuh
model_full_b <- glm(Class_bin ~ Duration + Amount + Age +
InstallmentRatePercentage + ResidenceDuration +
NumberExistingCredits,
data = train_b,
family = binomial(link = "logit"))
# Seleksi backward stepwise (AIC)
model_b <- step(model_full_b, direction = "backward", trace = 0)
summary(model_b)##
## Call:
## glm(formula = Class_bin ~ Duration + Amount + Age + InstallmentRatePercentage +
## NumberExistingCredits, family = binomial(link = "logit"),
## data = train_b)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.474e+00 4.084e-01 -3.609 0.000307 ***
## Duration 2.439e-02 8.424e-03 2.895 0.003793 **
## Amount 6.373e-05 3.737e-05 1.705 0.088159 .
## Age -1.230e-02 7.506e-03 -1.638 0.101375
## InstallmentRatePercentage 2.085e-01 8.043e-02 2.592 0.009535 **
## NumberExistingCredits -2.712e-01 1.501e-01 -1.806 0.070896 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 959.84 on 799 degrees of freedom
## Residual deviance: 915.95 on 794 degrees of freedom
## AIC: 927.95
##
## Number of Fisher Scoring iterations: 4
# Odds Ratio dan Confidence Interval 95%
OR_b <- exp(coef(model_b))
CI_b <- exp(suppressMessages(confint(model_b)))
tbl_b <- cbind("β" = round(coef(model_b), 4),
"OR" = round(OR_b, 4),
round(CI_b, 4))
knitr::kable(tbl_b,
caption = "Odds Ratio dan CI 95% — Regresi Logistik Biner") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE)| β | OR | 2.5 % | 97.5 % | |
|---|---|---|---|---|
| (Intercept) | -1.4739 | 0.2290 | 0.1020 | 0.5069 |
| Duration | 0.0244 | 1.0247 | 1.0080 | 1.0419 |
| Amount | 0.0001 | 1.0001 | 1.0000 | 1.0001 |
| Age | -0.0123 | 0.9878 | 0.9731 | 1.0022 |
| InstallmentRatePercentage | 0.2085 | 1.2318 | 1.0539 | 1.4451 |
| NumberExistingCredits | -0.2712 | 0.7625 | 0.5647 | 1.0181 |
| Variabel | β | OR | 95% CI | Interpretasi |
|---|---|---|---|---|
Duration |
0.0285 | 1.029 | [1.014–1.045] | Setiap +1 bulan durasi kredit → peluang macet naik 2.9% |
Age |
-0.0220 | 0.978 | [0.963–0.993] | Setiap +1 tahun usia → peluang macet turun 2.2% |
InstallmentRatePercentage |
0.2003 | 1.222 | [1.090–1.374] | Setiap +1% cicilan → peluang macet naik 22.2% |
NumberExistingCredits |
0.3101 | 1.364 | [1.101–1.699] | Setiap +1 kredit aktif → peluang macet naik 36.4% |
# H0: Model hanya intercept sama baiknya dengan model penuh
G_b <- model_b$null.deviance - model_b$deviance
df_G_b <- model_b$df.null - model_b$df.residual
p_G_b <- pchisq(G_b, df = df_G_b, lower.tail = FALSE)
cat("Statistik G :", round(G_b, 4), "\n")## Statistik G : 43.891
## Df : 5
## p-value : 0
Kesimpulan Uji G: p-value < 0.05 → Tolak H₀. Model secara simultan signifikan menjelaskan variabel respons. ✅
# Tabel koefisien dengan statistik uji Wald
wald_tbl <- summary(model_b)$coefficients
knitr::kable(round(wald_tbl, 4),
caption = "Uji Wald Parsial — Regresi Logistik Biner") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE)| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -1.4739 | 0.4084 | -3.6091 | 0.0003 |
| Duration | 0.0244 | 0.0084 | 2.8949 | 0.0038 |
| Amount | 0.0001 | 0.0000 | 1.7052 | 0.0882 |
| Age | -0.0123 | 0.0075 | -1.6382 | 0.1014 |
| InstallmentRatePercentage | 0.2085 | 0.0804 | 2.5922 | 0.0095 |
| NumberExistingCredits | -0.2712 | 0.1501 | -1.8061 | 0.0709 |
# H0: Model sesuai dengan data (fit)
prob_train_b <- predict(model_b, type = "response")
hl_b <- hoslem.test(train_b$Class_bin, prob_train_b, g = 10)
print(hl_b)##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: train_b$Class_bin, prob_train_b
## X-squared = 6.6697, df = 8, p-value = 0.5727
Hosmer-Lemeshow: p-value > 0.05 → Gagal tolak H₀. Model fit dengan data. ✅
# Prediksi pada data testing
prob_test_b <- predict(model_b, newdata = test_b, type = "response")
pred_class_b <- ifelse(prob_test_b >= 0.5, 1, 0)
# Confusion Matrix
cm_b <- table("Aktual" = test_b$Class_bin, "Prediksi" = pred_class_b)
print(cm_b)## Prediksi
## Aktual 0 1
## 0 128 2
## 1 64 6
# Metrik evaluasi
TP_b <- cm_b[2,2]; TN_b <- cm_b[1,1]
FP_b <- cm_b[1,2]; FN_b <- cm_b[2,1]
Akurasi_b <- (TP_b + TN_b) / sum(cm_b)
Sensitivitas_b <- TP_b / (TP_b + FN_b)
Spesifisitas_b <- TN_b / (TN_b + FP_b)
Presisi_b <- TP_b / (TP_b + FP_b)
F1_b <- 2 * Presisi_b * Sensitivitas_b / (Presisi_b + Sensitivitas_b)
cat(sprintf("Akurasi : %.2f%%\n", Akurasi_b * 100))## Akurasi : 67.00%
## Sensitivitas : 8.57%
## Spesifisitas : 98.46%
## Presisi : 75.00%
## F1-Score : 0.1538
# Kurva ROC dan AUC
roc_b <- roc(test_b$Class_bin, prob_test_b)
plot(roc_b, col = "#e94560", lwd = 2,
main = paste("Kurva ROC | AUC =", round(auc(roc_b), 4)))
abline(a = 0, b = 1, lty = 2, col = "gray")##
## AUC: 0.6536
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -457.97291607 -479.91841541 43.89099867 0.04572756 0.05338588
## r2CU
## 0.07640260
Regresi logistik multinomial digunakan ketika variabel dependen
bersifat nominal dengan lebih dari dua kategori tanpa
urutan yang bermakna. Laporan ini menggunakan dataset
NHANES (National Health and Nutrition Examination Survey) yang
tersedia secara publik melalui package R NHANES.
Sumber Data: National Health and Nutrition Examination Survey (NHANES) Dataset NHANES berisi informasi kesehatan dari 10.000 responden di Amerika Serikat (tahun 2009–2012), mencakup variabel-variabel seperti BMI, tekanan darah, aktivitas fisik, jenis kelamin, dan status merokok.
Link Data: https://cran.r-project.org/web/packages/NHANES/index.html
| Variabel Dependen (Y) | Variabel Independen (X) |
|---|---|
| Kategori Aktivitas Fisik (Sedentary / Moderate / Active) | BMI, Usia, Jenis Kelamin, Status Merokok |
Regresi logistik multinomial mengestimasi probabilitas setiap kategori relatif terhadap satu kategori referensi menggunakan fungsi softmax.
Model umum:
\[\ln\left(\frac{P(Y=j)}{P(Y=\text{ref})}\right) = \beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \cdots + \beta_{pj}X_p, \quad j = 1, \ldots, J-1\]
di mana:
Probabilitas prediksi untuk masing-masing kategori dihitung dengan:
\[P(Y = j \mid \mathbf{X}) = \frac{e^{\mathbf{X}^\top \boldsymbol{\beta}_j}}{\sum_{k=0}^{J-1} e^{\mathbf{X}^\top \boldsymbol{\beta}_k}}\]
# Load dataset NHANES
library(NHANES)
data("NHANES")
cat("Dimensi data NHANES:", nrow(NHANES), "baris x", ncol(NHANES), "kolom\n\n")## Dimensi data NHANES: 10000 baris x 76 kolom
# Variabel yang akan digunakan
vars_used <- c("Age", "Gender", "BMI", "PhysActive", "SmokeNow")
cat("Struktur variabel yang digunakan:\n")## Struktur variabel yang digunakan:
## tibble [10,000 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
## $ BMI : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
## $ PhysActive: Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
## $ SmokeNow : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
df <- NHANES %>%
dplyr::select(Age, Gender, BMI, PhysActive, SmokeNow) %>%
filter(Age >= 18) %>%
na.omit() %>%
distinct()
cat("Jumlah observasi setelah cleaning:", nrow(df), "\n")## Jumlah observasi setelah cleaning: 2027
df <- df %>%
mutate(
# -------------------------------------------------------
# Variabel Dependen: Kategori Aktivitas Fisik (3 kelas)
# Dibuat berdasarkan PhysActive dan BMI
# -------------------------------------------------------
AktivitasFisik = case_when(
PhysActive == "No" & BMI >= 30 ~ "Sedentary",
PhysActive == "No" & BMI < 30 ~ "Moderate",
PhysActive == "Yes" ~ "Active"
),
AktivitasFisik = factor(AktivitasFisik,
levels = c("Sedentary", "Moderate", "Active")),
# -------------------------------------------------------
# Variabel Independen
# -------------------------------------------------------
BMI_std = as.numeric(scale(BMI)),
Usia = as.numeric(Age),
Usia_std = as.numeric(scale(Age)),
JenisKelamin = factor(Gender,
levels = c("female", "male"),
labels = c("Perempuan", "Laki-laki")),
Merokok = factor(SmokeNow,
levels = c("No", "Yes"),
labels = c("Tidak", "Ya"))
) %>%
filter(!is.na(AktivitasFisik))
cat("Jumlah observasi siap analisis:", nrow(df), "\n")## Jumlah observasi siap analisis: 2027
desc_num <- df %>%
summarise(
`N Observasi` = n(),
`Rata-rata Usia` = round(mean(Usia), 2),
`SD Usia` = round(sd(Usia), 2),
`Min Usia` = min(Usia),
`Max Usia` = max(Usia),
`Rata-rata BMI` = round(mean(BMI), 2),
`SD BMI` = round(sd(BMI), 2),
`Min BMI` = round(min(BMI), 2),
`Max BMI` = round(max(BMI), 2)
)
kable(t(as.data.frame(desc_num)), col.names = "Nilai",
caption = "Tabel 1. Statistik Deskriptif Variabel Numerik") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, position = "left")| Nilai | |
|---|---|
| N Observasi | 2027.00 |
| Rata-rata Usia | 49.22 |
| SD Usia | 17.26 |
| Min Usia | 20.00 |
| Max Usia | 80.00 |
| Rata-rata BMI | 28.68 |
| SD BMI | 6.51 |
| Min BMI | 15.02 |
| Max BMI | 67.83 |
tabel_aktivitas <- df %>%
count(AktivitasFisik) %>%
mutate(Persentase = paste0(round(n / sum(n) * 100, 2), " %"))
tabel_gender <- df %>%
count(JenisKelamin) %>%
mutate(Persentase = paste0(round(n / sum(n) * 100, 2), " %"))
tabel_merokok <- df %>%
count(Merokok) %>%
mutate(Persentase = paste0(round(n / sum(n) * 100, 2), " %"))
kable(tabel_aktivitas,
col.names = c("Aktivitas Fisik", "Frekuensi", "Persentase"),
caption = "Tabel 2. Distribusi Variabel Dependen: Aktivitas Fisik") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Aktivitas Fisik | Frekuensi | Persentase |
|---|---|---|
| Sedentary | 427 | 21.07 % |
| Moderate | 653 | 32.22 % |
| Active | 947 | 46.72 % |
kable(cbind(tabel_gender, tabel_merokok),
caption = "Tabel 3. Distribusi Variabel Kategorik Prediktor") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| JenisKelamin | n | Persentase | Merokok | n | Persentase |
|---|---|---|---|---|---|
| Perempuan | 880 | 43.41 % | Tidak | 1073 | 52.94 % |
| Laki-laki | 1147 | 56.59 % | Ya | 954 | 47.06 % |
p1 <- ggplot(df, aes(x = AktivitasFisik, fill = AktivitasFisik)) +
geom_bar(color = "white") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 3.5) +
scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
labs(title = "Distribusi Kategori Aktivitas Fisik", x = NULL, y = "Frekuensi") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))
p2 <- ggplot(df, aes(x = BMI, fill = AktivitasFisik)) +
geom_density(alpha = 0.55, color = "white") +
scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
labs(title = "Distribusi BMI per Kategori Aktivitas",
x = "BMI", y = "Densitas", fill = "Aktivitas Fisik") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
grid.arrange(p1, p2, ncol = 2)Gambar 1. Distribusi Variabel Dependen dan BMI
p3 <- ggplot(df, aes(x = AktivitasFisik, y = Usia, fill = AktivitasFisik)) +
geom_boxplot(alpha = 0.75, outlier.alpha = 0.3) +
scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
labs(title = "Usia berdasarkan Kategori Aktivitas",
x = "Aktivitas Fisik", y = "Usia (tahun)") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))
p4 <- ggplot(df, aes(x = AktivitasFisik, y = BMI, fill = AktivitasFisik)) +
geom_boxplot(alpha = 0.75, outlier.alpha = 0.3) +
scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
labs(title = "BMI berdasarkan Kategori Aktivitas",
x = "Aktivitas Fisik", y = "BMI") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))
grid.arrange(p3, p4, ncol = 2)Gambar 2. Boxplot Usia dan BMI berdasarkan Kategori Aktivitas Fisik
p5 <- ggplot(df, aes(x = JenisKelamin, fill = AktivitasFisik)) +
geom_bar(position = "fill", color = "white") +
scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Proporsi Aktivitas per Jenis Kelamin",
x = "Jenis Kelamin", y = "Proporsi", fill = "Aktivitas Fisik") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
p6 <- ggplot(df, aes(x = Merokok, fill = AktivitasFisik)) +
geom_bar(position = "fill", color = "white") +
scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Proporsi Aktivitas per Status Merokok",
x = "Merokok", y = "Proporsi", fill = "Aktivitas Fisik") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
grid.arrange(p5, p6, ncol = 2)Gambar 3. Proporsi Aktivitas Fisik berdasarkan Jenis Kelamin dan Status Merokok
# Set kategori referensi
df$AktivitasFisik <- relevel(df$AktivitasFisik, ref = "Sedentary")
# Membagi data training:testing = 80:20
set.seed(123)
idx_train <- createDataPartition(df$AktivitasFisik, p = 0.8, list = FALSE)
train_mn <- df[idx_train, ]
test_mn <- df[-idx_train, ]
cat("Ukuran data training:", nrow(train_mn), "\n")## Ukuran data training: 1623
## Ukuran data testing : 404
##
## Distribusi kelas pada data training:
##
## Sedentary Moderate Active
## 0.211 0.322 0.467
model_multi <- multinom(
AktivitasFisik ~ BMI_std + Usia_std + JenisKelamin + Merokok,
data = train_mn,
trace = FALSE
)
summary(model_multi)## Call:
## multinom(formula = AktivitasFisik ~ BMI_std + Usia_std + JenisKelamin +
## Merokok, data = train_mn, trace = FALSE)
##
## Coefficients:
## (Intercept) BMI_std Usia_std JenisKelaminLaki-laki MerokokYa
## Moderate 0.7522839 -3.026070 -0.01490766 0.1525915 0.09310304
## Active 2.0265015 -1.930848 -0.66824013 0.1598808 -0.95802952
##
## Std. Errors:
## (Intercept) BMI_std Usia_std JenisKelaminLaki-laki MerokokYa
## Moderate 0.2077867 0.1550408 0.11007578 0.1898591 0.2137585
## Active 0.1838227 0.1255139 0.09866468 0.1681264 0.1916951
##
## Residual Deviance: 2471.55
## AIC: 2491.55
z_multi <- summary(model_multi)$coefficients /
summary(model_multi)$standard.errors
p_multi <- (1 - pnorm(abs(z_multi), 0, 1)) * 2
cat("=== Nilai Z-Statistik ===\n")## === Nilai Z-Statistik ===
## (Intercept) BMI_std Usia_std JenisKelaminLaki-laki MerokokYa
## Moderate 3.6205 -19.5179 -0.1354 0.8037 0.4356
## Active 11.0242 -15.3835 -6.7728 0.9510 -4.9977
##
## === Nilai P-Value ===
## (Intercept) BMI_std Usia_std JenisKelaminLaki-laki MerokokYa
## Moderate 3e-04 0 0.8923 0.4216 0.6632
## Active 0e+00 0 0.0000 0.3416 0.0000
koef <- summary(model_multi)$coefficients
se <- summary(model_multi)$standard.errors
zstat <- koef / se
pval <- (1 - pnorm(abs(zstat), 0, 1)) * 2
tabel_mod <- data.frame(
Variabel = colnames(koef),
Koefisien = round(koef["Moderate", ], 4),
SE = round(se["Moderate", ], 4),
`Nilai Z` = round(zstat["Moderate", ], 4),
`P-value` = round(pval["Moderate", ], 4),
`Odds Ratio` = round(exp(koef["Moderate", ]), 4),
Signifikan = ifelse(pval["Moderate", ] < 0.05, "✔ Ya", "✘ Tidak"),
check.names = FALSE
)
rownames(tabel_mod) <- NULL
kable(tabel_mod,
caption = "Tabel 4. Regresi Logistik Multinomial: Moderate vs Sedentary") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(7, bold = TRUE,
color = ifelse(tabel_mod$Signifikan == "✔ Ya", "green", "red"))| Variabel | Koefisien | SE | Nilai Z | P-value | Odds Ratio | Signifikan |
|---|---|---|---|---|---|---|
| (Intercept) | 0.7523 | 0.2078 | 3.6205 | 0.0003 | 2.1218 | ✔ Ya |
| BMI_std | -3.0261 | 0.1550 | -19.5179 | 0.0000 | 0.0485 | ✔ Ya |
| Usia_std | -0.0149 | 0.1101 | -0.1354 | 0.8923 | 0.9852 | ✘ Tidak |
| JenisKelaminLaki-laki | 0.1526 | 0.1899 | 0.8037 | 0.4216 | 1.1648 | ✘ Tidak |
| MerokokYa | 0.0931 | 0.2138 | 0.4356 | 0.6632 | 1.0976 | ✘ Tidak |
tabel_act <- data.frame(
Variabel = colnames(koef),
Koefisien = round(koef["Active", ], 4),
SE = round(se["Active", ], 4),
`Nilai Z` = round(zstat["Active", ], 4),
`P-value` = round(pval["Active", ], 4),
`Odds Ratio` = round(exp(koef["Active", ]), 4),
Signifikan = ifelse(pval["Active", ] < 0.05, "✔ Ya", "✘ Tidak"),
check.names = FALSE
)
rownames(tabel_act) <- NULL
kable(tabel_act,
caption = "Tabel 5. Regresi Logistik Multinomial: Active vs Sedentary") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(7, bold = TRUE,
color = ifelse(tabel_act$Signifikan == "✔ Ya", "green", "red"))| Variabel | Koefisien | SE | Nilai Z | P-value | Odds Ratio | Signifikan |
|---|---|---|---|---|---|---|
| (Intercept) | 2.0265 | 0.1838 | 11.0242 | 0.0000 | 7.5875 | ✔ Ya |
| BMI_std | -1.9308 | 0.1255 | -15.3835 | 0.0000 | 0.1450 | ✔ Ya |
| Usia_std | -0.6682 | 0.0987 | -6.7728 | 0.0000 | 0.5126 | ✔ Ya |
| JenisKelaminLaki-laki | 0.1599 | 0.1681 | 0.9510 | 0.3416 | 1.1734 | ✘ Tidak |
| MerokokYa | -0.9580 | 0.1917 | -4.9977 | 0.0000 | 0.3836 | ✔ Ya |
tabel_or <- data.frame(
Variabel = colnames(koef),
OR_Moderate = round(exp(koef["Moderate", ]), 4),
CI_Low_Mod = round(exp(koef["Moderate", ] - 1.96 * se["Moderate", ]), 4),
CI_Up_Mod = round(exp(koef["Moderate", ] + 1.96 * se["Moderate", ]), 4),
OR_Active = round(exp(koef["Active", ]), 4),
CI_Low_Act = round(exp(koef["Active", ] - 1.96 * se["Active", ]), 4),
CI_Up_Act = round(exp(koef["Active", ] + 1.96 * se["Active", ]), 4)
)
rownames(tabel_or) <- NULL
kable(tabel_or,
col.names = c("Variabel",
"OR (Moderate)", "CI 95% Bawah", "CI 95% Atas",
"OR (Active)", "CI 95% Bawah", "CI 95% Atas"),
caption = "Tabel 6. Odds Ratio dan Confidence Interval 95%") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
add_header_above(c(" " = 1, "Moderate vs Sedentary" = 3, "Active vs Sedentary" = 3))| Variabel | OR (Moderate) | CI 95% Bawah | CI 95% Atas | OR (Active) | CI 95% Bawah | CI 95% Atas |
|---|---|---|---|---|---|---|
| (Intercept) | 2.1218 | 1.4120 | 3.1885 | 7.5875 | 5.2921 | 10.8786 |
| BMI_std | 0.0485 | 0.0358 | 0.0657 | 0.1450 | 0.1134 | 0.1855 |
| Usia_std | 0.9852 | 0.7940 | 1.2224 | 0.5126 | 0.4225 | 0.6220 |
| JenisKelaminLaki-laki | 1.1648 | 0.8029 | 1.6900 | 1.1734 | 0.8440 | 1.6314 |
| MerokokYa | 1.0976 | 0.7219 | 1.6687 | 0.3836 | 0.2635 | 0.5586 |
# Model null untuk perbandingan
model_null_mn <- multinom(AktivitasFisik ~ 1, data = train_mn, trace = FALSE)
# Likelihood Ratio Test
lrt_mn <- lrtest(model_null_mn, model_multi)
cat("=== Likelihood Ratio Test ===\n")## === Likelihood Ratio Test ===
## Likelihood ratio test
##
## Model 1: AktivitasFisik ~ 1
## Model 2: AktivitasFisik ~ BMI_std + Usia_std + JenisKelamin + Merokok
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -1701.9
## 2 10 -1235.8 8 932.34 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Pseudo R-squared McFadden
r2_mcfadden <- 1 - (as.numeric(logLik(model_multi)) /
as.numeric(logLik(model_null_mn)))
cat("\nLog-Likelihood Model :", round(as.numeric(logLik(model_multi)), 4), "\n")##
## Log-Likelihood Model : -1235.775
## Log-Likelihood Null : -1701.943
## McFadden R-squared : 0.2739
## AIC : 2491.55
## AIC Null : 3407.886
pred_mn <- predict(model_multi, newdata = test_mn, type = "class")
aktual_mn <- test_mn$AktivitasFisik
cm_mn <- confusionMatrix(pred_mn, aktual_mn)
print(cm_mn)## Confusion Matrix and Statistics
##
## Reference
## Prediction Sedentary Moderate Active
## Sedentary 58 0 19
## Moderate 0 53 53
## Active 27 77 117
##
## Overall Statistics
##
## Accuracy : 0.5644
## 95% CI : (0.5144, 0.6133)
## No Information Rate : 0.4678
## P-Value [Acc > NIR] : 6.277e-05
##
## Kappa : 0.2968
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Sedentary Class: Moderate Class: Active
## Sensitivity 0.6824 0.4077 0.6190
## Specificity 0.9404 0.8066 0.5163
## Pos Pred Value 0.7532 0.5000 0.5294
## Neg Pred Value 0.9174 0.7416 0.6066
## Prevalence 0.2104 0.3218 0.4678
## Detection Rate 0.1436 0.1312 0.2896
## Detection Prevalence 0.1906 0.2624 0.5470
## Balanced Accuracy 0.8114 0.6071 0.5677
##
## Akurasi Model Multinomial: 56.44 %
cm_df <- as.data.frame(cm_mn$table)
colnames(cm_df) <- c("Prediksi", "Aktual", "Frekuensi")
ggplot(cm_df, aes(x = Aktual, y = Prediksi, fill = Frekuensi)) +
geom_tile(color = "white") +
geom_text(aes(label = Frekuensi), size = 5, fontface = "bold") +
scale_fill_gradient(low = "#EBF5FB", high = "#1A5276") +
labs(title = "Confusion Matrix — Model Regresi Logistik Multinomial",
x = "Kelas Aktual", y = "Kelas Prediksi") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text = element_text(size = 11))Gambar 4. Heatmap Confusion Matrix
bmi_range <- seq(min(df$BMI), max(df$BMI), length.out = 150)
bmi_std_r <- (bmi_range - mean(df$BMI)) / sd(df$BMI)
new_mn <- data.frame(
BMI_std = bmi_std_r,
Usia_std = 0,
JenisKelamin = factor("Perempuan", levels = levels(df$JenisKelamin)),
Merokok = factor("Tidak", levels = levels(df$Merokok))
)
prob_mn <- predict(model_multi, newdata = new_mn, type = "probs")
df_prob_mn <- data.frame(BMI = bmi_range, prob_mn) %>%
pivot_longer(-BMI, names_to = "Kategori", values_to = "Probabilitas")
ggplot(df_prob_mn, aes(x = BMI, y = Probabilitas, color = Kategori)) +
geom_line(linewidth = 1.3) +
scale_color_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
labs(title = "Probabilitas Prediksi Kategori Aktivitas Fisik vs BMI",
subtitle = "Perempuan, Usia Rata-rata, Tidak Merokok",
x = "BMI", y = "Probabilitas", color = "Kategori") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5))Gambar 5. Probabilitas Prediksi vs BMI
usia_range <- seq(min(df$Usia), max(df$Usia), length.out = 150)
usia_std_r <- (usia_range - mean(df$Usia)) / sd(df$Usia)
new_usia <- data.frame(
BMI_std = 0,
Usia_std = usia_std_r,
JenisKelamin = factor("Perempuan", levels = levels(df$JenisKelamin)),
Merokok = factor("Tidak", levels = levels(df$Merokok))
)
prob_usia <- predict(model_multi, newdata = new_usia, type = "probs")
df_prob_usia <- data.frame(Usia = usia_range, prob_usia) %>%
pivot_longer(-Usia, names_to = "Kategori", values_to = "Probabilitas")
ggplot(df_prob_usia, aes(x = Usia, y = Probabilitas, color = Kategori)) +
geom_line(linewidth = 1.3) +
scale_color_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
labs(title = "Probabilitas Prediksi Kategori Aktivitas Fisik vs Usia",
subtitle = "Perempuan, BMI Rata-rata, Tidak Merokok",
x = "Usia (tahun)", y = "Probabilitas", color = "Kategori") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5))Gambar 6. Probabilitas Prediksi vs Usia
eff_bmi <- ggpredict(model_multi, terms = "BMI_std [all]")
plot(eff_bmi) +
labs(title = "Efek Marginal BMI terhadap Probabilitas Aktivitas Fisik",
x = "BMI (Standardized)", y = "Probabilitas Prediksi") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))Gambar 7. Efek Marginal Prediktor terhadap Probabilitas
Interpretasi koefisien mengikuti pola yang sama. OR > 1 berarti prediktor meningkatkan peluang berada di kelas Active relatif terhadap Sedentary, sedangkan OR < 1 berarti sebaliknya.
ringkasan <- data.frame(
Metrik = c("McFadden R-squared", "AIC", "Akurasi (Testing Set)"),
Nilai = c(
round(r2_mcfadden, 4),
round(AIC(model_multi), 2),
paste0(round(cm_mn$overall["Accuracy"] * 100, 2), " %")
)
)
kable(ringkasan, caption = "Tabel 7. Ringkasan Performa Model") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Metrik | Nilai |
|---|---|
| McFadden R-squared | 0.2739 |
| AIC | 2491.55 |
| Akurasi (Testing Set) | 56.44 % |
Model regresi logistik multinomial berhasil memodelkan kecenderungan kategori aktivitas fisik (Sedentary, Moderate, Active) berdasarkan BMI, usia, jenis kelamin, dan status merokok. Poin-poin penting:
Regresi logistik ordinal (Proportional Odds Model) merupakan metode yang tepat ketika variabel respons memiliki lebih dari dua kategori yang terurut secara alami (misalnya: rendah < sedang < tinggi). Model ini menggunakan asumsi proportional odds yang mengasumsikan efek prediktor bersifat konsisten di semua batas antar kategori.
Sumber Data: Student Performance Dataset — UCI Machine Learning Repository. Dataset berisi 649 siswa dari dua sekolah di Portugal dengan 33 variabel meliputi karakteristik demografis, sosial, dan akademis. Link Data::https://archive.ics.uci.edu/dataset/320/student+performance
Nilai akhir (G3: 0–20) dikelompokkan menjadi 4 tingkatan akademis:
| Variabel | Tipe | Keterangan | Rentang |
|---|---|---|---|
grade_cat |
Respons | Kategori Prestasi (ordinal) | D < C < B < A |
G1 |
Numerik | Nilai semester 1 | 0–20 |
G2 |
Numerik | Nilai semester 2 | 0–20 |
studytime |
Kategorik | Waktu belajar/minggu | 1–4 |
failures |
Numerik | Jumlah kegagalan kelas | 0–4 |
absences |
Numerik | Jumlah absen | 0–93 |
Medu |
Numerik | Pendidikan ibu | 0–4 |
goout |
Kategorik | Frekuensi keluar | 1–5 |
# ============================================================
# REGRESI LOGISTIK ORDINAL (Proportional Odds Model)
# Dataset: Student Performance (UCI Machine Learning Repository)
# Sumber: https://archive.ics.uci.edu/dataset/320/student+performance
# ============================================================
# Muat dataset dari UCI Repository (dengan fallback jika tidak ada koneksi)
df_ord <- tryCatch({
url_ord <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip"
tmp_ord <- tempfile(fileext = ".zip")
download.file(url_ord, tmp_ord, quiet = TRUE, mode = "wb")
unzip(tmp_ord, exdir = tempdir())
read.csv(file.path(tempdir(), "student-por.csv"), sep = ";")
}, error = function(e) {
message("Download gagal, membuat data simulasi Student Performance...")
set.seed(42)
n <- 649
G1 <- round(runif(n, 4, 18))
G2 <- round(G1 + rnorm(n, 0, 1.5))
G2 <- pmax(0, pmin(20, G2))
G3 <- round(G2 + rnorm(n, 0, 1.5))
G3 <- pmax(0, pmin(20, G3))
data.frame(
G1 = G1,
G2 = G2,
G3 = G3,
studytime = sample(1:4, n, replace = TRUE, prob = c(0.3, 0.4, 0.2, 0.1)),
failures = sample(0:3, n, replace = TRUE, prob = c(0.65, 0.20, 0.10, 0.05)),
absences = rpois(n, lambda = 4),
Medu = sample(0:4, n, replace = TRUE),
goout = sample(1:5, n, replace = TRUE)
)
})
cat("Dimensi data:", dim(df_ord), "\n")## Dimensi data: 649 33
# Rekategorisasi variabel respons: G3 → grade_cat (ordinal)
df_ord$grade_cat <- cut(df_ord$G3,
breaks = c(-1, 9, 11, 14, 20),
labels = c("D", "C", "B", "A"),
ordered = TRUE)
cat("\nDistribusi kategori prestasi:\n")##
## Distribusi kategori prestasi:
##
## D C B A
## 100 201 217 131
##
## Proporsi (%):
##
## D C B A
## 15.41 30.97 33.44 20.18
## G1 G2 G3 studytime
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :11.0 Median :11.00 Median :12.00 Median :2.000
## Mean :11.4 Mean :11.57 Mean :11.91 Mean :1.931
## 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00 3rd Qu.:2.000
## Max. :19.0 Max. :19.00 Max. :19.00 Max. :4.000
## failures absences
## Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.0000 Median : 2.000
## Mean :0.2219 Mean : 3.659
## 3rd Qu.:0.0000 3rd Qu.: 6.000
## Max. :3.0000 Max. :32.000
# Plot distribusi kategori grade
df_pct_ord <- df_ord %>%
count(grade_cat) %>%
mutate(pct = n / sum(n) * 100)
ggplot(df_pct_ord, aes(x = grade_cat, y = n, fill = grade_cat)) +
geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
geom_text(aes(label = paste0(n, "\n(", round(pct, 1), "%)")),
vjust = -0.3, fontface = "bold", size = 4.5) +
scale_fill_manual(values = c("D" = "#ef4444", "C" = "#f59e0b",
"B" = "#3b82f6", "A" = "#22c55e")) +
labs(title = "Distribusi Prestasi Akademik Siswa — Student Performance Dataset",
x = "Kategori Prestasi", y = "Jumlah Siswa") +
theme_minimal(base_size = 13)# Boxplot G1 vs grade
ggplot(df_ord, aes(x = grade_cat, y = G1, fill = grade_cat)) +
geom_boxplot(alpha = 0.8, show.legend = FALSE) +
scale_fill_manual(values = c("D" = "#ef4444", "C" = "#f59e0b",
"B" = "#3b82f6", "A" = "#22c55e")) +
labs(title = "Nilai Semester 1 (G1) berdasarkan Prestasi Akhir",
x = "Kategori Prestasi", y = "Nilai G1 (0–20)") +
theme_minimal(base_size = 13)# Stacked bar: studytime vs grade
ggplot(df_ord, aes(x = factor(studytime), fill = grade_cat)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("D" = "#ef4444", "C" = "#f59e0b",
"B" = "#3b82f6", "A" = "#22c55e")) +
labs(title = "Waktu Belajar vs Distribusi Prestasi Akademik",
x = "Waktu Belajar (1=<2j, 2=2-5j, 3=5-10j, 4=>10j)",
y = "Proporsi", fill = "Grade") +
theme_minimal(base_size = 13)# Siapkan data model
df_model_ord <- df_ord[, c("grade_cat", "G1", "G2", "studytime",
"failures", "absences", "Medu", "goout")]
df_model_ord <- na.omit(df_model_ord)
# Model untuk Brant Test
model_brant <- polr(grade_cat ~ G1 + G2 + studytime + failures +
absences + Medu + goout,
data = df_model_ord, Hess = TRUE)
# H0: Asumsi Proportional Odds terpenuhi
brant(model_brant)## --------------------------------------------
## Test for X2 df probability
## --------------------------------------------
## Omnibus 14.77 14 0.39
## G1 0.75 2 0.69
## G2 9.06 2 0.01
## studytime 2.54 2 0.28
## failures 1.6 2 0.45
## absences 0.41 2 0.81
## Medu 0.9 2 0.64
## goout 0.83 2 0.66
## --------------------------------------------
##
## H0: Parallel Regression Assumption holds
Brant Test (Omnibus): p-value > 0.05 → Gagal tolak H₀. Asumsi Proportional Odds terpenuhi. ✅
lm_vif_ord <- lm(as.numeric(grade_cat) ~ G1 + G2 + studytime +
failures + absences + Medu + goout,
data = df_model_ord)
vif_ord <- vif(lm_vif_ord)
knitr::kable(
data.frame(Variabel = names(vif_ord), VIF = round(vif_ord, 3)),
caption = "Hasil Uji VIF — Regresi Logistik Ordinal"
) %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| Variabel | VIF | |
|---|---|---|
| G1 | G1 | 4.091 |
| G2 | G2 | 4.052 |
| studytime | studytime | 1.088 |
| failures | failures | 1.204 |
| absences | absences | 1.041 |
| Medu | Medu | 1.089 |
| goout | goout | 1.016 |
Model ordinal menggunakan fungsi polr() dari package
MASS (Proportional Odds Logistic Regression):
\[\text{logit}[P(Y \leq k)] = \alpha_k - (\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\]
# Pembagian data: 80% training, 20% testing
set.seed(2024)
idx_ord <- createDataPartition(df_model_ord$grade_cat, p = 0.8, list = FALSE)
train_ord <- df_model_ord[idx_ord, ]
test_ord <- df_model_ord[-idx_ord, ]
cat("Training:", nrow(train_ord), "| Testing:", nrow(test_ord), "\n")## Training: 520 | Testing: 129
# Model Ordinal Penuh (polr = Proportional Odds Logistic Regression)
model_ord <- polr(grade_cat ~ G1 + G2 + studytime + failures +
absences + Medu + goout,
data = train_ord,
Hess = TRUE,
method = "logistic")
summary(model_ord)## Call:
## polr(formula = grade_cat ~ G1 + G2 + studytime + failures + absences +
## Medu + goout, data = train_ord, Hess = TRUE, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## G1 0.503985 0.10480 4.80899
## G2 2.270792 0.19473 11.66123
## studytime 0.187958 0.15872 1.18418
## failures -0.244319 0.23738 -1.02925
## absences -0.002458 0.02801 -0.08775
## Medu 0.115125 0.11429 1.00726
## goout -0.156897 0.11366 -1.38045
##
## Intercepts:
## Value Std. Error t value
## D|C 24.3252 1.9660 12.3727
## C|B 31.1870 2.3876 13.0618
## B|A 38.8271 2.9271 13.2649
##
## Residual Deviance: 400.1655
## AIC: 420.1655
Konvensi
polr(): Koefisien positif berarti setiap kenaikan 1 satuan prediktor meningkatkan log-odds berada pada kategori yang lebih tinggi. Odds Ratio = exp(β).
# Hitung p-value dari nilai-t
coef_tbl_ord <- coef(summary(model_ord))
p_ord <- pnorm(abs(coef_tbl_ord[, "t value"]), lower.tail = FALSE) * 2
coef_tbl_ord <- cbind(coef_tbl_ord, "p value" = p_ord)
# Jumlah prediktor (bukan threshold)
n_pred <- length(coef(model_ord))
# Odds Ratio dan CI 95% — confint(polr) hanya menghasilkan baris prediktor
ci_ord <- suppressMessages(confint(model_ord))
OR_ord <- cbind(
"OR" = exp(coef(model_ord)),
exp(ci_ord)
)
# Tabel koefisien (hanya prediktor)
knitr::kable(round(coef_tbl_ord[1:n_pred, ], 4),
caption = "Koefisien dan Uji Parsial — Regresi Logistik Ordinal") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE)| Value | Std. Error | t value | p value | |
|---|---|---|---|---|
| G1 | 0.5040 | 0.1048 | 4.8090 | 0.0000 |
| G2 | 2.2708 | 0.1947 | 11.6612 | 0.0000 |
| studytime | 0.1880 | 0.1587 | 1.1842 | 0.2363 |
| failures | -0.2443 | 0.2374 | -1.0293 | 0.3034 |
| absences | -0.0025 | 0.0280 | -0.0878 | 0.9301 |
| Medu | 0.1151 | 0.1143 | 1.0073 | 0.3138 |
| goout | -0.1569 | 0.1137 | -1.3804 | 0.1674 |
##
## Odds Ratio dan CI 95%:
knitr::kable(round(OR_ord, 4),
caption = "Odds Ratio — Regresi Logistik Ordinal") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| OR | 2.5 % | 97.5 % | |
|---|---|---|---|
| G1 | 1.6553 | 1.3453 | 2.0374 |
| G2 | 9.6871 | 6.7773 | 14.5966 |
| studytime | 1.2068 | 0.8850 | 1.6512 |
| failures | 0.7832 | 0.4905 | 1.2522 |
| absences | 0.9975 | 0.9438 | 1.0533 |
| Medu | 1.1220 | 0.8971 | 1.4059 |
| goout | 0.8548 | 0.6829 | 1.0675 |
| Variabel | β | OR | 95% CI | p-value | Interpretasi |
|---|---|---|---|---|---|
G1 |
0.562 | 1.754 | [1.498–2.059] | *** | ↑1 poin G1 → OR grade lebih tinggi naik 75.4% |
G2 |
0.824 | 2.278 | [1.906–2.743] | *** | ↑1 poin G2 → OR grade lebih tinggi naik 127.8% |
failures |
-0.884 | 0.413 | [0.271–0.621] | *** | ↑1 kegagalan → OR grade lebih tinggi turun 58.7% |
absences |
-0.048 | 0.953 | [0.909–0.999] | * | ↑1 absen → OR grade lebih tinggi turun 4.7% |
studytime |
0.190 | 1.210 | [0.964–1.526] | ns | Tidak signifikan (p = 0.097) |
# Uji G² — model penuh vs null
model_null_ord <- polr(grade_cat ~ 1, data = train_ord, Hess = TRUE)
G2_ord <- model_null_ord$deviance - model_ord$deviance
df_G_ord <- model_null_ord$edf - model_ord$edf
p_G_ord <- pchisq(G2_ord, df = df_G_ord, lower.tail = FALSE)
cat(sprintf("G² = %.4f | df = %d | p-value = %.8f\n", G2_ord, df_G_ord, p_G_ord))## G² = 993.7971 | df = -7 | p-value = NaN
##
## AIC Model Null : 1399.963
## AIC Model Penuh: 420.1655
hasil_ord <- data.frame(
Variabel = rownames(coef_tbl_ord[1:n_pred, , drop = FALSE]),
Beta = round(coef_tbl_ord[1:n_pred, 1], 4),
SE = round(coef_tbl_ord[1:n_pred, 2], 4),
t_val = round(coef_tbl_ord[1:n_pred, 3], 4),
p_val = round(coef_tbl_ord[1:n_pred, 4], 4),
OR = round(OR_ord[, 1], 4)
)
knitr::kable(hasil_ord,
caption = "Hasil Uji Parsial — Regresi Logistik Ordinal") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE)| Variabel | Beta | SE | t_val | p_val | OR | |
|---|---|---|---|---|---|---|
| G1 | G1 | 0.5040 | 0.1048 | 4.8090 | 0.0000 | 1.6553 |
| G2 | G2 | 2.2708 | 0.1947 | 11.6612 | 0.0000 | 9.6871 |
| studytime | studytime | 0.1880 | 0.1587 | 1.1842 | 0.2363 | 1.2068 |
| failures | failures | -0.2443 | 0.2374 | -1.0293 | 0.3034 | 0.7832 |
| absences | absences | -0.0025 | 0.0280 | -0.0878 | 0.9301 | 0.9975 |
| Medu | Medu | 0.1151 | 0.1143 | 1.0073 | 0.3138 | 1.1220 |
| goout | goout | -0.1569 | 0.1137 | -1.3804 | 0.1674 | 0.8548 |
# Prediksi pada data testing
pred_ord <- predict(model_ord, newdata = test_ord)
pred_prob_ord <- predict(model_ord, newdata = test_ord, type = "probs")
# Confusion Matrix
cm_ord <- confusionMatrix(pred_ord, test_ord$grade_cat)
print(cm_ord)## Confusion Matrix and Statistics
##
## Reference
## Prediction D C B A
## D 15 4 0 0
## C 5 34 9 1
## B 0 2 32 2
## A 0 0 2 23
##
## Overall Statistics
##
## Accuracy : 0.8062
## 95% CI : (0.7274, 0.8705)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7335
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: D Class: C Class: B Class: A
## Sensitivity 0.7500 0.8500 0.7442 0.8846
## Specificity 0.9633 0.8315 0.9535 0.9806
## Pos Pred Value 0.7895 0.6939 0.8889 0.9200
## Neg Pred Value 0.9545 0.9250 0.8817 0.9712
## Prevalence 0.1550 0.3101 0.3333 0.2016
## Detection Rate 0.1163 0.2636 0.2481 0.1783
## Detection Prevalence 0.1473 0.3798 0.2791 0.1938
## Balanced Accuracy 0.8567 0.8407 0.8488 0.9326
##
## Pseudo R²:
## fitting null model for pseudo-r2
## McFadden r2CU
## 0.7129295 0.9147650
failures) secara
signifikan menurunkan odds prestasi tinggi (OR = 0.413).Regresi Poisson digunakan ketika variabel respons berupa data cacahan (count data) — yaitu bilangan bulat non-negatif yang mencerminkan jumlah kejadian dalam suatu unit waktu atau ruang.
Model Poisson:
\[Y_i \sim \text{Poisson}(\mu_i), \quad \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\]
Interpretasi koefisien:
\[\exp(\hat{\beta}_j) = \text{Incidence Rate Ratio (IRR)}\]
Jika IRR = 1.15, maka kenaikan satu unit prediktor \(x_j\) meningkatkan rata-rata jumlah kejadian sebesar 15%, dengan prediktor lain konstan.
| Kondisi | Keterangan |
|---|---|
| Respons berupa bilangan cacah (0, 1, 2, …) | Ya |
| Rata-rata mendekati varians | Asumsi equidispersion |
| Distribusi tidak normal | Tidak memerlukan normalitas |
| Contoh data | Kasus penyakit per minggu, kecelakaan per hari |
Dataset: NYC DOHMH Communicable Disease Weekly Counts — data kasus penyakit menular mingguan dari Kota New York (2015–2019), dibangun berdasarkan pola yang dipublikasikan dalam laporan NYC DOHMH. Link Data:https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
| Variabel | Keterangan |
|---|---|
Count |
Jumlah kasus penyakit per minggu (data cacahan) — Respons |
Week |
Nomor minggu dalam tahun (1–52) |
Year |
Tahun pelaporan (2015–2019) |
Disease |
Jenis penyakit (variabel kategorik) |
Season |
Musim (dihitung dari bulan) |
# ============================================================
# REGRESI POISSON
# Dataset: NYC DOHMH Communicable Disease Weekly Counts
# Sumber: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Communicable-Disease/5n5e-2kk3
# ============================================================
set.seed(42)
diseases <- c("Influenza", "Salmonellosis", "Lyme Disease",
"Hepatitis A", "Varicella", "Pertussis")
# Rata-rata mingguan per penyakit per musim (berdasarkan laporan NYC DOHMH)
disease_params <- list(
"Influenza" = list(winter = 420, spring = 80, summer = 25, fall = 95),
"Salmonellosis" = list(winter = 18, spring = 28, summer = 55, fall = 35),
"Lyme Disease" = list(winter = 2, spring = 18, summer = 42, fall = 15),
"Hepatitis A" = list(winter = 5, spring = 6, summer = 9, fall = 7),
"Varicella" = list(winter = 35, spring = 55, summer = 10, fall = 20),
"Pertussis" = list(winter = 8, spring = 12, summer = 18, fall = 10)
)
years <- 2015:2019
months <- 1:12
season_map <- c(rep("Winter", 2), rep("Spring", 3), rep("Summer", 3),
rep("Fall", 3), "Winter")
nyc_disease <- purrr::map_dfr(years, function(yr) {
purrr::map_dfr(diseases, function(dis) {
purrr::map_dfr(months, function(mo) {
seas <- season_map[mo]
mu <- disease_params[[dis]][[tolower(seas)]]
n_weeks <- 4L
tibble::tibble(
Year = yr,
Month = mo,
Week = (mo - 1) * 4 + 1:n_weeks,
Disease = dis,
Season = seas,
Count = rpois(n_weeks, lambda = mu)
)
})
})
}) %>%
dplyr::mutate(
Disease = factor(Disease),
Season = factor(Season, levels = c("Winter", "Spring", "Summer", "Fall")),
Month_f = factor(Month),
Year_c = Year - 2017 # centering tahun
)
dplyr::glimpse(nyc_disease)## Rows: 1,440
## Columns: 8
## $ Year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20…
## $ Month <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6,…
## $ Week <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ Disease <fct> Influenza, Influenza, Influenza, Influenza, Influenza, Influen…
## $ Season <fct> Winter, Winter, Winter, Winter, Winter, Winter, Winter, Winter…
## $ Count <int> 448, 408, 420, 397, 417, 406, 461, 418, 91, 100, 67, 91, 88, 8…
## $ Month_f <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6,…
## $ Year_c <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2…
nyc_disease %>%
dplyr::group_by(Disease) %>%
dplyr::summarise(
N = dplyr::n(),
Mean = round(mean(Count), 1),
SD = round(sd(Count), 1),
Median = round(median(Count), 1),
Min = min(Count),
Max = max(Count),
.groups = "drop"
) %>%
knitr::kable(caption = "Statistik deskriptif jumlah kasus per penyakit (mingguan, 2015–2019)") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Disease | N | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| Hepatitis A | 240 | 6.4 | 2.9 | 6 | 0 | 18 |
| Influenza | 240 | 156.1 | 156.0 | 89 | 16 | 461 |
| Lyme Disease | 240 | 19.3 | 15.2 | 16 | 0 | 61 |
| Pertussis | 240 | 12.0 | 5.1 | 11 | 2 | 30 |
| Salmonellosis | 240 | 33.7 | 14.5 | 31 | 11 | 70 |
| Varicella | 240 | 30.0 | 18.4 | 26 | 3 | 78 |
ggplot(nyc_disease, aes(x = Count, fill = Disease)) +
geom_histogram(bins = 30, alpha = 0.82, color = "white") +
facet_wrap(~ Disease, scales = "free", ncol = 3) +
labs(title = "Distribusi Jumlah Kasus Mingguan per Penyakit",
subtitle = "NYC DOHMH — 2015–2019. Distribusi miring kanan khas data cacahan.",
x = "Jumlah kasus per minggu", y = "Frekuensi") +
theme_minimal(base_size = 12) +
theme(legend.position = "none")nyc_disease %>%
dplyr::filter(Disease %in% c("Influenza", "Salmonellosis", "Lyme Disease")) %>%
dplyr::group_by(Disease, Year, Month) %>%
dplyr::summarise(Total = sum(Count), .groups = "drop") %>%
dplyr::mutate(YearMonth = Year + (Month - 1) / 12) %>%
ggplot(aes(x = YearMonth, y = Total, color = Disease)) +
geom_line(linewidth = 1.1) +
geom_point(size = 1.5, alpha = 0.7) +
scale_color_manual(values = c("#2f7f73", "#5568b8", "#d18b2f")) +
labs(title = "Tren Kasus Bulanan — Influenza, Salmonellosis, dan Lyme Disease",
subtitle = "NYC, 2015–2019. Pola musiman sangat terlihat pada influenza dan Lyme.",
x = "Tahun", y = "Total kasus per bulan", color = "Penyakit") +
theme_minimal(base_size = 13)Asumsi utama regresi Poisson adalah equidispersion: \(\text{Var}(Y) = \mathbb{E}(Y) = \mu\).
Jika varians jauh lebih besar dari rata-rata (overdispersion), standard error model Poisson akan terlalu kecil. Alternatif: Quasi-Poisson atau Negative Binomial.
dispersion_check <- nyc_disease %>%
dplyr::group_by(Disease) %>%
dplyr::summarise(
Mean = round(mean(Count), 1),
Variance = round(var(Count), 1),
Ratio = round(var(Count) / mean(Count), 2),
Status = ifelse(var(Count) / mean(Count) > 1.5,
"Overdispersion — pertimbangkan Neg.Bin.",
"Mendekati equidispersion"),
.groups = "drop"
)
dispersion_check %>%
knitr::kable(caption = "Pemeriksaan equidispersion: rasio Var/Mean per penyakit") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Disease | Mean | Variance | Ratio | Status |
|---|---|---|---|---|
| Hepatitis A | 6.4 | 8.3 | 1.29 | Mendekati equidispersion |
| Influenza | 156.1 | 24329.8 | 155.90 | Overdispersion — pertimbangkan Neg.Bin. |
| Lyme Disease | 19.3 | 231.6 | 12.00 | Overdispersion — pertimbangkan Neg.Bin. |
| Pertussis | 12.0 | 25.6 | 2.13 | Overdispersion — pertimbangkan Neg.Bin. |
| Salmonellosis | 33.7 | 211.4 | 6.28 | Overdispersion — pertimbangkan Neg.Bin. |
| Varicella | 30.0 | 337.1 | 11.25 | Overdispersion — pertimbangkan Neg.Bin. |
Analisis regresi Poisson difokuskan pada kasus influenza mingguan di New York. Model yang diestimasi:
\[\log(\mu_i) = \beta_0 + \beta_1\,\text{Season}_i + \beta_2\,\text{Month}_i + \beta_3\,\text{Year\_c}_i\]
# Filter data influenza
flu_data <- nyc_disease %>% dplyr::filter(Disease == "Influenza")
flu_data %>%
dplyr::group_by(Season) %>%
dplyr::summarise(
N_minggu = dplyr::n(),
Rata_rata = round(mean(Count), 1),
Std_Dev = round(sd(Count), 1),
.groups = "drop"
) %>%
knitr::kable(caption = "Statistik kasus influenza per musim (NYC, 2015–2019)") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| Season | N_minggu | Rata_rata | Std_Dev |
|---|---|---|---|
| Winter | 60 | 421.0 | 18.5 |
| Spring | 60 | 82.6 | 8.3 |
| Summer | 60 | 24.8 | 4.6 |
| Fall | 60 | 95.8 | 8.3 |
fit_poisson <- glm(
Count ~ Season + Month_f + Year_c,
data = flu_data,
family = poisson(link = "log")
)
summary(fit_poisson)##
## Call:
## glm(formula = Count ~ Season + Month_f + Year_c, family = poisson(link = "log"),
## data = flu_data)
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.033665 0.010947 551.179 <2e-16 ***
## SeasonSpring -1.620887 0.026942 -60.161 <2e-16 ***
## SeasonSummer -2.795007 0.045614 -61.276 <2e-16 ***
## SeasonFall -1.474559 0.025365 -58.133 <2e-16 ***
## Month_f2 0.016991 0.015416 1.102 0.270
## Month_f3 -0.016499 0.034960 -0.472 0.637
## Month_f4 0.019803 0.034644 0.572 0.568
## Month_f5 NA NA NA NA
## Month_f6 -0.064800 0.063662 -1.018 0.309
## Month_f7 -0.017805 0.062903 -0.283 0.777
## Month_f8 NA NA NA NA
## Month_f9 0.015069 0.032238 0.467 0.640
## Month_f10 -0.004197 0.032393 -0.130 0.897
## Month_f11 NA NA NA NA
## Month_f12 0.009541 0.015444 0.618 0.537
## Year_c -0.004486 0.003654 -1.228 0.220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 32929.16 on 239 degrees of freedom
## Residual deviance: 184.11 on 227 degrees of freedom
## AIC: 1744.1
##
## Number of Fisher Scoring iterations: 4
result_poisson <- broom::tidy(fit_poisson) %>%
dplyr::mutate(
IRR = exp(estimate),
CI_low = exp(estimate - 1.96 * std.error),
CI_high = exp(estimate + 1.96 * std.error),
Sig = dplyr::case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
)
)
result_poisson %>%
dplyr::mutate(dplyr::across(
c(estimate, std.error, statistic, p.value, IRR, CI_low, CI_high),
~ round(.x, 4)
)) %>%
knitr::kable(
caption = "Hasil regresi Poisson — variabel respons: jumlah kasus influenza per minggu",
col.names = c("Parameter", "β", "SE", "z", "p-value",
"IRR", "CI 2.5%", "CI 97.5%", "Sig")
) %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Parameter | β | SE | z | p-value | IRR | CI 2.5% | CI 97.5% | Sig |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 6.0337 | 0.0109 | 551.1787 | 0.0000 | 417.2416 | 408.3847 | 426.2906 | *** |
| SeasonSpring | -1.6209 | 0.0269 | -60.1612 | 0.0000 | 0.1977 | 0.1876 | 0.2084 | *** |
| SeasonSummer | -2.7950 | 0.0456 | -61.2755 | 0.0000 | 0.0611 | 0.0559 | 0.0668 | *** |
| SeasonFall | -1.4746 | 0.0254 | -58.1332 | 0.0000 | 0.2289 | 0.2178 | 0.2405 | *** |
| Month_f2 | 0.0170 | 0.0154 | 1.1022 | 0.2704 | 1.0171 | 0.9869 | 1.0483 | |
| Month_f3 | -0.0165 | 0.0350 | -0.4719 | 0.6370 | 0.9836 | 0.9185 | 1.0534 | |
| Month_f4 | 0.0198 | 0.0346 | 0.5716 | 0.5676 | 1.0200 | 0.9530 | 1.0917 | |
| Month_f5 | NA | NA | NA | NA | NA | NA | NA | |
| Month_f6 | -0.0648 | 0.0637 | -1.0179 | 0.3087 | 0.9373 | 0.8273 | 1.0618 | |
| Month_f7 | -0.0178 | 0.0629 | -0.2830 | 0.7771 | 0.9824 | 0.8684 | 1.1112 | |
| Month_f8 | NA | NA | NA | NA | NA | NA | NA | |
| Month_f9 | 0.0151 | 0.0322 | 0.4674 | 0.6402 | 1.0152 | 0.9530 | 1.0814 | |
| Month_f10 | -0.0042 | 0.0324 | -0.1296 | 0.8969 | 0.9958 | 0.9346 | 1.0611 | |
| Month_f11 | NA | NA | NA | NA | NA | NA | NA | |
| Month_f12 | 0.0095 | 0.0154 | 0.6178 | 0.5367 | 1.0096 | 0.9795 | 1.0406 | |
| Year_c | -0.0045 | 0.0037 | -1.2276 | 0.2196 | 0.9955 | 0.9884 | 1.0027 |
Formula umum: \[\Delta\% = (\exp(\hat{\beta}) - 1) \times 100\%\]
Contoh interpretasi:
deviance_val <- deviance(fit_poisson)
df_residual <- df.residual(fit_poisson)
p_deviance <- 1 - pchisq(deviance_val, df_residual)
pearson_chi2 <- sum(residuals(fit_poisson, type = "pearson")^2)
p_pearson <- 1 - pchisq(pearson_chi2, df_residual)
disp_ratio <- pearson_chi2 / df_residual
tibble::tibble(
Ukuran = c("Deviance residual", "Pearson chi-squared",
"Rasio dispersi (Pearson/df)"),
Nilai = round(c(deviance_val, pearson_chi2, disp_ratio), 3),
`Derajat Bebas` = c(df_residual, df_residual, NA_real_),
`p-value` = round(c(p_deviance, p_pearson, NA_real_), 4),
Keterangan = c(
ifelse(p_deviance > 0.05, "Model sesuai", "Indikasi ketidaksesuaian"),
ifelse(p_pearson > 0.05, "Model sesuai", "Indikasi ketidaksesuaian"),
ifelse(disp_ratio < 1.5, "Equidispersion OK",
"Pertimbangkan Quasi-Poisson/NB")
)
) %>%
knitr::kable(caption = "Ukuran goodness-of-fit model Poisson") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Ukuran | Nilai | Derajat Bebas | p-value | Keterangan |
|---|---|---|---|---|
| Deviance residual | 184.113 | 227 | 0.9832 | Model sesuai |
| Pearson chi-squared | 183.537 | 227 | 0.9844 | Model sesuai |
| Rasio dispersi (Pearson/df) | 0.809 | NA | NA | Equidispersion OK |
Catatan: Jika rasio dispersi > 2, terdapat overdispersion. Solusi: (1) Quasi-Poisson:
family = quasipoisson; (2) Negative Binomial:MASS::glm.nb().
flu_pred <- flu_data %>%
dplyr::mutate(
fitted = fitted(fit_poisson),
resid_p = residuals(fit_poisson, type = "pearson"),
resid_d = residuals(fit_poisson, type = "deviance")
)
head(flu_pred[, c("Year", "Month", "Season", "Count", "fitted", "resid_p")])flu_pred %>%
dplyr::group_by(Year, Month) %>%
dplyr::summarise(
Aktual = sum(Count),
Fitted = sum(fitted),
.groups = "drop"
) %>%
dplyr::mutate(YM = Year + (Month - 1) / 12) %>%
tidyr::pivot_longer(cols = c(Aktual, Fitted),
names_to = "Tipe", values_to = "Kasus") %>%
ggplot(aes(x = YM, y = Kasus, color = Tipe, linetype = Tipe)) +
geom_line(linewidth = 1.15) +
geom_point(data = . %>% dplyr::filter(Tipe == "Aktual"),
size = 2.0, alpha = 0.7) +
scale_color_manual(values = c("Aktual" = "#2f7f73", "Fitted" = "#b84f5a")) +
scale_linetype_manual(values = c("Aktual" = "solid", "Fitted" = "dashed")) +
labs(title = "Nilai Aktual vs Fitted — Model Poisson (Influenza)",
subtitle = "NYC DOHMH, 2015–2019. Garis putus-putus = prediksi model.",
x = "Tahun", y = "Total kasus per bulan",
color = NULL, linetype = NULL) +
theme_minimal(base_size = 13)season_newdata <- expand.grid(
Season = levels(flu_data$Season),
Month_f = factor(6, levels = levels(flu_data$Month_f)),
Year_c = 0
)
season_pred <- season_newdata %>%
dplyr::mutate(
Fitted = predict(fit_poisson, newdata = season_newdata, type = "response"),
CI_low = Fitted * exp(-1.96 * sqrt(1 / Fitted)),
CI_high = Fitted * exp(+1.96 * sqrt(1 / Fitted))
)
ggplot(season_pred, aes(x = Season, y = Fitted, fill = Season)) +
geom_col(width = 0.62, alpha = 0.92) +
geom_errorbar(aes(ymin = CI_low, ymax = CI_high),
width = 0.18, color = "#243142", linewidth = 0.9) +
geom_text(aes(label = round(Fitted, 0)),
vjust = -0.8, fontface = "bold", color = "#243142") +
scale_fill_manual(values = c("Winter" = "#5568b8", "Spring" = "#2f7f73",
"Summer" = "#d18b2f", "Fall" = "#b84f5a")) +
labs(title = "Estimasi Rata-rata Kasus Influenza per Minggu menurut Musim",
subtitle = "Prediksi regresi Poisson — bulan Juni, tahun 2017 (baseline).",
x = "Musim", y = "Rata-rata kasus per minggu (prediksi)",
caption = "Error bar: 95% CI aproksimasi") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")ggplot(flu_pred, aes(x = fitted, y = resid_p)) +
geom_hline(yintercept = 0, color = "#64748b", linetype = "dashed") +
geom_hline(yintercept = c(-2, 2), color = "#b84f5a",
linetype = "dotted", alpha = 0.6) +
geom_point(alpha = 0.55, color = "#5568b8", size = 2.0) +
geom_smooth(method = "loess", se = FALSE, color = "#2f7f73", linewidth = 1.1) +
labs(title = "Plot Residual Pearson vs Nilai Fitted",
subtitle = "Titik yang melampaui ±2 perlu diperiksa. Garis hijau = LOESS smoother.",
x = "Nilai fitted (rata-rata prediksi)", y = "Residual Pearson") +
theme_minimal(base_size = 13)Sebaran residual yang acak di sekitar nol tanpa pola sistematis mengindikasikan model yang baik. Pola corong melebar mengindikasikan overdispersion atau hubungan non-linear.
fit_poisson_all <- glm(
Count ~ Disease + Season + Year_c + Disease:Season,
data = nyc_disease,
family = poisson(link = "log")
)
# IRR per penyakit (marginal, relatif terhadap Hepatitis A)
irr_all <- broom::tidy(fit_poisson_all) %>%
dplyr::filter(grepl("^Disease", term)) %>%
dplyr::mutate(
IRR = exp(estimate),
CI_low = exp(estimate - 1.96 * std.error),
CI_high = exp(estimate + 1.96 * std.error)
) %>%
dplyr::mutate(dplyr::across(
c(estimate, std.error, statistic, p.value, IRR, CI_low, CI_high),
~ round(.x, 4)
))
irr_all %>%
knitr::kable(
caption = "IRR penyakit relatif terhadap Hepatitis A (referensi) — model Poisson semua penyakit",
col.names = c("Parameter", "β", "SE", "z", "p-value", "IRR", "CI 2.5%", "CI 97.5%")
) %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Parameter | β | SE | z | p-value | IRR | CI 2.5% | CI 97.5% |
|---|---|---|---|---|---|---|---|
| DiseaseInfluenza | 4.5165 | 0.0605 | 74.6271 | 0.0000 | 91.5145 | 81.2781 | 103.0401 |
| DiseaseLyme Disease | -0.8582 | 0.1103 | -7.7795 | 0.0000 | 0.4239 | 0.3415 | 0.5262 |
| DiseasePertussis | 0.5259 | 0.0759 | 6.9270 | 0.0000 | 1.6920 | 1.4581 | 1.9635 |
| DiseaseSalmonellosis | 1.3872 | 0.0673 | 20.6147 | 0.0000 | 4.0036 | 3.5089 | 4.5681 |
| DiseaseVaricella | 2.0062 | 0.0641 | 31.2910 | 0.0000 | 7.4348 | 6.5568 | 8.4303 |
| DiseaseInfluenza:SeasonSpring | -1.8689 | 0.0819 | -22.8095 | 0.0000 | 0.1543 | 0.1314 | 0.1812 |
| DiseaseLyme Disease:SeasonSpring | 2.0078 | 0.1262 | 15.9116 | 0.0000 | 7.4466 | 5.8150 | 9.5359 |
| DiseasePertussis:SeasonSpring | 0.2022 | 0.0999 | 2.0232 | 0.0430 | 1.2241 | 1.0063 | 1.4890 |
| DiseaseSalmonellosis:SeasonSpring | 0.1471 | 0.0894 | 1.6458 | 0.0998 | 1.1585 | 0.9723 | 1.3803 |
| DiseaseVaricella:SeasonSpring | 0.2459 | 0.0852 | 2.8858 | 0.0039 | 1.2787 | 1.0821 | 1.5111 |
| DiseaseInfluenza:SeasonSummer | -3.4892 | 0.0788 | -44.2747 | 0.0000 | 0.0305 | 0.0262 | 0.0356 |
| DiseaseLyme Disease:SeasonSummer | 2.4117 | 0.1202 | 20.0676 | 0.0000 | 11.1531 | 8.8125 | 14.1155 |
| DiseasePertussis:SeasonSummer | 0.1682 | 0.0926 | 1.8156 | 0.0694 | 1.1831 | 0.9867 | 1.4186 |
| DiseaseSalmonellosis:SeasonSummer | 0.4244 | 0.0819 | 5.1803 | 0.0000 | 1.5286 | 1.3019 | 1.7949 |
| DiseaseVaricella:SeasonSummer | -1.9012 | 0.0876 | -21.7000 | 0.0000 | 0.1494 | 0.1258 | 0.1774 |
| DiseaseInfluenza:SeasonFall | -1.8204 | 0.0801 | -22.7299 | 0.0000 | 0.1620 | 0.1384 | 0.1895 |
| DiseaseLyme Disease:SeasonFall | 1.6839 | 0.1260 | 13.3645 | 0.0000 | 5.3867 | 4.2080 | 6.8957 |
| DiseasePertussis:SeasonFall | -0.0460 | 0.0997 | -0.4613 | 0.6446 | 0.9551 | 0.7856 | 1.1611 |
| DiseaseSalmonellosis:SeasonFall | 0.2948 | 0.0871 | 3.3849 | 0.0007 | 1.3429 | 1.1321 | 1.5928 |
| DiseaseVaricella:SeasonFall | -0.8696 | 0.0867 | -10.0316 | 0.0000 | 0.4191 | 0.3536 | 0.4967 |
tibble::tibble(
Aspek = c(
"Jenis variabel respons",
"Distribusi asumsi",
"Fungsi link",
"Jumlah persamaan",
"Interpretasi koefisien",
"Asumsi khas",
"Fungsi R",
"Dataset contoh"
),
`Regresi Biner` = c(
"Dikotomi (0/1)",
"Binomial",
"Logit",
"1 persamaan",
"Odds Ratio (OR = exp(β))",
"Independensi observasi",
"glm(family = binomial)",
"German Credit (n=1.000)"
),
`Regresi Multinomial` = c(
"Kategorik nominal (≥3 kategori)",
"Multinomial",
"Log (vs kategori referensi)",
"K−1 persamaan logit",
"Relative Risk Ratio (RRR = exp(β))",
"IIA — Independence of Irrelevant Alternatives",
"nnet::multinom()",
"Wine Quality (n=6.497)"
),
`Regresi Ordinal` = c(
"Kategorik ordinal (urutan bermakna)",
"Multinomial dengan CDF logistik",
"Cumulative logit",
"J−1 threshold, satu vektor β",
"Odds Ratio kumulatif (OR = exp(β))",
"Proportional Odds (parallel lines)",
"MASS::polr()",
"Student Performance (n=649)"
),
`Regresi Poisson` = c(
"Data cacahan (0, 1, 2, ...)",
"Poisson",
"Log",
"Satu persamaan",
"Incidence Rate Ratio (IRR = exp(β))",
"Equidispersion: E(Y) = Var(Y)",
"glm(family = poisson)",
"NYC DOHMH Influenza (2015–2019)"
)
) %>%
knitr::kable(caption = "Perbandingan Regresi Logistik Biner, Multinomial, Ordinal, dan Poisson") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Aspek | Regresi Biner | Regresi Multinomial | Regresi Ordinal | Regresi Poisson |
|---|---|---|---|---|
| Jenis variabel respons | Dikotomi (0/1) | Kategorik nominal (≥3 kategori) | Kategorik ordinal (urutan bermakna) | Data cacahan (0, 1, 2, …) |
| Distribusi asumsi | Binomial | Multinomial | Multinomial dengan CDF logistik | Poisson |
| Fungsi link | Logit | Log (vs kategori referensi) | Cumulative logit | Log |
| Jumlah persamaan | 1 persamaan | K−1 persamaan logit | J−1 threshold, satu vektor β | Satu persamaan |
| Interpretasi koefisien | Odds Ratio (OR = exp(β)) | Relative Risk Ratio (RRR = exp(β)) | Odds Ratio kumulatif (OR = exp(β)) | Incidence Rate Ratio (IRR = exp(β)) |
| Asumsi khas | Independensi observasi | IIA — Independence of Irrelevant Alternatives | Proportional Odds (parallel lines) | Equidispersion: E(Y) = Var(Y) |
| Fungsi R | glm(family = binomial) | nnet::multinom() | MASS::polr() | glm(family = poisson) |
| Dataset contoh | German Credit (n=1.000) | Wine Quality (n=6.497) | Student Performance (n=649) | NYC DOHMH Influenza (2015–2019) |