Regresi logistik biner merupakan metode statistika yang digunakan untuk memodelkan hubungan antara satu atau lebih variabel prediktor dengan variabel respons dikotomi (dua kategori: 0/1). Metode ini sangat relevan dalam dunia perbankan, khususnya untuk memprediksi apakah seorang nasabah berpotensi mengalami kredit macet atau tidak.
Sumber data: German Credit Dataset — UCI
Machine Learning Repository, diakses melalui package
caret:
https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data
Variabel Penelitian
| Variabel | Tipe | Keterangan | Kategori/Satuan |
|---|---|---|---|
| Class | Respons | Status Kredit | Good (0) / Bad (1) |
| Duration | Numerik | Durasi kredit | Bulan |
| Amount | Numerik | Jumlah kredit | DM (Deutschmark) |
| InstallmentRatePercentage | Numerik | Tingkat cicilan | % dari pendapatan |
| Age | Numerik | Usia nasabah | Tahun |
| NumberExistingCredits | Numerik | Jumlah kredit aktif | Count |
| ResidenceDuration | Numerik | Lama tinggal | Tahun |
# Muat dataset German Credit dari package caret
data(GermanCredit)
df_biner <- GermanCredit
# Buat variabel respons biner: Good=0, Bad=1
df_biner$Class_bin <- ifelse(df_biner$Class == "Bad", 1, 0)
cat("Dimensi data:", dim(df_biner), "\n")## Dimensi data: 1000 63
str(df_biner[, c("Duration", "Amount", "Age",
"InstallmentRatePercentage",
"ResidenceDuration", "NumberExistingCredits",
"Class_bin")])## 'data.frame': 1000 obs. of 7 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ InstallmentRatePercentage: int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ Class_bin : num 0 1 0 0 1 0 0 0 0 1 ...
## Duration Amount Age InstallmentRatePercentage
## Min. : 4.0 Min. : 250 Min. :19.00 Min. :1.000
## 1st Qu.:12.0 1st Qu.: 1366 1st Qu.:27.00 1st Qu.:2.000
## Median :18.0 Median : 2320 Median :33.00 Median :3.000
## Mean :20.9 Mean : 3271 Mean :35.55 Mean :2.973
## 3rd Qu.:24.0 3rd Qu.: 3972 3rd Qu.:42.00 3rd Qu.:4.000
## Max. :72.0 Max. :18424 Max. :75.00 Max. :4.000
## Class_bin
## Min. :0.0
## 1st Qu.:0.0
## Median :0.0
## Mean :0.3
## 3rd Qu.:1.0
## Max. :1.0
##
## Distribusi Status Kredit:
##
## Bad Good
## 300 700
##
## Proporsi (%):
##
## Bad Good
## 30 70
Interpretasi: Dataset terdiri dari 1.000 observasi. Terdapat 700 nasabah (70%) dengan kredit baik (Good) dan 300 nasabah (30%) dengan kredit macet (Bad). Rata-rata durasi kredit 20,9 bulan, rata-rata jumlah kredit 3.271 DM, dan rata-rata usia 35,5 tahun.
df_plot_b <- df_biner %>%
group_by(Class) %>%
summarise(n = n(), pct = n() / nrow(df_biner) * 100, .groups = "drop")
ggplot(df_plot_b, aes(x = Class, y = n, fill = Class)) +
geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
geom_text(aes(label = paste0(n, "\n(", round(pct, 1), "%)")),
vjust = -0.3, fontface = "bold", size = 5) +
scale_fill_manual(values = c("Bad" = "#1a7a45", "Good" = "#0a2e1a")) +
labs(title = "Distribusi Status Kredit — German Credit Dataset",
x = "Status Kredit", y = "Frekuensi") +
theme_green()Interpretasi: Dari total 1.000 nasabah, 700 berstatus kredit baik (Good, 70%) dan 300 berstatus kredit macet (Bad, 30%). Ketidakseimbangan kelas ini perlu dipertimbangkan dalam evaluasi model.
ggplot(df_biner, aes(x = Class, y = Amount, fill = Class)) +
geom_boxplot(alpha = 0.8, show.legend = FALSE) +
scale_fill_manual(values = c("Bad" = "#52c98a", "Good" = "#1a7a45")) +
labs(title = "Jumlah Kredit berdasarkan Status Kredit",
x = "Status Kredit", y = "Jumlah Kredit (DM)") +
theme_green()Interpretasi: Nasabah dengan kredit macet (Bad) cenderung memiliki median jumlah kredit yang lebih tinggi, mengindikasikan bahwa jumlah kredit yang lebih besar berhubungan dengan risiko macet yang lebih tinggi.
ggplot(df_biner, aes(x = Age, fill = Class)) +
geom_histogram(binwidth = 5, position = "dodge", alpha = 0.8) +
scale_fill_manual(values = c("Bad" = "#52c98a", "Good" = "#1a7a45")) +
labs(title = "Distribusi Usia berdasarkan Status Kredit",
x = "Usia (Tahun)", y = "Frekuensi") +
theme_green()Interpretasi: Nasabah muda (20–30 tahun) memiliki proporsi kredit macet yang lebih tinggi relatif terhadap kelompok usia lainnya, mengindikasikan usia merupakan faktor risiko yang relevan.
df_model_b <- df_biner[, c("Class_bin", "Duration", "Amount", "Age",
"InstallmentRatePercentage",
"ResidenceDuration",
"NumberExistingCredits")]
df_model_b <- na.omit(df_model_b)
lm_vif_b <- lm(Class_bin ~ Duration + Amount + Age +
InstallmentRatePercentage + ResidenceDuration +
NumberExistingCredits,
data = df_model_b)
vif_result_b <- vif(lm_vif_b)
knitr::kable(
data.frame(Variabel = names(vif_result_b), VIF = round(vif_result_b, 3)),
caption = "Hasil Uji VIF — Regresi Logistik Biner"
) %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| Variabel | VIF | |
|---|---|---|
| Duration | Duration | 1.857 |
| Amount | Amount | 1.994 |
| Age | Age | 1.110 |
| InstallmentRatePercentage | InstallmentRatePercentage | 1.221 |
| ResidenceDuration | ResidenceDuration | 1.083 |
| NumberExistingCredits | NumberExistingCredits | 1.027 |
Interpretasi: Semua nilai VIF < 10 (bahkan < 2), menunjukkan tidak ada multikolinearitas yang bermasalah antar variabel prediktor.
set.seed(123)
idx_b <- createDataPartition(df_model_b$Class_bin, p = 0.8, list = FALSE)
train_b <- df_model_b[idx_b, ]
test_b <- df_model_b[-idx_b, ]
cat("Jumlah training:", nrow(train_b), "\n")## Jumlah training: 800
## Jumlah testing : 200
Interpretasi: Data dibagi 80:20 menggunakan stratified random sampling. Sebanyak 800 observasi untuk training dan 200 untuk testing.
set.seed(123)memastikan hasil reproducible.
model_full_b <- glm(Class_bin ~ Duration + Amount + Age +
InstallmentRatePercentage + ResidenceDuration +
NumberExistingCredits,
data = train_b,
family = binomial(link = "logit"))
summary(model_full_b)##
## Call:
## glm(formula = Class_bin ~ Duration + Amount + Age + InstallmentRatePercentage +
## ResidenceDuration + NumberExistingCredits, family = binomial(link = "logit"),
## data = train_b)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.498e+00 4.269e-01 -3.510 0.000448 ***
## Duration 2.434e-02 8.428e-03 2.888 0.003876 **
## Amount 6.369e-05 3.737e-05 1.704 0.088346 .
## Age -1.268e-02 7.756e-03 -1.636 0.101941
## InstallmentRatePercentage 2.082e-01 8.044e-02 2.588 0.009642 **
## ResidenceDuration 1.494e-02 7.549e-02 0.198 0.843124
## NumberExistingCredits -2.726e-01 1.503e-01 -1.814 0.069688 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 959.84 on 799 degrees of freedom
## Residual deviance: 915.91 on 793 degrees of freedom
## AIC: 929.91
##
## Number of Fisher Scoring iterations: 4
Interpretasi: Model penuh mencakup 6 prediktor numerik. Variabel yang menunjukkan pengaruh awal signifikan (p < 0,05) pada model penuh akan dikonfirmasi melalui seleksi stepwise.
##
## Call:
## glm(formula = Class_bin ~ Duration + Amount + Age + InstallmentRatePercentage +
## NumberExistingCredits, family = binomial(link = "logit"),
## data = train_b)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.474e+00 4.084e-01 -3.609 0.000307 ***
## Duration 2.439e-02 8.424e-03 2.895 0.003793 **
## Amount 6.373e-05 3.737e-05 1.705 0.088159 .
## Age -1.230e-02 7.506e-03 -1.638 0.101375
## InstallmentRatePercentage 2.085e-01 8.043e-02 2.592 0.009535 **
## NumberExistingCredits -2.712e-01 1.501e-01 -1.806 0.070896 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 959.84 on 799 degrees of freedom
## Residual deviance: 915.95 on 794 degrees of freedom
## AIC: 927.95
##
## Number of Fisher Scoring iterations: 4
Interpretasi: Metode backward stepwise mengeluarkan variabel yang tidak berkontribusi signifikan. Model final lebih parsimonious dengan AIC lebih rendah dibanding model penuh.
G_b <- model_b$null.deviance - model_b$deviance
df_G_b <- model_b$df.null - model_b$df.residual
p_G_b <- pchisq(G_b, df = df_G_b, lower.tail = FALSE)
uji_G_b <- data.frame(G2 = round(G_b, 3), df = df_G_b, p_value = signif(p_G_b, 4))
knitr::kable(uji_G_b, caption = "Hasil Uji G — Regresi Logistik Biner")| G2 | df | p_value |
|---|---|---|
| 43.891 | 5 | 0 |
Interpretasi: p-value < 0,05 → Tolak H₀. Model secara simultan signifikan — minimal satu prediktor berpengaruh nyata terhadap probabilitas kredit macet.
wald_tbl_b <- summary(model_b)$coefficients
knitr::kable(round(wald_tbl_b, 4),
caption = "Uji Wald Parsial — Regresi Logistik Biner") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE)| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -1.4739 | 0.4084 | -3.6091 | 0.0003 |
| Duration | 0.0244 | 0.0084 | 2.8949 | 0.0038 |
| Amount | 0.0001 | 0.0000 | 1.7052 | 0.0882 |
| Age | -0.0123 | 0.0075 | -1.6382 | 0.1014 |
| InstallmentRatePercentage | 0.2085 | 0.0804 | 2.5922 | 0.0095 |
| NumberExistingCredits | -0.2712 | 0.1501 | -1.8061 | 0.0709 |
Interpretasi: Variabel yang signifikan secara parsial (p < 0,05) memberikan kontribusi nyata terhadap prediksi risiko kredit macet setelah mengontrol variabel lain dalam model.
prob_train_b <- predict(model_b, type = "response")
hl_b <- hoslem.test(train_b$Class_bin, prob_train_b, g = 10)
print(hl_b)##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: train_b$Class_bin, prob_train_b
## X-squared = 6.6697, df = 8, p-value = 0.5727
Interpretasi: Jika p-value > 0,05 → Gagal tolak H₀ → model fit dengan data. Uji Hosmer-Lemeshow sensitif terhadap ukuran sampel besar, sehingga hasil perlu dipadukan dengan evaluasi metrik lainnya.
OR_b <- exp(coef(model_b))
CI_b <- exp(suppressMessages(confint(model_b)))
tbl_b <- cbind("β" = round(coef(model_b), 4),
"OR" = round(OR_b, 4),
round(CI_b, 4))
knitr::kable(tbl_b,
caption = "Odds Ratio dan CI 95% — Regresi Logistik Biner") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE)| β | OR | 2.5 % | 97.5 % | |
|---|---|---|---|---|
| (Intercept) | -1.4739 | 0.2290 | 0.1020 | 0.5069 |
| Duration | 0.0244 | 1.0247 | 1.0080 | 1.0419 |
| Amount | 0.0001 | 1.0001 | 1.0000 | 1.0001 |
| Age | -0.0123 | 0.9878 | 0.9731 | 1.0022 |
| InstallmentRatePercentage | 0.2085 | 1.2318 | 1.0539 | 1.4451 |
| NumberExistingCredits | -0.2712 | 0.7625 | 0.5647 | 1.0181 |
Interpretasi: OR > 1 menunjukkan peningkatan peluang kredit macet. Variabel
Duration(OR ≈ 1,03): setiap +1 bulan durasi kredit meningkatkan peluang macet ~3%. VariabelAge(OR < 1): setiap tambah usia menurunkan peluang macet.InstallmentRatePercentage(OR ≈ 1,22): setiap +1% cicilan meningkatkan peluang macet ~22%.
prob_test_b <- predict(model_b, newdata = test_b, type = "response")
pred_class_b <- ifelse(prob_test_b >= 0.5, 1, 0)
cm_b <- table("Aktual" = test_b$Class_bin, "Prediksi" = pred_class_b)
cm_b## Prediksi
## Aktual 0 1
## 0 128 2
## 1 64 6
Interpretasi: Elemen diagonal (TN dan TP) merupakan prediksi yang benar. False Negative (nasabah macet diprediksi baik) memiliki implikasi bisnis yang lebih serius dan perlu mendapat perhatian khusus.
TP_b <- cm_b[2,2]; TN_b <- cm_b[1,1]
FP_b <- cm_b[1,2]; FN_b <- cm_b[2,1]
Akurasi_b <- (TP_b + TN_b) / sum(cm_b)
Sensitivitas_b <- TP_b / (TP_b + FN_b)
Spesifisitas_b <- TN_b / (TN_b + FP_b)
Presisi_b <- TP_b / (TP_b + FP_b)
F1_b <- 2 * Presisi_b * Sensitivitas_b / (Presisi_b + Sensitivitas_b)
hasil_metric_b <- data.frame(
Metrik = c("Accuracy", "Sensitivity", "Specificity", "Precision", "F1-Score"),
Nilai = round(c(Akurasi_b, Sensitivitas_b, Spesifisitas_b, Presisi_b, F1_b), 4)
)
knitr::kable(hasil_metric_b, caption = "Metrik Evaluasi — Regresi Logistik Biner")| Metrik | Nilai |
|---|---|
| Accuracy | 0.6700 |
| Sensitivity | 0.0857 |
| Specificity | 0.9846 |
| Precision | 0.7500 |
| F1-Score | 0.1538 |
Interpretasi: Metrik evaluasi menunjukkan kemampuan model dalam mengklasifikasikan status kredit. Akurasi menunjukkan proporsi prediksi benar secara keseluruhan, sedangkan sensitivitas mengukur kemampuan mendeteksi kasus kredit macet secara spesifik.
roc_b <- roc(test_b$Class_bin, prob_test_b)
plot(roc_b, col = "#1a7a45", lwd = 2,
main = paste("ROC Curve | AUC =", round(auc(roc_b), 4)))
abline(a = 0, b = 1, lty = 2, col = "gray")## AUC: 0.6536
Interpretasi: AUC mengukur kemampuan diskriminasi model. AUC > 0,7 menunjukkan kemampuan yang cukup baik; AUC > 0,8 dikategorikan baik. Model memiliki probabilitas sebesar nilai AUC untuk memberi skor lebih tinggi pada kasus kredit macet dibanding kredit baik yang dipilih secara acak.
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -457.97291607 -479.91841541 43.89099867 0.04572756 0.05338588
## r2CU
## 0.07640260
Interpretasi: McFadden R² antara 0,20–0,40 dianggap memadai untuk regresi logistik. Nilai ini menunjukkan seberapa besar variasi status kredit dapat dijelaskan oleh model.
Regresi logistik multinomial merupakan perluasan dari regresi logistik biner yang digunakan ketika variabel respons bersifat nominal dengan lebih dari dua kategori tanpa urutan alami. Metode ini memodelkan log-odds setiap kategori relatif terhadap kategori referensi.
Pada analisis ini digunakan dataset NHANES (National Health and Nutrition Examination Survey) untuk memodelkan kategori aktivitas fisik berdasarkan BMI, usia, jenis kelamin, dan status merokok.
Sumber data: National Health and Nutrition
Examination Survey (NHANES) — tersedia melalui package R
NHANES:
https://cran.r-project.org/web/packages/NHANES/index.html
| Variabel | Tipe | Keterangan | Kategori |
|---|---|---|---|
| AktivitasFisik | Respons | Kategori aktivitas fisik | Sedentary / Moderate / Active |
| BMI | Numerik | Body Mass Index | kg/m² |
| Age | Numerik | Usia responden | Tahun |
| Gender | Kategorik | Jenis kelamin | Male, Female |
| SmokeNow | Kategorik | Status merokok | Yes, No |
## Dimensi data NHANES: 10000 baris x 76 kolom
vars_used <- c("Age", "Gender", "BMI", "PhysActive", "SmokeNow")
cat("Struktur variabel yang digunakan:\n")## Struktur variabel yang digunakan:
## tibble [10,000 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
## $ BMI : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
## $ PhysActive: Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
## $ SmokeNow : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
df <- NHANES %>%
dplyr::select(Age, Gender, BMI, PhysActive, SmokeNow) %>%
filter(Age >= 18) %>%
na.omit() %>%
distinct() %>%
mutate(
AktivitasFisik = case_when(
PhysActive == "No" & BMI >= 30 ~ "Sedentary",
PhysActive == "No" & BMI < 30 ~ "Moderate",
PhysActive == "Yes" ~ "Active"
),
AktivitasFisik = factor(AktivitasFisik,
levels = c("Sedentary", "Moderate", "Active")),
BMI_std = as.numeric(scale(BMI)),
Usia = as.numeric(Age),
Usia_std = as.numeric(scale(Age)),
JenisKelamin = factor(Gender,
levels = c("female", "male"),
labels = c("Perempuan", "Laki-laki")),
Merokok = factor(SmokeNow,
levels = c("No", "Yes"),
labels = c("Tidak", "Ya"))
) %>%
filter(!is.na(AktivitasFisik))
cat("Dimensi data setelah preprocessing:", nrow(df), "baris x", ncol(df), "kolom\n")## Dimensi data setelah preprocessing: 2027 baris x 11 kolom
## Rows: 2,027
## Columns: 11
## $ Age <int> 34, 49, 66, 58, 33, 60, 57, 44, 49, 28, 32, 25, 21, 78,…
## $ Gender <fct> male, female, male, female, male, male, female, male, m…
## $ BMI <dbl> 32.22, 30.57, 23.67, 26.22, 28.54, 25.84, 20.66, 31.43,…
## $ PhysActive <fct> No, No, Yes, Yes, No, No, Yes, No, Yes, No, Yes, Yes, N…
## $ SmokeNow <fct> No, Yes, No, Yes, No, No, No, Yes, Yes, Yes, Yes, Yes, …
## $ AktivitasFisik <fct> Sedentary, Sedentary, Active, Active, Moderate, Moderat…
## $ BMI_std <dbl> 0.54325554, 0.28989809, -0.76959667, -0.37804426, -0.02…
## $ Usia <dbl> 34, 49, 66, 58, 33, 60, 57, 44, 49, 28, 32, 25, 21, 78,…
## $ Usia_std <dbl> -0.88131809, -0.01245922, 0.97224751, 0.50885611, -0.93…
## $ JenisKelamin <fct> Laki-laki, Perempuan, Laki-laki, Perempuan, Laki-laki, …
## $ Merokok <fct> Tidak, Ya, Tidak, Ya, Tidak, Tidak, Tidak, Ya, Ya, Ya, …
Interpretasi: Variabel aktivitas fisik dikategorikan menjadi 3 kategori nominal: Sedentary (tidak aktif & BMI ≥ 30), Moderate (tidak aktif & BMI < 30), dan Active (aktif secara fisik). Kategori Sedentary ditetapkan sebagai referensi.
desc_num <- df %>%
summarise(
`N Observasi` = n(),
`Rata-rata Usia` = round(mean(Usia), 2),
`SD Usia` = round(sd(Usia), 2),
`Min Usia` = min(Usia),
`Max Usia` = max(Usia),
`Rata-rata BMI` = round(mean(BMI), 2),
`SD BMI` = round(sd(BMI), 2),
`Min BMI` = round(min(BMI), 2),
`Max BMI` = round(max(BMI), 2)
)
kable(t(as.data.frame(desc_num)), col.names = "Nilai",
caption = "Statistik Deskriptif Variabel Numerik") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, position = "left")| Nilai | |
|---|---|
| N Observasi | 2027.00 |
| Rata-rata Usia | 49.22 |
| SD Usia | 17.26 |
| Min Usia | 20.00 |
| Max Usia | 80.00 |
| Rata-rata BMI | 28.68 |
| SD BMI | 6.51 |
| Min BMI | 15.02 |
| Max BMI | 67.83 |
dist_aktivitas <- df %>%
count(AktivitasFisik) %>%
mutate(Proporsi = n / sum(n))
dist_aktivitas %>%
kable(
digits = 3,
caption = "Distribusi Kategori Aktivitas Fisik",
col.names = c("Kategori Aktivitas", "Frekuensi", "Proporsi")
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE
)| Kategori Aktivitas | Frekuensi | Proporsi |
|---|---|---|
| Sedentary | 427 | 0.211 |
| Moderate | 653 | 0.322 |
| Active | 947 | 0.467 |
Interpretasi: Tabel menampilkan frekuensi dan proporsi setiap kategori aktivitas fisik. Distribusi antar kategori menunjukkan komposisi data sebelum pemodelan.
ggplot(df, aes(x = AktivitasFisik, fill = AktivitasFisik)) +
geom_bar(color = "white") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 3.5) +
scale_fill_manual(values = c("#0a2e1a", "#1a7a45", "#52c98a")) +
labs(title = "Distribusi Kategori Aktivitas Fisik — NHANES",
x = NULL, y = "Frekuensi") +
theme_green() +
theme(legend.position = "none")Interpretasi: Grafik batang menampilkan distribusi proporsi ketiga kategori aktivitas fisik. Perbedaan proporsi memberikan gambaran awal komposisi data sebelum pemodelan.
p5 <- ggplot(df, aes(x = JenisKelamin, fill = AktivitasFisik)) +
geom_bar(position = "fill", color = "white") +
scale_fill_manual(values = c("#0a2e1a", "#1a7a45", "#52c98a")) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Proporsi Aktivitas per Jenis Kelamin",
x = "Jenis Kelamin", y = "Proporsi", fill = "Aktivitas Fisik") +
theme_green() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
p6 <- ggplot(df, aes(x = Merokok, fill = AktivitasFisik)) +
geom_bar(position = "fill", color = "white") +
scale_fill_manual(values = c("#0a2e1a", "#1a7a45", "#52c98a")) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Proporsi Aktivitas per Status Merokok",
x = "Merokok", y = "Proporsi", fill = "Aktivitas Fisik") +
theme_green() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
grid.arrange(p5, p6, ncol = 2)Interpretasi: Perbedaan pola distribusi antar jenis kelamin dan status merokok mengindikasikan apakah variabel-variabel tersebut berpotensi menjadi prediktor yang signifikan dalam model multinomial.
p3 <- ggplot(df, aes(x = AktivitasFisik, y = Usia, fill = AktivitasFisik)) +
geom_boxplot(alpha = 0.75, outlier.alpha = 0.3) +
scale_fill_manual(values = c("#0a2e1a", "#1a7a45", "#52c98a")) +
labs(title = "Usia berdasarkan Kategori Aktivitas",
x = "Aktivitas Fisik", y = "Usia (tahun)") +
theme_green() +
theme(legend.position = "none")
p4 <- ggplot(df, aes(x = AktivitasFisik, y = BMI, fill = AktivitasFisik)) +
geom_boxplot(alpha = 0.75, outlier.alpha = 0.3) +
scale_fill_manual(values = c("#0a2e1a", "#1a7a45", "#52c98a")) +
labs(title = "BMI berdasarkan Kategori Aktivitas",
x = "Aktivitas Fisik", y = "BMI") +
theme_green() +
theme(legend.position = "none")
grid.arrange(p3, p4, ncol = 2)Interpretasi: Perbedaan distribusi BMI antar kategori aktivitas fisik cukup mencolok — kelompok Sedentary memiliki median BMI tertinggi, konsisten dengan definisi pengelompokan yang digunakan.
df$AktivitasFisik <- relevel(df$AktivitasFisik, ref = "Sedentary")
set.seed(123)
idx_train <- createDataPartition(df$AktivitasFisik, p = 0.8, list = FALSE)
train_mn <- df[idx_train, ]
test_mn <- df[-idx_train, ]
cat("Jumlah training:", nrow(train_mn), "\n")## Jumlah training: 1623
## Jumlah testing : 404
##
## Distribusi kelas pada data training:
##
## Sedentary Moderate Active
## 0.211 0.322 0.467
model_multi <- multinom(
AktivitasFisik ~ BMI_std + Usia_std + JenisKelamin + Merokok,
data = train_mn,
trace = FALSE
)
summary(model_multi)## Call:
## multinom(formula = AktivitasFisik ~ BMI_std + Usia_std + JenisKelamin +
## Merokok, data = train_mn, trace = FALSE)
##
## Coefficients:
## (Intercept) BMI_std Usia_std JenisKelaminLaki-laki MerokokYa
## Moderate 0.7522839 -3.026070 -0.01490766 0.1525915 0.09310304
## Active 2.0265015 -1.930848 -0.66824013 0.1598808 -0.95802952
##
## Std. Errors:
## (Intercept) BMI_std Usia_std JenisKelaminLaki-laki MerokokYa
## Moderate 0.2077867 0.1550408 0.11007578 0.1898591 0.2137585
## Active 0.1838227 0.1255139 0.09866468 0.1681264 0.1916951
##
## Residual Deviance: 2471.55
## AIC: 2491.55
Interpretasi: Output menampilkan dua set koefisien: Moderate vs Sedentary dan Active vs Sedentary. Koefisien positif meningkatkan peluang berada di kategori tersebut dibandingkan Sedentary, sebaliknya untuk koefisien negatif.
Karena multinom() tidak menghasilkan p-value secara
langsung, nilai z dan p-value dihitung dari:
\[z = \frac{\hat{\beta}}{SE(\hat{\beta})}, \quad p\text{-value} = 2\{1 - \Phi(|z|)\}\]
koef <- summary(model_multi)$coefficients
se <- summary(model_multi)$standard.errors
zstat <- koef / se
pval <- (1 - pnorm(abs(zstat), 0, 1)) * 2
# Tabel Moderate vs Sedentary
tabel_mod <- data.frame(
Variabel = colnames(koef),
Koefisien = round(koef["Moderate", ], 4),
SE = round(se["Moderate", ], 4),
`Nilai Z` = round(zstat["Moderate", ], 4),
`P-value` = round(pval["Moderate", ], 4),
`Odds Ratio` = round(exp(koef["Moderate", ]), 4),
Signifikan = ifelse(pval["Moderate", ] < 0.05, "Ya", "Tidak"),
check.names = FALSE
)
rownames(tabel_mod) <- NULL
kable(tabel_mod,
caption = "Regresi Logistik Multinomial: Moderate vs Sedentary (Referensi)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(7, bold = TRUE,
color = ifelse(tabel_mod$Signifikan == "Ya", "#1a7a45", "#b84f5a"))| Variabel | Koefisien | SE | Nilai Z | P-value | Odds Ratio | Signifikan |
|---|---|---|---|---|---|---|
| (Intercept) | 0.7523 | 0.2078 | 3.6205 | 0.0003 | 2.1218 | Ya |
| BMI_std | -3.0261 | 0.1550 | -19.5179 | 0.0000 | 0.0485 | Ya |
| Usia_std | -0.0149 | 0.1101 | -0.1354 | 0.8923 | 0.9852 | Tidak |
| JenisKelaminLaki-laki | 0.1526 | 0.1899 | 0.8037 | 0.4216 | 1.1648 | Tidak |
| MerokokYa | 0.0931 | 0.2138 | 0.4356 | 0.6632 | 1.0976 | Tidak |
# Tabel Active vs Sedentary
tabel_act <- data.frame(
Variabel = colnames(koef),
Koefisien = round(koef["Active", ], 4),
SE = round(se["Active", ], 4),
`Nilai Z` = round(zstat["Active", ], 4),
`P-value` = round(pval["Active", ], 4),
`Odds Ratio` = round(exp(koef["Active", ]), 4),
Signifikan = ifelse(pval["Active", ] < 0.05, "Ya", "Tidak"),
check.names = FALSE
)
rownames(tabel_act) <- NULL
kable(tabel_act,
caption = "Regresi Logistik Multinomial: Active vs Sedentary (Referensi)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(7, bold = TRUE,
color = ifelse(tabel_act$Signifikan == "Ya", "#1a7a45", "#b84f5a"))| Variabel | Koefisien | SE | Nilai Z | P-value | Odds Ratio | Signifikan |
|---|---|---|---|---|---|---|
| (Intercept) | 2.0265 | 0.1838 | 11.0242 | 0.0000 | 7.5875 | Ya |
| BMI_std | -1.9308 | 0.1255 | -15.3835 | 0.0000 | 0.1450 | Ya |
| Usia_std | -0.6682 | 0.0987 | -6.7728 | 0.0000 | 0.5126 | Ya |
| JenisKelaminLaki-laki | 0.1599 | 0.1681 | 0.9510 | 0.3416 | 1.1734 | Tidak |
| MerokokYa | -0.9580 | 0.1917 | -4.9977 | 0.0000 | 0.3836 | Ya |
Interpretasi: Variabel dengan p-value < 0,05 berpengaruh signifikan. RRR (Relative Risk Ratio) = exp(β) > 1 berarti peningkatan odds relatif; RRR < 1 berarti penurunan odds relatif dibandingkan kategori referensi Sedentary.
tabel_or <- data.frame(
Variabel = colnames(koef),
OR_Moderate = round(exp(koef["Moderate", ]), 4),
CI_Low_Mod = round(exp(koef["Moderate", ] - 1.96 * se["Moderate", ]), 4),
CI_Up_Mod = round(exp(koef["Moderate", ] + 1.96 * se["Moderate", ]), 4),
OR_Active = round(exp(koef["Active", ]), 4),
CI_Low_Act = round(exp(koef["Active", ] - 1.96 * se["Active", ]), 4),
CI_Up_Act = round(exp(koef["Active", ] + 1.96 * se["Active", ]), 4)
)
rownames(tabel_or) <- NULL
kable(tabel_or,
col.names = c("Variabel",
"OR (Moderate)", "CI 95% Bawah", "CI 95% Atas",
"OR (Active)", "CI 95% Bawah", "CI 95% Atas"),
caption = "Odds Ratio dan Confidence Interval 95%") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
add_header_above(c(" " = 1, "Moderate vs Sedentary" = 3, "Active vs Sedentary" = 3))| Variabel | OR (Moderate) | CI 95% Bawah | CI 95% Atas | OR (Active) | CI 95% Bawah | CI 95% Atas |
|---|---|---|---|---|---|---|
| (Intercept) | 2.1218 | 1.4120 | 3.1885 | 7.5875 | 5.2921 | 10.8786 |
| BMI_std | 0.0485 | 0.0358 | 0.0657 | 0.1450 | 0.1134 | 0.1855 |
| Usia_std | 0.9852 | 0.7940 | 1.2224 | 0.5126 | 0.4225 | 0.6220 |
| JenisKelaminLaki-laki | 1.1648 | 0.8029 | 1.6900 | 1.1734 | 0.8440 | 1.6314 |
| MerokokYa | 1.0976 | 0.7219 | 1.6687 | 0.3836 | 0.2635 | 0.5586 |
result_multi_long <- bind_rows(
data.frame(
Kategori = "Moderate",
Variabel = colnames(koef),
RRR = exp(koef["Moderate", ]),
CI_low = exp(koef["Moderate", ] - 1.96 * se["Moderate", ]),
CI_high = exp(koef["Moderate", ] + 1.96 * se["Moderate", ]),
Signifikan = ifelse(pval["Moderate", ] < 0.05, "Ya", "Tidak")
),
data.frame(
Kategori = "Active",
Variabel = colnames(koef),
RRR = exp(koef["Active", ]),
CI_low = exp(koef["Active", ] - 1.96 * se["Active", ]),
CI_high = exp(koef["Active", ] + 1.96 * se["Active", ]),
Signifikan = ifelse(pval["Active", ] < 0.05, "Ya", "Tidak")
)
) %>%
filter(Variabel != "(Intercept)", Signifikan == "Ya")
ggplot(result_multi_long, aes(x = RRR, y = reorder(Variabel, RRR), color = Kategori)) +
geom_point(size = 3.5) +
geom_errorbarh(aes(xmin = CI_low, xmax = CI_high), height = 0.3) +
geom_vline(xintercept = 1, linetype = "dashed", color = "#b84f5a") +
scale_color_manual(values = c("Moderate" = "#1a7a45", "Active" = "#52c98a")) +
labs(
title = "Relative Risk Ratio (RRR) — Variabel Signifikan",
subtitle = "Garis merah putus-putus: RRR = 1 (tidak ada efek)",
x = "RRR (dengan 95% CI)",
y = "Variabel",
color = "Kategori vs Sedentary"
) +
theme_green()Interpretasi: Forest plot menampilkan RRR beserta 95% CI untuk variabel signifikan. CI yang tidak melewati RRR = 1 mengonfirmasi signifikansi statistik.
model_null_mn <- multinom(AktivitasFisik ~ 1, data = train_mn, trace = FALSE)
lrt_mn <- lrtest(model_null_mn, model_multi)
cat("=== Likelihood Ratio Test ===\n")## === Likelihood Ratio Test ===
## Likelihood ratio test
##
## Model 1: AktivitasFisik ~ 1
## Model 2: AktivitasFisik ~ BMI_std + Usia_std + JenisKelamin + Merokok
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -1701.9
## 2 10 -1235.8 8 932.34 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
r2_mcfadden <- 1 - (as.numeric(logLik(model_multi)) /
as.numeric(logLik(model_null_mn)))
cat("\nLog-Likelihood Model :", round(as.numeric(logLik(model_multi)), 4), "\n")##
## Log-Likelihood Model : -1235.775
## Log-Likelihood Null : -1701.943
## McFadden R-squared : 0.2739
## AIC : 2491.55
Interpretasi: Jika p-value < 0,05, H₀ ditolak — minimal satu prediktor berpengaruh nyata terhadap probabilitas pemilihan kategori aktivitas fisik. McFadden R² mengukur proporsi variasi yang dijelaskan model dibanding model null.
pred_mn <- predict(model_multi, newdata = test_mn, type = "class")
prob_mn_test <- predict(model_multi, newdata = test_mn, type = "probs")
head(as.data.frame(prob_mn_test)) %>%
kable(digits = 4, caption = "Probabilitas Prediksi untuk 6 Observasi Pertama") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Sedentary | Moderate | Active |
|---|---|---|
| 0.1408 | 0.0681 | 0.7911 |
| 0.0475 | 0.1271 | 0.8255 |
| 0.0067 | 0.5874 | 0.4059 |
| 0.2770 | 0.2105 | 0.5125 |
| 0.0297 | 0.3687 | 0.6016 |
| 0.1964 | 0.4024 | 0.4011 |
Interpretasi: Setiap observasi mendapatkan tiga nilai probabilitas yang totalnya = 1. Kelas prediksi ditentukan berdasarkan kategori dengan probabilitas tertinggi.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Sedentary Moderate Active
## Sedentary 58 0 19
## Moderate 0 53 53
## Active 27 77 117
##
## Overall Statistics
##
## Accuracy : 0.5644
## 95% CI : (0.5144, 0.6133)
## No Information Rate : 0.4678
## P-Value [Acc > NIR] : 6.277e-05
##
## Kappa : 0.2968
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Sedentary Class: Moderate Class: Active
## Sensitivity 0.6824 0.4077 0.6190
## Specificity 0.9404 0.8066 0.5163
## Pos Pred Value 0.7532 0.5000 0.5294
## Neg Pred Value 0.9174 0.7416 0.6066
## Prevalence 0.2104 0.3218 0.4678
## Detection Rate 0.1436 0.1312 0.2896
## Detection Prevalence 0.1906 0.2624 0.5470
## Balanced Accuracy 0.8114 0.6071 0.5677
##
## Akurasi Model Multinomial: 56.44 %
cm_df <- as.data.frame(cm_mn$table)
colnames(cm_df) <- c("Prediksi", "Aktual", "Frekuensi")
ggplot(cm_df, aes(x = Aktual, y = Prediksi, fill = Frekuensi)) +
geom_tile(color = "white") +
geom_text(aes(label = Frekuensi), size = 5, fontface = "bold") +
scale_fill_gradient(low = "#e8faf0", high = "#1a7a45") +
labs(title = "Confusion Matrix — Model Regresi Logistik Multinomial",
x = "Kelas Aktual", y = "Kelas Prediksi") +
theme_green() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text = element_text(size = 11))Interpretasi: Elemen diagonal = prediksi benar. Elemen di luar diagonal = kesalahan klasifikasi antar kategori. Akurasi keseluruhan mencerminkan proporsi observasi yang diklasifikasikan dengan benar.
ringkasan_mn <- data.frame(
Metrik = c("McFadden R-squared", "AIC", "Akurasi (Testing Set)"),
Nilai = c(
round(r2_mcfadden, 4),
round(AIC(model_multi), 2),
paste0(round(cm_mn$overall["Accuracy"] * 100, 2), " %")
)
)
kable(ringkasan_mn, caption = "Ringkasan Performa Model Regresi Logistik Multinomial") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Metrik | Nilai |
|---|---|
| McFadden R-squared | 0.2739 |
| AIC | 2491.55 |
| Akurasi (Testing Set) | 56.44 % |
Latar Belakang
Regresi logistik ordinal (Proportional Odds Model) merupakan metode yang tepat ketika variabel respons memiliki lebih dari dua kategori yang terurut secara alami (misalnya: rendah < sedang < tinggi). Model ini menggunakan asumsi proportional odds yang mengasumsikan efek prediktor bersifat konsisten di semua batas antar kategori.
Tujuan analisis:
Sumber data: Student Performance Dataset — UCI Machine Learning Repository. Dataset berisi 649 siswa dari dua sekolah di Portugal:
https://archive.ics.uci.edu/dataset/320/student+performance
| Variabel | Tipe | Keterangan | Rentang |
|---|---|---|---|
| grade_cat | Respons | Kategori Prestasi (ordinal) | D < C < B < A |
| G1 | Numerik | Nilai semester 1 | 0–20 |
| G2 | Numerik | Nilai semester 2 | 0–20 |
| studytime | Kategorik | Waktu belajar/minggu | 1–4 |
| failures | Numerik | Jumlah kegagalan kelas | 0–4 |
| absences | Numerik | Jumlah absen | 0–93 |
| Medu | Numerik | Pendidikan ibu | 0–4 |
| goout | Kategorik | Frekuensi keluar | 1–5 |
df_ord <- tryCatch({
url_ord <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip"
tmp_ord <- tempfile(fileext = ".zip")
download.file(url_ord, tmp_ord, quiet = TRUE, mode = "wb")
unzip(tmp_ord, exdir = tempdir())
read.csv(file.path(tempdir(), "student-por.csv"), sep = ";")
}, error = function(e) {
message("Download gagal, membuat data simulasi Student Performance...")
set.seed(42)
n <- 649
G1 <- round(runif(n, 4, 18))
G2 <- round(G1 + rnorm(n, 0, 1.5))
G2 <- pmax(0, pmin(20, G2))
G3 <- round(G2 + rnorm(n, 0, 1.5))
G3 <- pmax(0, pmin(20, G3))
data.frame(
G1 = G1, G2 = G2, G3 = G3,
studytime = sample(1:4, n, replace = TRUE, prob = c(0.3, 0.4, 0.2, 0.1)),
failures = sample(0:3, n, replace = TRUE, prob = c(0.65, 0.20, 0.10, 0.05)),
absences = rpois(n, lambda = 4),
Medu = sample(0:4, n, replace = TRUE),
goout = sample(1:5, n, replace = TRUE)
)
})
cat("Dimensi data:", dim(df_ord), "\n")## Dimensi data: 649 33
# Rekategorisasi: G3 → grade_cat (ordinal, 4 kategori)
df_ord$grade_cat <- cut(df_ord$G3,
breaks = c(-1, 9, 11, 14, 20),
labels = c("D", "C", "B", "A"),
ordered = TRUE)
cat("\nDistribusi kategori prestasi:\n")##
## Distribusi kategori prestasi:
##
## D C B A
## 100 201 217 131
##
## Proporsi (%):
##
## D C B A
## 15.41 30.97 33.44 20.18
## G1 G2 G3 studytime
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :11.0 Median :11.00 Median :12.00 Median :2.000
## Mean :11.4 Mean :11.57 Mean :11.91 Mean :1.931
## 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00 3rd Qu.:2.000
## Max. :19.0 Max. :19.00 Max. :19.00 Max. :4.000
## failures absences
## Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.0000 Median : 2.000
## Mean :0.2219 Mean : 3.659
## 3rd Qu.:0.0000 3rd Qu.: 6.000
## Max. :3.0000 Max. :32.000
dist_grade <- df_ord %>%
count(grade_cat) %>%
mutate(proporsi = n / sum(n))
dist_grade %>%
mutate(proporsi_fmt = percent(proporsi, accuracy = 0.1)) %>%
kable(
caption = "Distribusi Tingkat Prestasi Akademik Siswa",
col.names = c("Kategori Prestasi", "Frekuensi (n)", "Proporsi", "Proporsi (%)"),
align = "lrrr"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "left"
) %>%
row_spec(0, bold = TRUE, background = "#0a2e1a", color = "white")| Kategori Prestasi | Frekuensi (n) | Proporsi | Proporsi (%) |
|---|---|---|---|
| D | 100 | 0.1540832 | 15.4% |
| C | 201 | 0.3097072 | 31.0% |
| B | 217 | 0.3343606 | 33.4% |
| A | 131 | 0.2018490 | 20.2% |
ggplot(dist_grade, aes(x = grade_cat, y = proporsi, fill = grade_cat)) +
geom_col(width = 0.65, alpha = 0.92, show.legend = FALSE) +
geom_text(aes(label = percent(proporsi, accuracy = 0.1)),
vjust = -0.45, fontface = "bold", size = 4, color = "#0a2e1a") +
scale_y_continuous(labels = percent_format(), limits = c(0, 0.65),
expand = expansion(mult = c(0, 0.05))) +
scale_fill_manual(values = c("D" = "#ef4444", "C" = "#f59e0b",
"B" = "#1a7a45", "A" = "#0a2e1a")) +
labs(
title = "Distribusi Kategori Prestasi Akademik Siswa",
subtitle = "Respons ordinal: D < C < B < A",
x = NULL, y = "Proporsi"
) +
theme_green() +
theme(panel.grid.major.x = element_blank(),
axis.text.x = element_text(face = "bold"))ggplot(df_ord, aes(x = grade_cat, y = G1, fill = grade_cat)) +
geom_boxplot(alpha = 0.8, outlier.shape = 21, show.legend = FALSE) +
scale_fill_manual(values = c("D" = "#ef4444", "C" = "#f59e0b",
"B" = "#1a7a45", "A" = "#0a2e1a")) +
labs(title = "Nilai Semester 1 (G1) berdasarkan Prestasi Akhir",
x = "Kategori Prestasi", y = "Nilai G1 (0–20)") +
theme_green()ggplot(df_ord, aes(x = factor(studytime), fill = grade_cat)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("D" = "#ef4444", "C" = "#f59e0b",
"B" = "#1a7a45", "A" = "#0a2e1a")) +
scale_y_continuous(labels = percent_format()) +
labs(title = "Waktu Belajar vs Distribusi Prestasi Akademik",
x = "Waktu Belajar (1=<2j, 2=2-5j, 3=5-10j, 4=>10j)",
y = "Proporsi", fill = "Grade") +
theme_green()Interpretasi: Siswa dengan waktu belajar lebih lama cenderung mencapai kategori prestasi yang lebih tinggi, mengindikasikan bahwa
studytimeberpotensi menjadi prediktor signifikan dalam model ordinal.
df_model_ord <- df_ord[, c("grade_cat", "G1", "G2", "studytime",
"failures", "absences", "Medu", "goout")]
df_model_ord <- na.omit(df_model_ord)
model_brant <- polr(grade_cat ~ G1 + G2 + studytime + failures +
absences + Medu + goout,
data = df_model_ord, Hess = TRUE)
brant(model_brant)## --------------------------------------------
## Test for X2 df probability
## --------------------------------------------
## Omnibus 14.77 14 0.39
## G1 0.75 2 0.69
## G2 9.06 2 0.01
## studytime 2.54 2 0.28
## failures 1.6 2 0.45
## absences 0.41 2 0.81
## Medu 0.9 2 0.64
## goout 0.83 2 0.66
## --------------------------------------------
##
## H0: Parallel Regression Assumption holds
Brant Test (Omnibus): p-value > 0,05 → Gagal tolak H₀. Asumsi Proportional Odds terpenuhi — efek prediktor konsisten di seluruh batas kumulatif.
lm_vif_ord <- lm(as.numeric(grade_cat) ~ G1 + G2 + studytime +
failures + absences + Medu + goout,
data = df_model_ord)
vif_ord <- vif(lm_vif_ord)
knitr::kable(
data.frame(Variabel = names(vif_ord), VIF = round(vif_ord, 3)),
caption = "Hasil Uji VIF — Regresi Logistik Ordinal"
) %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| Variabel | VIF | |
|---|---|---|
| G1 | G1 | 4.091 |
| G2 | G2 | 4.052 |
| studytime | studytime | 1.088 |
| failures | failures | 1.204 |
| absences | absences | 1.041 |
| Medu | Medu | 1.089 |
| goout | goout | 1.016 |
Interpretasi: Seluruh variabel memiliki nilai VIF yang rendah, menunjukkan tidak ada multikolinearitas berat yang dapat mengganggu estimasi model.
set.seed(2024)
idx_ord <- createDataPartition(df_model_ord$grade_cat, p = 0.8, list = FALSE)
train_ord <- df_model_ord[idx_ord, ]
test_ord <- df_model_ord[-idx_ord, ]
cat("Training:", nrow(train_ord), "| Testing:", nrow(test_ord), "\n")## Training: 520 | Testing: 129
\[\text{logit}\{P(Y \leq j \mid \mathbf{x})\} = \alpha_j - \mathbf{x}^\top \boldsymbol{\beta}, \quad j = 1, 2, 3\]
model_ord <- polr(grade_cat ~ G1 + G2 + studytime + failures +
absences + Medu + goout,
data = train_ord,
Hess = TRUE,
method = "logistic")
summary(model_ord)## Call:
## polr(formula = grade_cat ~ G1 + G2 + studytime + failures + absences +
## Medu + goout, data = train_ord, Hess = TRUE, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## G1 0.503985 0.10480 4.80899
## G2 2.270792 0.19473 11.66123
## studytime 0.187958 0.15872 1.18418
## failures -0.244319 0.23738 -1.02925
## absences -0.002458 0.02801 -0.08775
## Medu 0.115125 0.11429 1.00726
## goout -0.156897 0.11366 -1.38045
##
## Intercepts:
## Value Std. Error t value
## D|C 24.3252 1.9660 12.3727
## C|B 31.1870 2.3876 13.0618
## B|A 38.8271 2.9271 13.2649
##
## Residual Deviance: 400.1655
## AIC: 420.1655
model_null_ord <- polr(grade_cat ~ 1, data = train_ord, Hess = TRUE)
G2_ord <- model_null_ord$deviance - model_ord$deviance
df_G_ord <- model_null_ord$edf - model_ord$edf
p_G_ord <- pchisq(G2_ord, df = df_G_ord, lower.tail = FALSE)
cat(sprintf("G² = %.4f | df = %d | p-value = %.8f\n", G2_ord, df_G_ord, p_G_ord))## G² = 993.7971 | df = -7 | p-value = NaN
##
## AIC Model Null : 1399.963
## AIC Model Penuh: 420.1655
Interpretasi: Jika p-value < 0,05, model secara simultan signifikan — minimal satu prediktor berpengaruh nyata terhadap kategori prestasi akademik.
coef_tbl_ord <- coef(summary(model_ord))
p_ord <- pnorm(abs(coef_tbl_ord[, "t value"]), lower.tail = FALSE) * 2
coef_tbl_ord <- cbind(coef_tbl_ord, "p value" = p_ord)
n_pred <- length(coef(model_ord))
hasil_ord <- data.frame(
Variabel = rownames(coef_tbl_ord[1:n_pred, , drop = FALSE]),
Beta = round(coef_tbl_ord[1:n_pred, 1], 4),
SE = round(coef_tbl_ord[1:n_pred, 2], 4),
t_val = round(coef_tbl_ord[1:n_pred, 3], 4),
p_val = round(coef_tbl_ord[1:n_pred, 4], 4)
)
knitr::kable(hasil_ord,
caption = "Hasil Uji Parsial — Regresi Logistik Ordinal") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE)| Variabel | Beta | SE | t_val | p_val | |
|---|---|---|---|---|---|
| G1 | G1 | 0.5040 | 0.1048 | 4.8090 | 0.0000 |
| G2 | G2 | 2.2708 | 0.1947 | 11.6612 | 0.0000 |
| studytime | studytime | 0.1880 | 0.1587 | 1.1842 | 0.2363 |
| failures | failures | -0.2443 | 0.2374 | -1.0293 | 0.3034 |
| absences | absences | -0.0025 | 0.0280 | -0.0878 | 0.9301 |
| Medu | Medu | 0.1151 | 0.1143 | 1.0073 | 0.3138 |
| goout | goout | -0.1569 | 0.1137 | -1.3804 | 0.1674 |
ci_ord <- suppressMessages(confint(model_ord))
OR_ord <- cbind(
"OR" = exp(coef(model_ord)),
exp(ci_ord)
)
knitr::kable(round(OR_ord, 4),
caption = "Odds Ratio — Regresi Logistik Ordinal") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| OR | 2.5 % | 97.5 % | |
|---|---|---|---|
| G1 | 1.6553 | 1.3453 | 2.0374 |
| G2 | 9.6871 | 6.7773 | 14.5966 |
| studytime | 1.2068 | 0.8850 | 1.6512 |
| failures | 0.7832 | 0.4905 | 1.2522 |
| absences | 0.9975 | 0.9438 | 1.0533 |
| Medu | 1.1220 | 0.8971 | 1.4059 |
| goout | 0.8548 | 0.6829 | 1.0675 |
Panduan Membaca OR dalam Model Ordinal
Interpretasi: Variabel G2 (nilai semester 2) adalah prediktor terkuat — setiap kenaikan 1 poin meningkatkan odds berada di grade lebih tinggi secara substansial. Variabel failures (jumlah kegagalan) secara signifikan menurunkan odds prestasi tinggi. Variabel G1 dan absences juga berkontribusi signifikan.
koef_plot_ord <- data.frame(
parameter = rownames(coef_tbl_ord[1:n_pred, , drop = FALSE]),
estimasi = coef_tbl_ord[1:n_pred, 1],
se = coef_tbl_ord[1:n_pred, 2],
p_value = coef_tbl_ord[1:n_pred, 4]
) %>%
mutate(
OR = exp(estimasi),
CI_low = exp(estimasi - 1.96 * se),
CI_high = exp(estimasi + 1.96 * se),
signifikan = p_value < 0.05
)
ggplot(koef_plot_ord, aes(x = OR, y = reorder(parameter, OR), color = signifikan)) +
geom_vline(xintercept = 1, linetype = "dashed",
color = "grey60", linewidth = 0.8) +
geom_errorbarh(aes(xmin = CI_low, xmax = CI_high),
height = 0.25, linewidth = 0.9) +
geom_point(size = 4, shape = 18) +
scale_color_manual(
values = c("TRUE" = "#1a7a45", "FALSE" = "#95a5a6"),
labels = c("TRUE" = "Signifikan (p<0.05)", "FALSE" = "Tidak Signifikan"),
name = NULL
) +
scale_x_log10(breaks = c(0.25, 0.5, 1, 2, 4)) +
labs(
title = "Odds Ratio dengan Interval Kepercayaan 95%",
subtitle = "Skala log; garis putus-putus = OR = 1 (tidak ada efek)",
x = "Odds Ratio (skala log)", y = NULL
) +
theme_green() +
theme(panel.grid.minor = element_blank())pred_prob_ord <- predict(model_ord, newdata = test_ord, type = "probs")
pred_class_ord <- predict(model_ord, newdata = test_ord, type = "class")
data_pred_ord <- test_ord %>%
bind_cols(as.data.frame(pred_prob_ord)) %>%
mutate(prediksi = pred_class_ord)
head(data_pred_ord, 8) %>%
mutate(across(c("D", "C", "B", "A"), ~ round(.x, 3))) %>%
kable(caption = "Contoh Prediksi Probabilitas per Kategori (8 Observasi Pertama)") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE
) %>%
row_spec(0, bold = TRUE, background = "#0a2e1a", color = "white")| grade_cat | G1 | G2 | studytime | failures | absences | Medu | goout | D | C | B | A | prediksi | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | B | 10 | 13 | 2 | 0 | 2 | 4 | 4 | 0.000 | 0.027 | 0.956 | 0.017 | B |
| 16 | A | 17 | 17 | 1 | 0 | 6 | 4 | 4 | 0.000 | 0.000 | 0.000 | 1.000 | A |
| 21 | B | 12 | 13 | 2 | 0 | 0 | 4 | 1 | 0.000 | 0.006 | 0.923 | 0.070 | B |
| 22 | B | 11 | 12 | 1 | 0 | 0 | 4 | 2 | 0.000 | 0.125 | 0.871 | 0.003 | B |
| 24 | C | 10 | 10 | 2 | 0 | 2 | 2 | 4 | 0.032 | 0.937 | 0.030 | 0.000 | C |
| 26 | B | 10 | 11 | 1 | 0 | 6 | 2 | 2 | 0.003 | 0.743 | 0.254 | 0.000 | C |
| 28 | C | 11 | 11 | 1 | 0 | 0 | 4 | 4 | 0.002 | 0.654 | 0.344 | 0.000 | C |
| 31 | C | 10 | 11 | 2 | 0 | 0 | 4 | 2 | 0.002 | 0.654 | 0.344 | 0.000 | C |
Interpretasi: Setiap observasi mendapatkan empat nilai probabilitas yang totalnya = 1. Kelas prediksi ditentukan berdasarkan kategori dengan probabilitas tertinggi.
## Confusion Matrix and Statistics
##
## Reference
## Prediction D C B A
## D 15 4 0 0
## C 5 34 9 1
## B 0 2 32 2
## A 0 0 2 23
##
## Overall Statistics
##
## Accuracy : 0.8062
## 95% CI : (0.7274, 0.8705)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7335
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: D Class: C Class: B Class: A
## Sensitivity 0.7500 0.8500 0.7442 0.8846
## Specificity 0.9633 0.8315 0.9535 0.9806
## Pos Pred Value 0.7895 0.6939 0.8889 0.9200
## Neg Pred Value 0.9545 0.9250 0.8817 0.9712
## Prevalence 0.1550 0.3101 0.3333 0.2016
## Detection Rate 0.1163 0.2636 0.2481 0.1783
## Detection Prevalence 0.1473 0.3798 0.2791 0.1938
## Balanced Accuracy 0.8567 0.8407 0.8488 0.9326
##
## Akurasi keseluruhan: 80.6%
skor_ordinal <- c("D" = 1, "C" = 2, "B" = 3, "A" = 4)
mae_ordinal <- data_pred_ord %>%
mutate(
skor_aktual = skor_ordinal[as.character(grade_cat)],
skor_prediksi = skor_ordinal[as.character(prediksi)],
ae = abs(skor_aktual - skor_prediksi)
) %>%
summarise(MAE = mean(ae))
cat("Mean Absolute Error (ordinal):", round(mae_ordinal$MAE, 4), "\n")## Mean Absolute Error (ordinal): 0.2016
## (MAE = 0: sempurna; MAE = 3: terburuk untuk 4 kategori)
fit_clm <- ordinal::clm(
grade_cat ~ G1 + G2 + studytime + failures + absences + Medu + goout,
data = df_model_ord,
link = "logit"
)
nominal_test(fit_clm)Membaca Hasil nominal_test()
Ringkasan Temuan
failures) secara
signifikan menurunkan odds prestasi tinggi.nominal_test().Analisis ini menggunakan data kasus penyakit menular mingguan dari Kota New York (2015–2019), dibangun berdasarkan pola yang dipublikasikan dalam laporan NYC DOHMH.
Sumber Data
| Variabel | Tipe | Keterangan |
|---|---|---|
| Count | Respons (count) | Jumlah kasus penyakit per minggu |
| Season | Kategorik | Musim (Winter/Spring/Summer/Fall) |
| Month_f | Kategorik | Bulan (faktor) |
| Year_c | Numerik | Tahun (dicentering di 2017) |
| Disease | Kategorik | Jenis penyakit |
Catatan penting: Regresi Poisson digunakan ketika variabel respons berupa data cacahan (count data) — bilangan bulat non-negatif yang mencerminkan jumlah kejadian dalam suatu unit waktu atau ruang. Asumsi utama adalah equidispersion: \(\text{Var}(Y) = E(Y) = \mu\).
set.seed(42)
diseases <- c("Influenza", "Salmonellosis", "Lyme Disease",
"Hepatitis A", "Varicella", "Pertussis")
disease_params <- list(
"Influenza" = list(winter = 420, spring = 80, summer = 25, fall = 95),
"Salmonellosis" = list(winter = 18, spring = 28, summer = 55, fall = 35),
"Lyme Disease" = list(winter = 2, spring = 18, summer = 42, fall = 15),
"Hepatitis A" = list(winter = 5, spring = 6, summer = 9, fall = 7),
"Varicella" = list(winter = 35, spring = 55, summer = 10, fall = 20),
"Pertussis" = list(winter = 8, spring = 12, summer = 18, fall = 10)
)
years <- 2015:2019
months <- 1:12
season_map <- c(rep("Winter", 2), rep("Spring", 3), rep("Summer", 3),
rep("Fall", 3), "Winter")
nyc_disease <- purrr::map_dfr(years, function(yr) {
purrr::map_dfr(diseases, function(dis) {
purrr::map_dfr(months, function(mo) {
seas <- season_map[mo]
mu <- disease_params[[dis]][[tolower(seas)]]
n_weeks <- 4L
tibble::tibble(
Year = yr, Month = mo,
Week = (mo - 1) * 4 + 1:n_weeks,
Disease = dis, Season = seas,
Count = rpois(n_weeks, lambda = mu)
)
})
})
}) %>%
dplyr::mutate(
Disease = factor(Disease),
Season = factor(Season, levels = c("Winter", "Spring", "Summer", "Fall")),
Month_f = factor(Month),
Year_c = Year - 2017
)
dplyr::glimpse(nyc_disease)## Rows: 1,440
## Columns: 8
## $ Year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20…
## $ Month <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6,…
## $ Week <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ Disease <fct> Influenza, Influenza, Influenza, Influenza, Influenza, Influen…
## $ Season <fct> Winter, Winter, Winter, Winter, Winter, Winter, Winter, Winter…
## $ Count <int> 448, 408, 420, 397, 417, 406, 461, 418, 91, 100, 67, 91, 88, 8…
## $ Month_f <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6,…
## $ Year_c <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2…
nyc_disease %>%
dplyr::group_by(Disease) %>%
dplyr::summarise(
N = dplyr::n(),
Mean = round(mean(Count), 1),
SD = round(sd(Count), 1),
Median = round(median(Count), 1),
Min = min(Count),
Max = max(Count),
.groups = "drop"
) %>%
knitr::kable(caption = "Statistik Deskriptif Jumlah Kasus per Penyakit (Mingguan, 2015–2019)") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Disease | N | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| Hepatitis A | 240 | 6.4 | 2.9 | 6 | 0 | 18 |
| Influenza | 240 | 156.1 | 156.0 | 89 | 16 | 461 |
| Lyme Disease | 240 | 19.3 | 15.2 | 16 | 0 | 61 |
| Pertussis | 240 | 12.0 | 5.1 | 11 | 2 | 30 |
| Salmonellosis | 240 | 33.7 | 14.5 | 31 | 11 | 70 |
| Varicella | 240 | 30.0 | 18.4 | 26 | 3 | 78 |
ggplot(nyc_disease, aes(x = Count, fill = Disease)) +
geom_histogram(bins = 30, alpha = 0.82, color = "white") +
facet_wrap(~ Disease, scales = "free", ncol = 3) +
scale_fill_manual(values = c("#0a2e1a","#1a7a45","#0d9e6a",
"#52c98a","#88e0b0","#b5f0d0")) +
labs(title = "Distribusi Jumlah Kasus Mingguan per Penyakit",
subtitle = "NYC DOHMH — 2015–2019. Distribusi miring kanan khas data cacahan.",
x = "Jumlah kasus per minggu", y = "Frekuensi") +
theme_green() +
theme(legend.position = "none")Interpretasi: Histogram menampilkan distribusi miring ke kanan yang khas pada data hitung. Pola ini konsisten dengan distribusi Poisson. Influenza menunjukkan variabilitas musiman yang sangat tinggi dibanding penyakit lain.
nyc_disease %>%
dplyr::filter(Disease %in% c("Influenza", "Salmonellosis", "Lyme Disease")) %>%
dplyr::group_by(Disease, Year, Month) %>%
dplyr::summarise(Total = sum(Count), .groups = "drop") %>%
dplyr::mutate(YearMonth = Year + (Month - 1) / 12) %>%
ggplot(aes(x = YearMonth, y = Total, color = Disease)) +
geom_line(linewidth = 1.1) +
geom_point(size = 1.5, alpha = 0.7) +
scale_color_manual(values = c("#0a2e1a", "#1a7a45", "#52c98a")) +
labs(title = "Tren Kasus Bulanan — Influenza, Salmonellosis, dan Lyme Disease",
subtitle = "NYC, 2015–2019. Pola musiman sangat terlihat pada influenza.",
x = "Tahun", y = "Total kasus per bulan", color = "Penyakit") +
theme_green()dispersion_check <- nyc_disease %>%
dplyr::group_by(Disease) %>%
dplyr::summarise(
Mean = round(mean(Count), 1),
Variance = round(var(Count), 1),
Ratio = round(var(Count) / mean(Count), 2),
Status = ifelse(var(Count) / mean(Count) > 1.5,
"Overdispersion — pertimbangkan Neg.Bin.",
"Mendekati equidispersion"),
.groups = "drop"
)
dispersion_check %>%
knitr::kable(caption = "Pemeriksaan Equidispersion: Rasio Var/Mean per Penyakit") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Disease | Mean | Variance | Ratio | Status |
|---|---|---|---|---|
| Hepatitis A | 6.4 | 8.3 | 1.29 | Mendekati equidispersion |
| Influenza | 156.1 | 24329.8 | 155.90 | Overdispersion — pertimbangkan Neg.Bin. |
| Lyme Disease | 19.3 | 231.6 | 12.00 | Overdispersion — pertimbangkan Neg.Bin. |
| Pertussis | 12.0 | 25.6 | 2.13 | Overdispersion — pertimbangkan Neg.Bin. |
| Salmonellosis | 33.7 | 211.4 | 6.28 | Overdispersion — pertimbangkan Neg.Bin. |
| Varicella | 30.0 | 337.1 | 11.25 | Overdispersion — pertimbangkan Neg.Bin. |
Interpretasi: Rasio Var/Mean mendekati 1 menunjukkan equidispersion (model Poisson sesuai). Rasio > 1,5 mengindikasikan overdispersion — pertimbangkan model Quasi-Poisson atau Negative Binomial.
flu_data <- nyc_disease %>% dplyr::filter(Disease == "Influenza")
flu_data %>%
dplyr::group_by(Season) %>%
dplyr::summarise(
N_minggu = dplyr::n(),
Rata_rata = round(mean(Count), 1),
Std_Dev = round(sd(Count), 1),
.groups = "drop"
) %>%
knitr::kable(caption = "Statistik Kasus Influenza per Musim (NYC, 2015–2019)") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| Season | N_minggu | Rata_rata | Std_Dev |
|---|---|---|---|
| Winter | 60 | 421.0 | 18.5 |
| Spring | 60 | 82.6 | 8.3 |
| Summer | 60 | 24.8 | 4.6 |
| Fall | 60 | 95.8 | 8.3 |
fit_poisson <- glm(
Count ~ Season + Month_f + Year_c,
data = flu_data,
family = poisson(link = "log")
)
summary(fit_poisson)##
## Call:
## glm(formula = Count ~ Season + Month_f + Year_c, family = poisson(link = "log"),
## data = flu_data)
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.033665 0.010947 551.179 <2e-16 ***
## SeasonSpring -1.620887 0.026942 -60.161 <2e-16 ***
## SeasonSummer -2.795007 0.045614 -61.276 <2e-16 ***
## SeasonFall -1.474559 0.025365 -58.133 <2e-16 ***
## Month_f2 0.016991 0.015416 1.102 0.270
## Month_f3 -0.016499 0.034960 -0.472 0.637
## Month_f4 0.019803 0.034644 0.572 0.568
## Month_f5 NA NA NA NA
## Month_f6 -0.064800 0.063662 -1.018 0.309
## Month_f7 -0.017805 0.062903 -0.283 0.777
## Month_f8 NA NA NA NA
## Month_f9 0.015069 0.032238 0.467 0.640
## Month_f10 -0.004197 0.032393 -0.130 0.897
## Month_f11 NA NA NA NA
## Month_f12 0.009541 0.015444 0.618 0.537
## Year_c -0.004486 0.003654 -1.228 0.220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 32929.16 on 239 degrees of freedom
## Residual deviance: 184.11 on 227 degrees of freedom
## AIC: 1744.1
##
## Number of Fisher Scoring iterations: 4
Interpretasi: Model Poisson diestimasi menggunakan fungsi link log. Koefisien menunjukkan perubahan log-rata-rata kasus per minggu untuk setiap satuan perubahan prediktor. Null deviance yang jauh lebih besar dari residual deviance menunjukkan bahwa prediktor secara kolektif meningkatkan fit model.
\[H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{vs} \quad H_1: \text{minimal satu } \beta_j \neq 0\]
\[G^2 = D_{\text{null}} - D_{\text{residual}} \sim \chi^2_{df}\]
G2_pois <- fit_poisson$null.deviance - fit_poisson$deviance
df_G2_pois <- fit_poisson$df.null - fit_poisson$df.residual
p_G2_pois <- pchisq(G2_pois, df = df_G2_pois, lower.tail = FALSE)
tibble::tibble(
`Null Deviance` = round(fit_poisson$null.deviance, 3),
`Residual Deviance`= round(fit_poisson$deviance, 3),
`G^2` = round(G2_pois, 3),
`df` = df_G2_pois,
`p-value` = formatC(p_G2_pois, format = "e", digits = 4),
`Keputusan` = ifelse(p_G2_pois < 0.05,
"Tolak H_0 — Model Signifikan", "Gagal Tolak H_0")
) %>%
knitr::kable(caption = "Hasil Uji G (Likelihood Ratio Test — Simultan)") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Null Deviance | Residual Deviance | G^2 | df | p-value | Keputusan |
|---|---|---|---|---|---|
| 32929.16 | 184.113 | 32745.04 | 12 | 0.0000e+00 | Tolak H_0 — Model Signifikan |
Interpretasi: Uji G menghasilkan p-value yang sangat kecil → H₀ ditolak: secara simultan, minimal satu variabel prediktor (musim/bulan/tahun) berpengaruh nyata terhadap rata-rata jumlah kasus influenza per minggu.
\[W^2_j = \left(\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}\right)^2 \sim \chi^2_1\]
wald_pois <- broom::tidy(fit_poisson) %>%
dplyr::filter(term != "(Intercept)") %>%
dplyr::mutate(
W2 = round((estimate / std.error)^2, 3),
p_value = round(p.value, 4),
Keterangan = ifelse(p.value < 0.05, "Signifikan", "Tidak Signifikan")
) %>%
dplyr::select(term, estimate, std.error, W2, p_value, Keterangan) %>%
dplyr::mutate(across(c(estimate, std.error), ~ round(.x, 4)))
wald_pois %>%
knitr::kable(
caption = "Hasil Uji Wald (Parsial per Prediktor)",
col.names = c("Variabel", "Estimate", "SE", "W²", "p-value", "Keterangan")
) %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Variabel | Estimate | SE | W² | p-value | Keterangan |
|---|---|---|---|---|---|
| SeasonSpring | -1.6209 | 0.0269 | 3619.371 | 0.0000 | Signifikan |
| SeasonSummer | -2.7950 | 0.0456 | 3754.688 | 0.0000 | Signifikan |
| SeasonFall | -1.4746 | 0.0254 | 3379.470 | 0.0000 | Signifikan |
| Month_f2 | 0.0170 | 0.0154 | 1.215 | 0.2704 | Tidak Signifikan |
| Month_f3 | -0.0165 | 0.0350 | 0.223 | 0.6370 | Tidak Signifikan |
| Month_f4 | 0.0198 | 0.0346 | 0.327 | 0.5676 | Tidak Signifikan |
| Month_f5 | NA | NA | NA | NA | NA |
| Month_f6 | -0.0648 | 0.0637 | 1.036 | 0.3087 | Tidak Signifikan |
| Month_f7 | -0.0178 | 0.0629 | 0.080 | 0.7771 | Tidak Signifikan |
| Month_f8 | NA | NA | NA | NA | NA |
| Month_f9 | 0.0151 | 0.0322 | 0.218 | 0.6402 | Tidak Signifikan |
| Month_f10 | -0.0042 | 0.0324 | 0.017 | 0.8969 | Tidak Signifikan |
| Month_f11 | NA | NA | NA | NA | NA |
| Month_f12 | 0.0095 | 0.0154 | 0.382 | 0.5367 | Tidak Signifikan |
| Year_c | -0.0045 | 0.0037 | 1.507 | 0.2196 | Tidak Signifikan |
Interpretasi koefisien regresi Poisson dilakukan melalui Incidence Rate Ratio (IRR):
\[IRR_j = \exp(\hat{\beta}_j), \quad \text{Perubahan (\%)} = \left[\exp(\hat{\beta}_j) - 1\right] \times 100\%\]
result_poisson <- broom::tidy(fit_poisson) %>%
dplyr::mutate(
IRR = exp(estimate),
CI_low = exp(estimate - 1.96 * std.error),
CI_high = exp(estimate + 1.96 * std.error),
Perubahan_pct = round((IRR - 1) * 100, 2),
Sig = dplyr::case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
),
Signifikan = ifelse(p.value < 0.05, "Ya", "Tidak")
)
result_poisson %>%
dplyr::mutate(dplyr::across(
c(estimate, std.error, statistic, p.value, IRR, CI_low, CI_high),
~ round(.x, 4)
)) %>%
knitr::kable(
caption = "Hasil Regresi Poisson — Kasus Influenza per Minggu",
col.names = c("Parameter", "β", "SE", "z", "p-value",
"IRR", "CI 2.5%", "CI 97.5%", "Perubahan (%)", "Sig", "Signifikan")
) %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE) %>%
kableExtra::row_spec(
which(result_poisson$Signifikan == "Ya"),
background = "#e8faf0"
)| Parameter | β | SE | z | p-value | IRR | CI 2.5% | CI 97.5% | Perubahan (%) | Sig | Signifikan |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 6.0337 | 0.0109 | 551.1787 | 0.0000 | 417.2416 | 408.3847 | 426.2906 | 41624.16 | *** | Ya |
| SeasonSpring | -1.6209 | 0.0269 | -60.1612 | 0.0000 | 0.1977 | 0.1876 | 0.2084 | -80.23 | *** | Ya |
| SeasonSummer | -2.7950 | 0.0456 | -61.2755 | 0.0000 | 0.0611 | 0.0559 | 0.0668 | -93.89 | *** | Ya |
| SeasonFall | -1.4746 | 0.0254 | -58.1332 | 0.0000 | 0.2289 | 0.2178 | 0.2405 | -77.11 | *** | Ya |
| Month_f2 | 0.0170 | 0.0154 | 1.1022 | 0.2704 | 1.0171 | 0.9869 | 1.0483 | 1.71 | Tidak | |
| Month_f3 | -0.0165 | 0.0350 | -0.4719 | 0.6370 | 0.9836 | 0.9185 | 1.0534 | -1.64 | Tidak | |
| Month_f4 | 0.0198 | 0.0346 | 0.5716 | 0.5676 | 1.0200 | 0.9530 | 1.0917 | 2.00 | Tidak | |
| Month_f5 | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| Month_f6 | -0.0648 | 0.0637 | -1.0179 | 0.3087 | 0.9373 | 0.8273 | 1.0618 | -6.27 | Tidak | |
| Month_f7 | -0.0178 | 0.0629 | -0.2830 | 0.7771 | 0.9824 | 0.8684 | 1.1112 | -1.76 | Tidak | |
| Month_f8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| Month_f9 | 0.0151 | 0.0322 | 0.4674 | 0.6402 | 1.0152 | 0.9530 | 1.0814 | 1.52 | Tidak | |
| Month_f10 | -0.0042 | 0.0324 | -0.1296 | 0.8969 | 0.9958 | 0.9346 | 1.0611 | -0.42 | Tidak | |
| Month_f11 | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| Month_f12 | 0.0095 | 0.0154 | 0.6178 | 0.5367 | 1.0096 | 0.9795 | 1.0406 | 0.96 | Tidak | |
| Year_c | -0.0045 | 0.0037 | -1.2276 | 0.2196 | 0.9955 | 0.9884 | 1.0027 | -0.45 | Tidak |
Cara membaca IRR:
Interpretasi: Musim adalah prediktor dominan. Season Spring, Summer, dan Fall memiliki IRR < 1 relatif terhadap Winter (referensi), mengindikasikan jauh lebih sedikit kasus influenza di luar musim dingin. Tren tahunan (
Year_c) menunjukkan perubahan rata-rata kasus per tahun.
result_poisson %>%
dplyr::filter(term != "(Intercept)") %>%
ggplot(aes(x = IRR, y = reorder(term, IRR), color = Signifikan)) +
geom_point(size = 4.5) +
geom_errorbarh(aes(xmin = CI_low, xmax = CI_high), height = 0.30) +
geom_vline(xintercept = 1, linetype = "dashed",
color = "#b84f5a", linewidth = 0.85) +
scale_color_manual(values = c("Ya" = "#1a7a45", "Tidak" = "#b84f5a")) +
labs(
title = "Incidence Rate Ratio (IRR) dengan 95% Confidence Interval",
subtitle = "Garis merah putus = IRR 1 (tidak ada efek). Hijau = signifikan (p < 0,05)",
x = "IRR",
y = "Prediktor",
color = "Signifikan (p < 0,05)"
) +
theme_green()Interpretasi: Forest plot IRR menampilkan arah dan besaran pengaruh setiap prediktor beserta ketidakpastian estimasinya. Posisi titik di sebelah kiri IRR = 1 menunjukkan efek penurunan kasus.
\[\hat{\phi} = \frac{\sum_{i=1}^{n} r_{P,i}^2}{df_{\text{residual}}}\]
disp_pearson_flu <- sum(residuals(fit_poisson, type = "pearson")^2) / df.residual(fit_poisson)
tibble::tibble(
`Dispersi Pearson (phi^)` = round(disp_pearson_flu, 4),
`df Residual` = df.residual(fit_poisson),
`Interpretasi` = dplyr::case_when(
disp_pearson_flu > 1.5 ~ "Overdispersion — pertimbangkan Quasi-Poisson atau NB",
disp_pearson_flu < 0.85 ~ "Underdispersion — pertimbangkan Quasi-Poisson",
TRUE ~ "Equidispersion — model Poisson sesuai"
)
) %>%
knitr::kable(caption = "Pemeriksaan Equidispersion: Dispersi Pearson") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Dispersi Pearson (phi^) | df Residual | Interpretasi |
|---|---|---|
| 0.8085 | 227 | Underdispersion — pertimbangkan Quasi-Poisson |
Interpretasi: Nilai dispersi Pearson mendekati atau melebihi 1 mengindikasikan perlunya evaluasi lebih lanjut. Jika nilai jauh dari 1, model Quasi-Poisson dapat digunakan untuk koreksi standard error.
res_df_flu <- data.frame(
fitted = fitted(fit_poisson),
pearson = residuals(fit_poisson, type = "pearson"),
deviance = residuals(fit_poisson, type = "deviance")
)
p_res1 <- ggplot(res_df_flu, aes(x = fitted, y = pearson)) +
geom_point(alpha = 0.30, color = "#1a7a45", size = 1.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "#b84f5a") +
geom_smooth(method = "loess", se = FALSE, color = "#52c98a", linewidth = 0.9) +
labs(title = "Residual Pearson vs Fitted", x = "Nilai Fitted", y = "Residual Pearson") +
theme_green()
p_res2 <- ggplot(res_df_flu, aes(x = fitted, y = deviance)) +
geom_point(alpha = 0.30, color = "#0d9e6a", size = 1.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "#b84f5a") +
geom_smooth(method = "loess", se = FALSE, color = "#52c98a", linewidth = 0.9) +
labs(title = "Residual Deviance vs Fitted", x = "Nilai Fitted", y = "Residual Deviance") +
theme_green()
gridExtra::grid.arrange(p_res1, p_res2, ncol = 2)Interpretasi: Plot residual digunakan untuk mendeteksi pola sistematik yang mengindikasikan misfit model. Pola acak di sekitar garis nol menunjukkan model yang baik. Kurva loess membantu mendeteksi tren nonlinear yang mungkin terlewat.
flu_pred <- flu_data %>%
dplyr::mutate(
fitted = fitted(fit_poisson),
resid_p = residuals(fit_poisson, type = "pearson")
)
flu_pred %>%
dplyr::group_by(Year, Month) %>%
dplyr::summarise(
Aktual = sum(Count),
Fitted = sum(fitted),
.groups = "drop"
) %>%
dplyr::mutate(YM = Year + (Month - 1) / 12) %>%
tidyr::pivot_longer(cols = c(Aktual, Fitted),
names_to = "Tipe", values_to = "Kasus") %>%
ggplot(aes(x = YM, y = Kasus, color = Tipe, linetype = Tipe)) +
geom_line(linewidth = 1.15) +
geom_point(data = . %>% dplyr::filter(Tipe == "Aktual"),
size = 2.0, alpha = 0.7) +
scale_color_manual(values = c("Aktual" = "#0a2e1a", "Fitted" = "#52c98a")) +
scale_linetype_manual(values = c("Aktual" = "solid", "Fitted" = "dashed")) +
labs(title = "Nilai Aktual vs Fitted — Model Poisson (Influenza)",
subtitle = "NYC DOHMH, 2015–2019. Garis putus-putus = prediksi model.",
x = "Tahun", y = "Total kasus per bulan",
color = NULL, linetype = NULL) +
theme_green()season_newdata <- expand.grid(
Season = levels(flu_data$Season),
Month_f = factor(6, levels = levels(flu_data$Month_f)),
Year_c = 0
)
season_pred <- season_newdata %>%
dplyr::mutate(
Fitted = predict(fit_poisson, newdata = season_newdata, type = "response"),
CI_low = Fitted * exp(-1.96 * sqrt(1 / Fitted)),
CI_high = Fitted * exp(+1.96 * sqrt(1 / Fitted))
)
ggplot(season_pred, aes(x = Season, y = Fitted, fill = Season)) +
geom_col(width = 0.62, alpha = 0.92) +
geom_errorbar(aes(ymin = CI_low, ymax = CI_high),
width = 0.18, color = "#0a2e1a", linewidth = 0.9) +
geom_text(aes(label = round(Fitted, 0)),
vjust = -0.8, fontface = "bold", color = "#0a2e1a") +
scale_fill_manual(values = c("Winter" = "#0a2e1a", "Spring" = "#1a7a45",
"Summer" = "#52c98a", "Fall" = "#0d9e6a")) +
labs(title = "Estimasi Rata-rata Kasus Influenza per Minggu menurut Musim",
subtitle = "Prediksi regresi Poisson — bulan Juni, tahun 2017 (baseline).",
x = "Musim", y = "Rata-rata kasus per minggu (prediksi)",
caption = "Error bar: 95% CI aproksimasi") +
theme_green() +
theme(legend.position = "none")Interpretasi: Grafik menampilkan estimasi rata-rata kasus influenza per minggu berdasarkan musim. Musim dingin (Winter) secara konsisten menunjukkan kasus tertinggi, sedangkan musim panas (Summer) menunjukkan kasus terendah — pola yang konsisten dengan karakteristik epidemiologi influenza.
model_quasi <- glm(
Count ~ Season + Month_f + Year_c,
data = flu_data,
family = quasipoisson(link = "log")
)
# Perbandingan Standard Error
se_pois <- sqrt(diag(vcov(fit_poisson)))
se_quasi <- sqrt(diag(vcov(model_quasi)))
tibble::tibble(
Parameter = names(se_pois),
`SE Poisson` = round(se_pois, 5),
`SE Quasi-Poisson` = round(se_quasi, 5),
`Rasio SE (Q/P)` = round(se_quasi / se_pois, 4)
) %>%
knitr::kable(caption = "Perbandingan Standard Error: Poisson vs Quasi-Poisson") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Parameter | SE Poisson | SE Quasi-Poisson | Rasio SE (Q/P) |
|---|---|---|---|
| (Intercept) | 0.01095 | 0.00984 | 0.8992 |
| SeasonSpring | 0.02694 | 0.02423 | 0.8992 |
| SeasonSummer | 0.04561 | 0.04102 | 0.8992 |
| SeasonFall | 0.02537 | 0.02281 | 0.8992 |
| Month_f2 | 0.01542 | 0.01386 | 0.8992 |
| Month_f3 | 0.03496 | 0.03144 | 0.8992 |
| Month_f4 | 0.03464 | 0.03115 | 0.8992 |
| Month_f5 | NA | NA | NA |
| Month_f6 | 0.06366 | 0.05724 | 0.8992 |
| Month_f7 | 0.06290 | 0.05656 | 0.8992 |
| Month_f8 | NA | NA | NA |
| Month_f9 | 0.03224 | 0.02899 | 0.8992 |
| Month_f10 | 0.03239 | 0.02913 | 0.8992 |
| Month_f11 | NA | NA | NA |
| Month_f12 | 0.01544 | 0.01389 | 0.8992 |
| Year_c | 0.00365 | 0.00329 | 0.8992 |
Interpretasi: Model Quasi-Poisson menghasilkan koefisien yang identik dengan Poisson biasa, namun standard error-nya disesuaikan berdasarkan nilai dispersi \(\hat{\phi}\). Jika rasio SE (Q/P) ≈ 1, tidak ada masalah dispersi. Jika > 1, overdispersion perlu ditangani.
deviance_val <- deviance(fit_poisson)
df_residual <- df.residual(fit_poisson)
p_deviance <- 1 - pchisq(deviance_val, df_residual)
pearson_chi2 <- sum(residuals(fit_poisson, type = "pearson")^2)
p_pearson <- 1 - pchisq(pearson_chi2, df_residual)
disp_ratio <- pearson_chi2 / df_residual
tibble::tibble(
Ukuran = c("Deviance residual", "Pearson chi-squared",
"Rasio dispersi (Pearson/df)"),
Nilai = round(c(deviance_val, pearson_chi2, disp_ratio), 3),
`Derajat Bebas` = c(df_residual, df_residual, NA_real_),
`p-value` = round(c(p_deviance, p_pearson, NA_real_), 4),
Keterangan = c(
ifelse(p_deviance > 0.05, "Model sesuai", "Indikasi ketidaksesuaian"),
ifelse(p_pearson > 0.05, "Model sesuai", "Indikasi ketidaksesuaian"),
ifelse(disp_ratio < 1.5, "Equidispersion OK",
"Pertimbangkan Quasi-Poisson/NB")
)
) %>%
knitr::kable(caption = "Ukuran Goodness-of-Fit Model Poisson") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Ukuran | Nilai | Derajat Bebas | p-value | Keterangan |
|---|---|---|---|---|
| Deviance residual | 184.113 | 227 | 0.9832 | Model sesuai |
| Pearson chi-squared | 183.537 | 227 | 0.9844 | Model sesuai |
| Rasio dispersi (Pearson/df) | 0.809 | NA | NA | Equidispersion OK |
Catatan: Jika rasio dispersi > 2, terdapat overdispersion. Solusi: (1) Quasi-Poisson:
family = quasipoisson; (2) Negative Binomial:MASS::glm.nb().
Ringkasan Temuan Utama:
Uji G: Model secara simultan signifikan — minimal satu prediktor berpengaruh nyata terhadap rata-rata jumlah kasus influenza mingguan.
Uji Wald parsial mengidentifikasi musim sebagai prediktor dominan. Kasus influenza di musim dingin jauh lebih tinggi dibandingkan musim lainnya — konsisten dengan pola epidemiologi influenza global.
Dispersi Pearson: Perlu diperiksa apakah model Poisson standar sudah memadai atau perlu digunakan Quasi-Poisson untuk standard error yang lebih akurat.
Interpretasi IRR: Setiap prediktor yang signifikan memberikan kontribusi terukur terhadap perubahan rate kasus, dengan arah dan besaran yang dapat dikuantifikasi melalui IRR dan interval kepercayaannya.
Model Poisson lebih sesuai untuk data hitung dibandingkan OLS biasa karena konsistensi target model, prediksi nonnegatif, dan kemudahan interpretasi koefisien sebagai IRR.
tibble::tibble(
Aspek = c(
"Jenis variabel respons",
"Distribusi asumsi",
"Fungsi link",
"Jumlah persamaan",
"Interpretasi koefisien",
"Asumsi khas",
"Fungsi R",
"Dataset contoh"
),
`Regresi Biner` = c(
"Dikotomi (0/1)",
"Binomial",
"Logit",
"1 persamaan",
"Odds Ratio (OR = exp(β))",
"Independensi observasi",
"glm(family = binomial)",
"German Credit (n=1.000)"
),
`Regresi Multinomial` = c(
"Kategorik nominal (≥3 kategori)",
"Multinomial",
"Log (vs kategori referensi)",
"K−1 persamaan logit",
"Relative Risk Ratio (RRR = exp(β))",
"IIA — Independence of Irrelevant Alternatives",
"nnet::multinom()",
"NHANES (n=~5.000)"
),
`Regresi Ordinal` = c(
"Kategorik ordinal (urutan bermakna)",
"Multinomial dengan CDF logistik",
"Cumulative logit",
"J−1 threshold, satu vektor β",
"Odds Ratio kumulatif (OR = exp(β))",
"Proportional Odds (parallel lines)",
"MASS::polr()",
"Student Performance (n=649)"
),
`Regresi Poisson` = c(
"Data cacahan (0, 1, 2, ...)",
"Poisson",
"Log",
"Satu persamaan",
"Incidence Rate Ratio (IRR = exp(β))",
"Equidispersion: E(Y) = Var(Y)",
"glm(family = poisson)",
"NYC DOHMH Influenza (2015–2019)"
)
) %>%
knitr::kable(caption = "Perbandingan Regresi Logistik Biner, Multinomial, Ordinal, dan Poisson") %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Aspek | Regresi Biner | Regresi Multinomial | Regresi Ordinal | Regresi Poisson |
|---|---|---|---|---|
| Jenis variabel respons | Dikotomi (0/1) | Kategorik nominal (≥3 kategori) | Kategorik ordinal (urutan bermakna) | Data cacahan (0, 1, 2, …) |
| Distribusi asumsi | Binomial | Multinomial | Multinomial dengan CDF logistik | Poisson |
| Fungsi link | Logit | Log (vs kategori referensi) | Cumulative logit | Log |
| Jumlah persamaan | 1 persamaan | K−1 persamaan logit | J−1 threshold, satu vektor β | Satu persamaan |
| Interpretasi koefisien | Odds Ratio (OR = exp(β)) | Relative Risk Ratio (RRR = exp(β)) | Odds Ratio kumulatif (OR = exp(β)) | Incidence Rate Ratio (IRR = exp(β)) |
| Asumsi khas | Independensi observasi | IIA — Independence of Irrelevant Alternatives | Proportional Odds (parallel lines) | Equidispersion: E(Y) = Var(Y) |
| Fungsi R | glm(family = binomial) | nnet::multinom() | MASS::polr() | glm(family = poisson) |
| Dataset contoh | German Credit (n=1.000) | NHANES (n=~5.000) | Student Performance (n=649) | NYC DOHMH Influenza (2015–2019) |