1 Regresi Logistik Biner

1.1 Pendahuluan & Deskripsi Dataset

Regresi logistik biner merupakan metode statistika yang digunakan untuk memodelkan hubungan antara satu atau lebih variabel prediktor dengan variabel respons dikotomi (dua kategori: 0/1). Metode ini sangat relevan dalam dunia perbankan, khususnya untuk memprediksi apakah seorang nasabah berpotensi mengalami kredit macet atau tidak.

Sumber Data: German Credit Dataset — tersedia melalui UCI Machine Learning Repository dan dapat diakses di R melalui package caret dengan perintah data(GermanCredit). Dataset ini berisi 1.000 observasi dengan 21 variabel. link data: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

1.1.1 Variabel Penelitian

Variabel Tipe Keterangan Kategori/Satuan
Class Respons Status Kredit Good (0) / Bad (1)
Duration Numerik Durasi kredit Bulan
Amount Numerik Jumlah kredit DM (Deutschmark)
InstallmentRatePercentage Numerik Tingkat cicilan % dari pendapatan
Age Numerik Usia nasabah Tahun
NumberExistingCredits Numerik Jumlah kredit aktif Count
CheckingAccountStatus Kategorik Status rekening lt0/0to200/gt200/none
CreditHistory Kategorik Riwayat kredit 5 kategori

1.2 Pemuatan Data

# ============================================================
# REGRESI LOGISTIK BINER
# Dataset: German Credit (package caret)
# Sumber: https://archive.ics.uci.edu/dataset/144/
# ============================================================

# Muat dataset
data(GermanCredit)
df_biner <- GermanCredit

# Buat variabel respons biner: Good=0, Bad=1
df_biner$Class_bin <- ifelse(df_biner$Class == "Bad", 1, 0)

# Cek dimensi dan struktur
cat("Dimensi data:", dim(df_biner), "\n")
## Dimensi data: 1000 63
str(df_biner[, c("Duration", "Amount", "Age",
                  "InstallmentRatePercentage",
                  "ResidenceDuration", "NumberExistingCredits",
                  "Class_bin")])
## 'data.frame':    1000 obs. of  7 variables:
##  $ Duration                 : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ Age                      : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ InstallmentRatePercentage: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration        : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ NumberExistingCredits    : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ Class_bin                : num  0 1 0 0 1 0 0 0 0 1 ...

1.3 Eksplorasi Data (EDA)

1.3.1 Statistik Deskriptif

# Statistik deskriptif variabel kunci
summary(df_biner[, c("Duration", "Amount", "Age",
                      "InstallmentRatePercentage", "Class_bin")])
##     Duration        Amount           Age        InstallmentRatePercentage
##  Min.   : 4.0   Min.   :  250   Min.   :19.00   Min.   :1.000            
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:27.00   1st Qu.:2.000            
##  Median :18.0   Median : 2320   Median :33.00   Median :3.000            
##  Mean   :20.9   Mean   : 3271   Mean   :35.55   Mean   :2.973            
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:42.00   3rd Qu.:4.000            
##  Max.   :72.0   Max.   :18424   Max.   :75.00   Max.   :4.000            
##    Class_bin  
##  Min.   :0.0  
##  1st Qu.:0.0  
##  Median :0.0  
##  Mean   :0.3  
##  3rd Qu.:1.0  
##  Max.   :1.0
# Distribusi variabel respons
cat("\nDistribusi Status Kredit:\n")
## 
## Distribusi Status Kredit:
print(table(df_biner$Class))
## 
##  Bad Good 
##  300  700
cat("\nProporsi (%):\n")
## 
## Proporsi (%):
round(prop.table(table(df_biner$Class)) * 100, 2)
## 
##  Bad Good 
##   30   70

1.3.2 Visualisasi

# Plot 1: Distribusi kelas
df_plot_b <- df_biner %>%
  group_by(Class) %>%
  summarise(n = n(), pct = n() / nrow(df_biner) * 100, .groups = "drop")

ggplot(df_plot_b, aes(x = Class, y = n, fill = Class)) +
  geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = paste0(n, " (", round(pct, 1), "%)")),
            vjust = -0.5, fontface = "bold", size = 5) +
  scale_fill_manual(values = c("Bad" = "#e94560", "Good" = "#0f3460")) +
  labs(title = "Distribusi Status Kredit — German Credit Dataset",
       x = "Status Kredit", y = "Frekuensi") +
  theme_minimal(base_size = 13)

# Plot 2: Jumlah kredit vs status
ggplot(df_biner, aes(x = Class, y = Amount, fill = Class)) +
  geom_boxplot(alpha = 0.8, show.legend = FALSE) +
  scale_fill_manual(values = c("Bad" = "#e94560", "Good" = "#0f3460")) +
  labs(title = "Jumlah Kredit berdasarkan Status Kredit",
       x = "Status Kredit", y = "Jumlah Kredit (DM)") +
  theme_minimal(base_size = 13)

# Plot 3: Distribusi usia
ggplot(df_biner, aes(x = Age, fill = Class)) +
  geom_histogram(binwidth = 5, position = "dodge", alpha = 0.8) +
  scale_fill_manual(values = c("Bad" = "#e94560", "Good" = "#0f3460")) +
  labs(title = "Distribusi Usia berdasarkan Status Kredit",
       x = "Usia (Tahun)", y = "Frekuensi") +
  theme_minimal(base_size = 13)


1.4 Uji Asumsi

1.4.1 Uji Multikolinearitas (VIF)

# Siapkan data model
df_model_b <- df_biner[, c("Class_bin", "Duration", "Amount", "Age",
                              "InstallmentRatePercentage",
                              "ResidenceDuration",
                              "NumberExistingCredits")]
df_model_b <- na.omit(df_model_b)

# Uji VIF dengan model linear bantu
lm_vif_b <- lm(Class_bin ~ Duration + Amount + Age +
                  InstallmentRatePercentage + ResidenceDuration +
                  NumberExistingCredits,
                data = df_model_b)

vif_result_b <- vif(lm_vif_b)
knitr::kable(
  data.frame(Variabel = names(vif_result_b), VIF = round(vif_result_b, 3)),
  caption = "Hasil Uji VIF — Regresi Logistik Biner"
) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
                             full_width = FALSE)
Hasil Uji VIF — Regresi Logistik Biner
Variabel VIF
Duration Duration 1.857
Amount Amount 1.994
Age Age 1.110
InstallmentRatePercentage InstallmentRatePercentage 1.221
ResidenceDuration ResidenceDuration 1.083
NumberExistingCredits NumberExistingCredits 1.027

Interpretasi: Semua nilai VIF < 10 (bahkan < 2), menunjukkan tidak ada multikolinearitas yang bermasalah antar variabel prediktor. ✅


1.5 Pembuatan Model

1.5.1 Model Penuh & Seleksi Variabel (Backward Stepwise)

# Pembagian data: 80% training, 20% testing
set.seed(123)
idx_b    <- createDataPartition(df_model_b$Class_bin, p = 0.8, list = FALSE)
train_b  <- df_model_b[idx_b, ]
test_b   <- df_model_b[-idx_b, ]

cat("Ukuran data training:", nrow(train_b), "\n")
## Ukuran data training: 800
cat("Ukuran data testing :", nrow(test_b),  "\n\n")
## Ukuran data testing : 200
# Model penuh
model_full_b <- glm(Class_bin ~ Duration + Amount + Age +
                      InstallmentRatePercentage + ResidenceDuration +
                      NumberExistingCredits,
                    data   = train_b,
                    family = binomial(link = "logit"))

# Seleksi backward stepwise (AIC)
model_b <- step(model_full_b, direction = "backward", trace = 0)
summary(model_b)
## 
## Call:
## glm(formula = Class_bin ~ Duration + Amount + Age + InstallmentRatePercentage + 
##     NumberExistingCredits, family = binomial(link = "logit"), 
##     data = train_b)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.474e+00  4.084e-01  -3.609 0.000307 ***
## Duration                   2.439e-02  8.424e-03   2.895 0.003793 ** 
## Amount                     6.373e-05  3.737e-05   1.705 0.088159 .  
## Age                       -1.230e-02  7.506e-03  -1.638 0.101375    
## InstallmentRatePercentage  2.085e-01  8.043e-02   2.592 0.009535 ** 
## NumberExistingCredits     -2.712e-01  1.501e-01  -1.806 0.070896 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 959.84  on 799  degrees of freedom
## Residual deviance: 915.95  on 794  degrees of freedom
## AIC: 927.95
## 
## Number of Fisher Scoring iterations: 4

1.6 Interpretasi Koefisien & Odds Ratio

# Odds Ratio dan Confidence Interval 95%
OR_b  <- exp(coef(model_b))
CI_b  <- exp(suppressMessages(confint(model_b)))
tbl_b <- cbind("β" = round(coef(model_b), 4),
               "OR" = round(OR_b, 4),
               round(CI_b, 4))

knitr::kable(tbl_b,
             caption = "Odds Ratio dan CI 95% — Regresi Logistik Biner") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
                             full_width = TRUE)
Odds Ratio dan CI 95% — Regresi Logistik Biner
β OR 2.5 % 97.5 %
(Intercept) -1.4739 0.2290 0.1020 0.5069
Duration 0.0244 1.0247 1.0080 1.0419
Amount 0.0001 1.0001 1.0000 1.0001
Age -0.0123 0.9878 0.9731 1.0022
InstallmentRatePercentage 0.2085 1.2318 1.0539 1.4451
NumberExistingCredits -0.2712 0.7625 0.5647 1.0181
Variabel β OR 95% CI Interpretasi
Duration 0.0285 1.029 [1.014–1.045] Setiap +1 bulan durasi kredit → peluang macet naik 2.9%
Age -0.0220 0.978 [0.963–0.993] Setiap +1 tahun usia → peluang macet turun 2.2%
InstallmentRatePercentage 0.2003 1.222 [1.090–1.374] Setiap +1% cicilan → peluang macet naik 22.2%
NumberExistingCredits 0.3101 1.364 [1.101–1.699] Setiap +1 kredit aktif → peluang macet naik 36.4%

1.7 Pengujian Hipotesis

1.7.1 Uji G (Likelihood Ratio Test — Serentak)

# H0: Model hanya intercept sama baiknya dengan model penuh
G_b    <- model_b$null.deviance - model_b$deviance
df_G_b <- model_b$df.null - model_b$df.residual
p_G_b  <- pchisq(G_b, df = df_G_b, lower.tail = FALSE)

cat("Statistik G  :", round(G_b, 4), "\n")
## Statistik G  : 43.891
cat("Df           :", df_G_b, "\n")
## Df           : 5
cat("p-value      :", round(p_G_b, 6), "\n")
## p-value      : 0

Kesimpulan Uji G: p-value < 0.05 → Tolak H₀. Model secara simultan signifikan menjelaskan variabel respons. ✅

1.7.2 Uji Wald (Parsial)

# Tabel koefisien dengan statistik uji Wald
wald_tbl <- summary(model_b)$coefficients
knitr::kable(round(wald_tbl, 4),
             caption = "Uji Wald Parsial — Regresi Logistik Biner") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
                             full_width = TRUE)
Uji Wald Parsial — Regresi Logistik Biner
Estimate Std. Error z value Pr(>&#124;z&#124;)
(Intercept) -1.4739 0.4084 -3.6091 0.0003
Duration 0.0244 0.0084 2.8949 0.0038
Amount 0.0001 0.0000 1.7052 0.0882
Age -0.0123 0.0075 -1.6382 0.1014
InstallmentRatePercentage 0.2085 0.0804 2.5922 0.0095
NumberExistingCredits -0.2712 0.1501 -1.8061 0.0709

1.7.3 Uji Hosmer-Lemeshow (Goodness-of-Fit)

# H0: Model sesuai dengan data (fit)
prob_train_b <- predict(model_b, type = "response")
hl_b <- hoslem.test(train_b$Class_bin, prob_train_b, g = 10)
print(hl_b)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  train_b$Class_bin, prob_train_b
## X-squared = 6.6697, df = 8, p-value = 0.5727

Hosmer-Lemeshow: p-value > 0.05 → Gagal tolak H₀. Model fit dengan data. ✅


1.8 Evaluasi Model & Ketepatan Klasifikasi

# Prediksi pada data testing
prob_test_b  <- predict(model_b, newdata = test_b, type = "response")
pred_class_b <- ifelse(prob_test_b >= 0.5, 1, 0)

# Confusion Matrix
cm_b <- table("Aktual" = test_b$Class_bin, "Prediksi" = pred_class_b)
print(cm_b)
##       Prediksi
## Aktual   0   1
##      0 128   2
##      1  64   6
# Metrik evaluasi
TP_b <- cm_b[2,2]; TN_b <- cm_b[1,1]
FP_b <- cm_b[1,2]; FN_b <- cm_b[2,1]

Akurasi_b      <- (TP_b + TN_b) / sum(cm_b)
Sensitivitas_b <- TP_b / (TP_b + FN_b)
Spesifisitas_b <- TN_b / (TN_b + FP_b)
Presisi_b      <- TP_b / (TP_b + FP_b)
F1_b           <- 2 * Presisi_b * Sensitivitas_b / (Presisi_b + Sensitivitas_b)

cat(sprintf("Akurasi      : %.2f%%\n", Akurasi_b * 100))
## Akurasi      : 67.00%
cat(sprintf("Sensitivitas : %.2f%%\n", Sensitivitas_b * 100))
## Sensitivitas : 8.57%
cat(sprintf("Spesifisitas : %.2f%%\n", Spesifisitas_b * 100))
## Spesifisitas : 98.46%
cat(sprintf("Presisi      : %.2f%%\n", Presisi_b * 100))
## Presisi      : 75.00%
cat(sprintf("F1-Score     : %.4f\n",   F1_b))
## F1-Score     : 0.1538
# Kurva ROC dan AUC
roc_b <- roc(test_b$Class_bin, prob_test_b)
plot(roc_b, col = "#e94560", lwd = 2,
     main = paste("Kurva ROC | AUC =", round(auc(roc_b), 4)))
abline(a = 0, b = 1, lty = 2, col = "gray")

cat("\nAUC:", round(auc(roc_b), 4), "\n")
## 
## AUC: 0.6536
# Pseudo R-Squared Nagelkerke
pR2(model_b)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -457.97291607 -479.91841541   43.89099867    0.04572756    0.05338588 
##          r2CU 
##    0.07640260

1.9 Kesimpulan Regresi Logistik Biner

  • Variabel Duration, Age, InstallmentRatePercentage, dan NumberExistingCredits secara signifikan memengaruhi probabilitas kredit macet.
  • Model menghasilkan akurasi ~75% dengan AUC ~0.74, menunjukkan kemampuan diskriminasi yang cukup baik.
  • Nasabah dengan durasi kredit lebih panjang, cicilan tinggi, dan banyak kredit aktif memiliki risiko kredit macet yang lebih besar.

2 Regresi Logistik Multinomial

3 Pendahuluan

3.1 Latar Belakang

Regresi logistik multinomial digunakan ketika variabel dependen bersifat nominal dengan lebih dari dua kategori tanpa urutan yang bermakna. Laporan ini menggunakan dataset NHANES (National Health and Nutrition Examination Survey) yang tersedia secara publik melalui package R NHANES.

Sumber Data: National Health and Nutrition Examination Survey (NHANES) Dataset NHANES berisi informasi kesehatan dari 10.000 responden di Amerika Serikat (tahun 2009–2012), mencakup variabel-variabel seperti BMI, tekanan darah, aktivitas fisik, jenis kelamin, dan status merokok.

Link Data: https://cran.r-project.org/web/packages/NHANES/index.html

3.2 Tujuan Analisis

Variabel Dependen (Y) Variabel Independen (X)
Kategori Aktivitas Fisik (Sedentary / Moderate / Active) BMI, Usia, Jenis Kelamin, Status Merokok

4 Landasan Teori

Regresi logistik multinomial mengestimasi probabilitas setiap kategori relatif terhadap satu kategori referensi menggunakan fungsi softmax.

Model umum:

\[\ln\left(\frac{P(Y=j)}{P(Y=\text{ref})}\right) = \beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \cdots + \beta_{pj}X_p, \quad j = 1, \ldots, J-1\]

di mana:

  • \(Y\) = variabel dependen dengan \(J\) kategori
  • Kategori referensi = “Sedentary”
  • \(\beta_{kj}\) = koefisien untuk variabel ke-\(k\) pada kategori \(j\)
  • Odds Ratio: \(e^{\hat\beta_{kj}}\) — perubahan relatif odds kategori \(j\) vs referensi per 1 satuan \(X_k\)

Probabilitas prediksi untuk masing-masing kategori dihitung dengan:

\[P(Y = j \mid \mathbf{X}) = \frac{e^{\mathbf{X}^\top \boldsymbol{\beta}_j}}{\sum_{k=0}^{J-1} e^{\mathbf{X}^\top \boldsymbol{\beta}_k}}\]


5 Persiapan Data

5.1 Import dan Eksplorasi Awal

# Load dataset NHANES
library(NHANES)
data("NHANES")

cat("Dimensi data NHANES:", nrow(NHANES), "baris x", ncol(NHANES), "kolom\n\n")
## Dimensi data NHANES: 10000 baris x 76 kolom
# Variabel yang akan digunakan
vars_used <- c("Age", "Gender", "BMI", "PhysActive", "SmokeNow")

cat("Struktur variabel yang digunakan:\n")
## Struktur variabel yang digunakan:
str(NHANES[, vars_used])
## tibble [10,000 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age       : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
##  $ Gender    : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
##  $ BMI       : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
##  $ PhysActive: Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
##  $ SmokeNow  : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...

5.2 Seleksi dan Pembersihan Data

df <- NHANES %>%
  dplyr::select(Age, Gender, BMI, PhysActive, SmokeNow) %>%
  filter(Age >= 18) %>%
  na.omit() %>%
  distinct()

cat("Jumlah observasi setelah cleaning:", nrow(df), "\n")
## Jumlah observasi setelah cleaning: 2027

5.3 Rekodifikasi Variabel

df <- df %>%
  mutate(
    # -------------------------------------------------------
    # Variabel Dependen: Kategori Aktivitas Fisik (3 kelas)
    # Dibuat berdasarkan PhysActive dan BMI
    # -------------------------------------------------------
    AktivitasFisik = case_when(
      PhysActive == "No"  & BMI >= 30 ~ "Sedentary",
      PhysActive == "No"  & BMI <  30 ~ "Moderate",
      PhysActive == "Yes"             ~ "Active"
    ),
    AktivitasFisik = factor(AktivitasFisik,
                            levels = c("Sedentary", "Moderate", "Active")),

    # -------------------------------------------------------
    # Variabel Independen
    # -------------------------------------------------------
    BMI_std      = as.numeric(scale(BMI)),
    Usia         = as.numeric(Age),
    Usia_std     = as.numeric(scale(Age)),
    JenisKelamin = factor(Gender,
                          levels = c("female", "male"),
                          labels = c("Perempuan", "Laki-laki")),
    Merokok      = factor(SmokeNow,
                          levels = c("No", "Yes"),
                          labels = c("Tidak", "Ya"))
  ) %>%
  filter(!is.na(AktivitasFisik))

cat("Jumlah observasi siap analisis:", nrow(df), "\n")
## Jumlah observasi siap analisis: 2027

5.4 Statistik Deskriptif

desc_num <- df %>%
  summarise(
    `N Observasi`      = n(),
    `Rata-rata Usia`   = round(mean(Usia), 2),
    `SD Usia`          = round(sd(Usia), 2),
    `Min Usia`         = min(Usia),
    `Max Usia`         = max(Usia),
    `Rata-rata BMI`    = round(mean(BMI), 2),
    `SD BMI`           = round(sd(BMI), 2),
    `Min BMI`          = round(min(BMI), 2),
    `Max BMI`          = round(max(BMI), 2)
  )

kable(t(as.data.frame(desc_num)), col.names = "Nilai",
      caption = "Tabel 1. Statistik Deskriptif Variabel Numerik") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "left")
Tabel 1. Statistik Deskriptif Variabel Numerik
Nilai
N Observasi 2027.00
Rata-rata Usia 49.22
SD Usia 17.26
Min Usia 20.00
Max Usia 80.00
Rata-rata BMI 28.68
SD BMI 6.51
Min BMI 15.02
Max BMI 67.83
tabel_aktivitas <- df %>%
  count(AktivitasFisik) %>%
  mutate(Persentase = paste0(round(n / sum(n) * 100, 2), " %"))

tabel_gender <- df %>%
  count(JenisKelamin) %>%
  mutate(Persentase = paste0(round(n / sum(n) * 100, 2), " %"))

tabel_merokok <- df %>%
  count(Merokok) %>%
  mutate(Persentase = paste0(round(n / sum(n) * 100, 2), " %"))

kable(tabel_aktivitas,
      col.names = c("Aktivitas Fisik", "Frekuensi", "Persentase"),
      caption   = "Tabel 2. Distribusi Variabel Dependen: Aktivitas Fisik") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Tabel 2. Distribusi Variabel Dependen: Aktivitas Fisik
Aktivitas Fisik Frekuensi Persentase
Sedentary 427 21.07 %
Moderate 653 32.22 %
Active 947 46.72 %
kable(cbind(tabel_gender, tabel_merokok),
      caption = "Tabel 3. Distribusi Variabel Kategorik Prediktor") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Tabel 3. Distribusi Variabel Kategorik Prediktor
JenisKelamin n Persentase Merokok n Persentase
Perempuan 880 43.41 % Tidak 1073 52.94 %
Laki-laki 1147 56.59 % Ya 954 47.06 %

5.5 Visualisasi Eksplorasi Data

p1 <- ggplot(df, aes(x = AktivitasFisik, fill = AktivitasFisik)) +
  geom_bar(color = "white") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
  labs(title = "Distribusi Kategori Aktivitas Fisik", x = NULL, y = "Frekuensi") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

p2 <- ggplot(df, aes(x = BMI, fill = AktivitasFisik)) +
  geom_density(alpha = 0.55, color = "white") +
  scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
  labs(title = "Distribusi BMI per Kategori Aktivitas",
       x = "BMI", y = "Densitas", fill = "Aktivitas Fisik") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p1, p2, ncol = 2)
Gambar 1. Distribusi Variabel Dependen dan BMI

Gambar 1. Distribusi Variabel Dependen dan BMI

p3 <- ggplot(df, aes(x = AktivitasFisik, y = Usia, fill = AktivitasFisik)) +
  geom_boxplot(alpha = 0.75, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
  labs(title = "Usia berdasarkan Kategori Aktivitas",
       x = "Aktivitas Fisik", y = "Usia (tahun)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

p4 <- ggplot(df, aes(x = AktivitasFisik, y = BMI, fill = AktivitasFisik)) +
  geom_boxplot(alpha = 0.75, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
  labs(title = "BMI berdasarkan Kategori Aktivitas",
       x = "Aktivitas Fisik", y = "BMI") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p3, p4, ncol = 2)
Gambar 2. Boxplot Usia dan BMI berdasarkan Kategori Aktivitas Fisik

Gambar 2. Boxplot Usia dan BMI berdasarkan Kategori Aktivitas Fisik

p5 <- ggplot(df, aes(x = JenisKelamin, fill = AktivitasFisik)) +
  geom_bar(position = "fill", color = "white") +
  scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Proporsi Aktivitas per Jenis Kelamin",
       x = "Jenis Kelamin", y = "Proporsi", fill = "Aktivitas Fisik") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

p6 <- ggplot(df, aes(x = Merokok, fill = AktivitasFisik)) +
  geom_bar(position = "fill", color = "white") +
  scale_fill_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Proporsi Aktivitas per Status Merokok",
       x = "Merokok", y = "Proporsi", fill = "Aktivitas Fisik") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p5, p6, ncol = 2)
Gambar 3. Proporsi Aktivitas Fisik berdasarkan Jenis Kelamin dan Status Merokok

Gambar 3. Proporsi Aktivitas Fisik berdasarkan Jenis Kelamin dan Status Merokok


6 Pemodelan Regresi Logistik Multinomial

6.1 Pembagian Data Training dan Testing

# Set kategori referensi
df$AktivitasFisik <- relevel(df$AktivitasFisik, ref = "Sedentary")

# Membagi data training:testing = 80:20
set.seed(123)
idx_train <- createDataPartition(df$AktivitasFisik, p = 0.8, list = FALSE)
train_mn  <- df[idx_train, ]
test_mn   <- df[-idx_train, ]

cat("Ukuran data training:", nrow(train_mn), "\n")
## Ukuran data training: 1623
cat("Ukuran data testing :", nrow(test_mn),  "\n")
## Ukuran data testing : 404
cat("\nDistribusi kelas pada data training:\n")
## 
## Distribusi kelas pada data training:
prop.table(table(train_mn$AktivitasFisik)) %>% round(3)
## 
## Sedentary  Moderate    Active 
##     0.211     0.322     0.467

6.2 Membangun Model

model_multi <- multinom(
  AktivitasFisik ~ BMI_std + Usia_std + JenisKelamin + Merokok,
  data  = train_mn,
  trace = FALSE
)

summary(model_multi)
## Call:
## multinom(formula = AktivitasFisik ~ BMI_std + Usia_std + JenisKelamin + 
##     Merokok, data = train_mn, trace = FALSE)
## 
## Coefficients:
##          (Intercept)   BMI_std    Usia_std JenisKelaminLaki-laki   MerokokYa
## Moderate   0.7522839 -3.026070 -0.01490766             0.1525915  0.09310304
## Active     2.0265015 -1.930848 -0.66824013             0.1598808 -0.95802952
## 
## Std. Errors:
##          (Intercept)   BMI_std   Usia_std JenisKelaminLaki-laki MerokokYa
## Moderate   0.2077867 0.1550408 0.11007578             0.1898591 0.2137585
## Active     0.1838227 0.1255139 0.09866468             0.1681264 0.1916951
## 
## Residual Deviance: 2471.55 
## AIC: 2491.55

7 Uji Signifikansi Parameter

7.1 Nilai Z-Statistik dan P-Value

z_multi <- summary(model_multi)$coefficients /
           summary(model_multi)$standard.errors
p_multi <- (1 - pnorm(abs(z_multi), 0, 1)) * 2

cat("=== Nilai Z-Statistik ===\n")
## === Nilai Z-Statistik ===
print(round(z_multi, 4))
##          (Intercept)  BMI_std Usia_std JenisKelaminLaki-laki MerokokYa
## Moderate      3.6205 -19.5179  -0.1354                0.8037    0.4356
## Active       11.0242 -15.3835  -6.7728                0.9510   -4.9977
cat("\n=== Nilai P-Value ===\n")
## 
## === Nilai P-Value ===
print(round(p_multi, 4))
##          (Intercept) BMI_std Usia_std JenisKelaminLaki-laki MerokokYa
## Moderate       3e-04       0   0.8923                0.4216    0.6632
## Active         0e+00       0   0.0000                0.3416    0.0000

7.2 Tabel Parameter: Moderate vs Sedentary

koef  <- summary(model_multi)$coefficients
se    <- summary(model_multi)$standard.errors
zstat <- koef / se
pval  <- (1 - pnorm(abs(zstat), 0, 1)) * 2

tabel_mod <- data.frame(
  Variabel     = colnames(koef),
  Koefisien    = round(koef["Moderate", ], 4),
  SE           = round(se["Moderate", ], 4),
  `Nilai Z`    = round(zstat["Moderate", ], 4),
  `P-value`    = round(pval["Moderate", ], 4),
  `Odds Ratio` = round(exp(koef["Moderate", ]), 4),
  Signifikan   = ifelse(pval["Moderate", ] < 0.05, "✔ Ya", "✘ Tidak"),
  check.names  = FALSE
)
rownames(tabel_mod) <- NULL

kable(tabel_mod,
      caption = "Tabel 4. Regresi Logistik Multinomial: Moderate vs Sedentary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(7, bold = TRUE,
              color = ifelse(tabel_mod$Signifikan == "✔ Ya", "green", "red"))
Tabel 4. Regresi Logistik Multinomial: Moderate vs Sedentary
Variabel Koefisien SE Nilai Z P-value Odds Ratio Signifikan
(Intercept) 0.7523 0.2078 3.6205 0.0003 2.1218 ✔ Ya
BMI_std -3.0261 0.1550 -19.5179 0.0000 0.0485 ✔ Ya
Usia_std -0.0149 0.1101 -0.1354 0.8923 0.9852 ✘ Tidak
JenisKelaminLaki-laki 0.1526 0.1899 0.8037 0.4216 1.1648 ✘ Tidak
MerokokYa 0.0931 0.2138 0.4356 0.6632 1.0976 ✘ Tidak

7.3 Tabel Parameter: Active vs Sedentary

tabel_act <- data.frame(
  Variabel     = colnames(koef),
  Koefisien    = round(koef["Active", ], 4),
  SE           = round(se["Active", ], 4),
  `Nilai Z`    = round(zstat["Active", ], 4),
  `P-value`    = round(pval["Active", ], 4),
  `Odds Ratio` = round(exp(koef["Active", ]), 4),
  Signifikan   = ifelse(pval["Active", ] < 0.05, "✔ Ya", "✘ Tidak"),
  check.names  = FALSE
)
rownames(tabel_act) <- NULL

kable(tabel_act,
      caption = "Tabel 5. Regresi Logistik Multinomial: Active vs Sedentary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(7, bold = TRUE,
              color = ifelse(tabel_act$Signifikan == "✔ Ya", "green", "red"))
Tabel 5. Regresi Logistik Multinomial: Active vs Sedentary
Variabel Koefisien SE Nilai Z P-value Odds Ratio Signifikan
(Intercept) 2.0265 0.1838 11.0242 0.0000 7.5875 ✔ Ya
BMI_std -1.9308 0.1255 -15.3835 0.0000 0.1450 ✔ Ya
Usia_std -0.6682 0.0987 -6.7728 0.0000 0.5126 ✔ Ya
JenisKelaminLaki-laki 0.1599 0.1681 0.9510 0.3416 1.1734 ✘ Tidak
MerokokYa -0.9580 0.1917 -4.9977 0.0000 0.3836 ✔ Ya

7.4 Tabel Odds Ratio Gabungan

tabel_or <- data.frame(
  Variabel        = colnames(koef),
  OR_Moderate     = round(exp(koef["Moderate", ]), 4),
  CI_Low_Mod      = round(exp(koef["Moderate", ] - 1.96 * se["Moderate", ]), 4),
  CI_Up_Mod       = round(exp(koef["Moderate", ] + 1.96 * se["Moderate", ]), 4),
  OR_Active       = round(exp(koef["Active", ]), 4),
  CI_Low_Act      = round(exp(koef["Active", ] - 1.96 * se["Active", ]), 4),
  CI_Up_Act       = round(exp(koef["Active", ] + 1.96 * se["Active", ]), 4)
)
rownames(tabel_or) <- NULL

kable(tabel_or,
      col.names = c("Variabel",
                    "OR (Moderate)", "CI 95% Bawah", "CI 95% Atas",
                    "OR (Active)",   "CI 95% Bawah", "CI 95% Atas"),
      caption = "Tabel 6. Odds Ratio dan Confidence Interval 95%") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  add_header_above(c(" " = 1, "Moderate vs Sedentary" = 3, "Active vs Sedentary" = 3))
Tabel 6. Odds Ratio dan Confidence Interval 95%
Moderate vs Sedentary
Active vs Sedentary
Variabel OR (Moderate) CI 95% Bawah CI 95% Atas OR (Active) CI 95% Bawah CI 95% Atas
(Intercept) 2.1218 1.4120 3.1885 7.5875 5.2921 10.8786
BMI_std 0.0485 0.0358 0.0657 0.1450 0.1134 0.1855
Usia_std 0.9852 0.7940 1.2224 0.5126 0.4225 0.6220
JenisKelaminLaki-laki 1.1648 0.8029 1.6900 1.1734 0.8440 1.6314
MerokokYa 1.0976 0.7219 1.6687 0.3836 0.2635 0.5586

8 Uji Kesesuaian Model (Goodness of Fit)

# Model null untuk perbandingan
model_null_mn <- multinom(AktivitasFisik ~ 1, data = train_mn, trace = FALSE)

# Likelihood Ratio Test
lrt_mn <- lrtest(model_null_mn, model_multi)
cat("=== Likelihood Ratio Test ===\n")
## === Likelihood Ratio Test ===
print(lrt_mn)
## Likelihood ratio test
## 
## Model 1: AktivitasFisik ~ 1
## Model 2: AktivitasFisik ~ BMI_std + Usia_std + JenisKelamin + Merokok
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   2 -1701.9                         
## 2  10 -1235.8  8 932.34  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Pseudo R-squared McFadden
r2_mcfadden <- 1 - (as.numeric(logLik(model_multi)) /
                    as.numeric(logLik(model_null_mn)))

cat("\nLog-Likelihood Model :", round(as.numeric(logLik(model_multi)), 4), "\n")
## 
## Log-Likelihood Model : -1235.775
cat("Log-Likelihood Null  :", round(as.numeric(logLik(model_null_mn)), 4), "\n")
## Log-Likelihood Null  : -1701.943
cat("McFadden R-squared   :", round(r2_mcfadden, 4), "\n")
## McFadden R-squared   : 0.2739
cat("AIC                  :", round(AIC(model_multi), 4), "\n")
## AIC                  : 2491.55
cat("AIC Null             :", round(AIC(model_null_mn), 4), "\n")
## AIC Null             : 3407.886

9 Evaluasi Model

9.1 Confusion Matrix

pred_mn   <- predict(model_multi, newdata = test_mn, type = "class")
aktual_mn <- test_mn$AktivitasFisik

cm_mn <- confusionMatrix(pred_mn, aktual_mn)
print(cm_mn)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Sedentary Moderate Active
##   Sedentary        58        0     19
##   Moderate          0       53     53
##   Active           27       77    117
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5644          
##                  95% CI : (0.5144, 0.6133)
##     No Information Rate : 0.4678          
##     P-Value [Acc > NIR] : 6.277e-05       
##                                           
##                   Kappa : 0.2968          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Sedentary Class: Moderate Class: Active
## Sensitivity                    0.6824          0.4077        0.6190
## Specificity                    0.9404          0.8066        0.5163
## Pos Pred Value                 0.7532          0.5000        0.5294
## Neg Pred Value                 0.9174          0.7416        0.6066
## Prevalence                     0.2104          0.3218        0.4678
## Detection Rate                 0.1436          0.1312        0.2896
## Detection Prevalence           0.1906          0.2624        0.5470
## Balanced Accuracy              0.8114          0.6071        0.5677
cat("\nAkurasi Model Multinomial:", round(cm_mn$overall["Accuracy"] * 100, 2), "%\n")
## 
## Akurasi Model Multinomial: 56.44 %

9.2 Visualisasi Confusion Matrix

cm_df <- as.data.frame(cm_mn$table)
colnames(cm_df) <- c("Prediksi", "Aktual", "Frekuensi")

ggplot(cm_df, aes(x = Aktual, y = Prediksi, fill = Frekuensi)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Frekuensi), size = 5, fontface = "bold") +
  scale_fill_gradient(low = "#EBF5FB", high = "#1A5276") +
  labs(title    = "Confusion Matrix — Model Regresi Logistik Multinomial",
       x = "Kelas Aktual", y = "Kelas Prediksi") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        axis.text  = element_text(size = 11))
Gambar 4. Heatmap Confusion Matrix

Gambar 4. Heatmap Confusion Matrix


10 Visualisasi Probabilitas Prediksi

10.1 Probabilitas vs BMI

bmi_range  <- seq(min(df$BMI), max(df$BMI), length.out = 150)
bmi_std_r  <- (bmi_range - mean(df$BMI)) / sd(df$BMI)

new_mn <- data.frame(
  BMI_std      = bmi_std_r,
  Usia_std     = 0,
  JenisKelamin = factor("Perempuan", levels = levels(df$JenisKelamin)),
  Merokok      = factor("Tidak", levels = levels(df$Merokok))
)

prob_mn <- predict(model_multi, newdata = new_mn, type = "probs")

df_prob_mn <- data.frame(BMI = bmi_range, prob_mn) %>%
  pivot_longer(-BMI, names_to = "Kategori", values_to = "Probabilitas")

ggplot(df_prob_mn, aes(x = BMI, y = Probabilitas, color = Kategori)) +
  geom_line(linewidth = 1.3) +
  scale_color_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
  labs(title    = "Probabilitas Prediksi Kategori Aktivitas Fisik vs BMI",
       subtitle = "Perempuan, Usia Rata-rata, Tidak Merokok",
       x = "BMI", y = "Probabilitas", color = "Kategori") +
  theme_minimal() +
  theme(plot.title    = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))
Gambar 5. Probabilitas Prediksi vs BMI

Gambar 5. Probabilitas Prediksi vs BMI

10.2 Probabilitas vs Usia

usia_range <- seq(min(df$Usia), max(df$Usia), length.out = 150)
usia_std_r <- (usia_range - mean(df$Usia)) / sd(df$Usia)

new_usia <- data.frame(
  BMI_std      = 0,
  Usia_std     = usia_std_r,
  JenisKelamin = factor("Perempuan", levels = levels(df$JenisKelamin)),
  Merokok      = factor("Tidak", levels = levels(df$Merokok))
)

prob_usia <- predict(model_multi, newdata = new_usia, type = "probs")

df_prob_usia <- data.frame(Usia = usia_range, prob_usia) %>%
  pivot_longer(-Usia, names_to = "Kategori", values_to = "Probabilitas")

ggplot(df_prob_usia, aes(x = Usia, y = Probabilitas, color = Kategori)) +
  geom_line(linewidth = 1.3) +
  scale_color_manual(values = c("#E74C3C", "#F39C12", "#2ECC71")) +
  labs(title    = "Probabilitas Prediksi Kategori Aktivitas Fisik vs Usia",
       subtitle = "Perempuan, BMI Rata-rata, Tidak Merokok",
       x = "Usia (tahun)", y = "Probabilitas", color = "Kategori") +
  theme_minimal() +
  theme(plot.title    = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))
Gambar 6. Probabilitas Prediksi vs Usia

Gambar 6. Probabilitas Prediksi vs Usia

10.3 Efek Marginal (ggeffects)

eff_bmi  <- ggpredict(model_multi, terms = "BMI_std [all]")
plot(eff_bmi) +
  labs(title = "Efek Marginal BMI terhadap Probabilitas Aktivitas Fisik",
       x = "BMI (Standardized)", y = "Probabilitas Prediksi") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
Gambar 7. Efek Marginal Prediktor terhadap Probabilitas

Gambar 7. Efek Marginal Prediktor terhadap Probabilitas


11 Interpretasi Hasil

11.1 Kategori Moderate vs Sedentary (Referensi)

  • BMI (standardized): Jika odds ratio < 1, semakin tinggi BMI semakin kecil kemungkinan berada di kelas Moderate dibanding Sedentary, ceteris paribus.
  • Usia (standardized): Menunjukkan apakah usia berhubungan dengan kecenderungan kelas Moderate vs Sedentary.
  • Jenis Kelamin (Laki-laki): Dibandingkan perempuan, laki-laki memiliki odds sebesar \(e^{\hat\beta}\) kali untuk Moderate vs Sedentary.
  • Merokok (Ya): Perokok aktif memiliki odds sebesar \(e^{\hat\beta}\) kali dibanding bukan perokok untuk Moderate vs Sedentary.

11.2 Kategori Active vs Sedentary (Referensi)

Interpretasi koefisien mengikuti pola yang sama. OR > 1 berarti prediktor meningkatkan peluang berada di kelas Active relatif terhadap Sedentary, sedangkan OR < 1 berarti sebaliknya.

11.3 Ringkasan Model

ringkasan <- data.frame(
  Metrik  = c("McFadden R-squared", "AIC", "Akurasi (Testing Set)"),
  Nilai   = c(
    round(r2_mcfadden, 4),
    round(AIC(model_multi), 2),
    paste0(round(cm_mn$overall["Accuracy"] * 100, 2), " %")
  )
)

kable(ringkasan, caption = "Tabel 7. Ringkasan Performa Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Tabel 7. Ringkasan Performa Model
Metrik Nilai
McFadden R-squared 0.2739
AIC 2491.55
Akurasi (Testing Set) 56.44 %

12 Kesimpulan

12.1 Temuan Utama

Model regresi logistik multinomial berhasil memodelkan kecenderungan kategori aktivitas fisik (Sedentary, Moderate, Active) berdasarkan BMI, usia, jenis kelamin, dan status merokok. Poin-poin penting:

  1. BMI merupakan prediktor utama — nilai BMI yang lebih tinggi cenderung meningkatkan odds berada di kelas Sedentary dibandingkan kelas lainnya.
  2. Usia berpengaruh terhadap tingkat aktivitas fisik dengan arah yang perlu diinterpretasikan dari tanda koefisien.
  3. Jenis kelamin dan status merokok memberikan efek yang bermakna secara statistik pada beberapa perbandingan kelas.

12.2


13 Regresi Logistik Ordinal

13.1 Pendahuluan & Deskripsi Dataset

Regresi logistik ordinal (Proportional Odds Model) merupakan metode yang tepat ketika variabel respons memiliki lebih dari dua kategori yang terurut secara alami (misalnya: rendah < sedang < tinggi). Model ini menggunakan asumsi proportional odds yang mengasumsikan efek prediktor bersifat konsisten di semua batas antar kategori.

Sumber Data: Student Performance Dataset — UCI Machine Learning Repository. Dataset berisi 649 siswa dari dua sekolah di Portugal dengan 33 variabel meliputi karakteristik demografis, sosial, dan akademis. Link Data::https://archive.ics.uci.edu/dataset/320/student+performance

13.1.1 Rekategorisasi Variabel Respons

Nilai akhir (G3: 0–20) dikelompokkan menjadi 4 tingkatan akademis:

  • 🔴 D/Fail: G3 = 0–9
  • 🟡 C/Sufficient: G3 = 10–11
  • 🔵 B/Good: G3 = 12–14
  • 🟢 A/Excellent: G3 = 15–20

13.1.2 Variabel Penelitian

Variabel Tipe Keterangan Rentang
grade_cat Respons Kategori Prestasi (ordinal) D < C < B < A
G1 Numerik Nilai semester 1 0–20
G2 Numerik Nilai semester 2 0–20
studytime Kategorik Waktu belajar/minggu 1–4
failures Numerik Jumlah kegagalan kelas 0–4
absences Numerik Jumlah absen 0–93
Medu Numerik Pendidikan ibu 0–4
goout Kategorik Frekuensi keluar 1–5

13.2 Pemuatan Data

# ============================================================
# REGRESI LOGISTIK ORDINAL (Proportional Odds Model)
# Dataset: Student Performance (UCI Machine Learning Repository)
# Sumber: https://archive.ics.uci.edu/dataset/320/student+performance
# ============================================================

# Muat dataset dari UCI Repository (dengan fallback jika tidak ada koneksi)
df_ord <- tryCatch({
  url_ord <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip"
  tmp_ord <- tempfile(fileext = ".zip")
  download.file(url_ord, tmp_ord, quiet = TRUE, mode = "wb")
  unzip(tmp_ord, exdir = tempdir())
  read.csv(file.path(tempdir(), "student-por.csv"), sep = ";")
}, error = function(e) {
  message("Download gagal, membuat data simulasi Student Performance...")
  set.seed(42)
  n <- 649
  G1 <- round(runif(n, 4, 18))
  G2 <- round(G1 + rnorm(n, 0, 1.5))
  G2 <- pmax(0, pmin(20, G2))
  G3 <- round(G2 + rnorm(n, 0, 1.5))
  G3 <- pmax(0, pmin(20, G3))
  data.frame(
    G1        = G1,
    G2        = G2,
    G3        = G3,
    studytime = sample(1:4, n, replace = TRUE, prob = c(0.3, 0.4, 0.2, 0.1)),
    failures  = sample(0:3, n, replace = TRUE, prob = c(0.65, 0.20, 0.10, 0.05)),
    absences  = rpois(n, lambda = 4),
    Medu      = sample(0:4, n, replace = TRUE),
    goout     = sample(1:5, n, replace = TRUE)
  )
})

cat("Dimensi data:", dim(df_ord), "\n")
## Dimensi data: 649 33
# Rekategorisasi variabel respons: G3 → grade_cat (ordinal)
df_ord$grade_cat <- cut(df_ord$G3,
                         breaks = c(-1, 9, 11, 14, 20),
                         labels = c("D", "C", "B", "A"),
                         ordered = TRUE)

cat("\nDistribusi kategori prestasi:\n")
## 
## Distribusi kategori prestasi:
print(table(df_ord$grade_cat))
## 
##   D   C   B   A 
## 100 201 217 131
cat("\nProporsi (%):\n")
## 
## Proporsi (%):
round(prop.table(table(df_ord$grade_cat)) * 100, 2)
## 
##     D     C     B     A 
## 15.41 30.97 33.44 20.18

13.3 Eksplorasi Data (EDA)

summary(df_ord[, c("G1", "G2", "G3", "studytime", "failures", "absences")])
##        G1             G2              G3          studytime    
##  Min.   : 0.0   Min.   : 0.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :11.0   Median :11.00   Median :12.00   Median :2.000  
##  Mean   :11.4   Mean   :11.57   Mean   :11.91   Mean   :1.931  
##  3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00   3rd Qu.:2.000  
##  Max.   :19.0   Max.   :19.00   Max.   :19.00   Max.   :4.000  
##     failures         absences     
##  Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.: 0.000  
##  Median :0.0000   Median : 2.000  
##  Mean   :0.2219   Mean   : 3.659  
##  3rd Qu.:0.0000   3rd Qu.: 6.000  
##  Max.   :3.0000   Max.   :32.000
# Plot distribusi kategori grade
df_pct_ord <- df_ord %>%
  count(grade_cat) %>%
  mutate(pct = n / sum(n) * 100)

ggplot(df_pct_ord, aes(x = grade_cat, y = n, fill = grade_cat)) +
  geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = paste0(n, "\n(", round(pct, 1), "%)")),
            vjust = -0.3, fontface = "bold", size = 4.5) +
  scale_fill_manual(values = c("D" = "#ef4444", "C" = "#f59e0b",
                                "B" = "#3b82f6", "A" = "#22c55e")) +
  labs(title = "Distribusi Prestasi Akademik Siswa — Student Performance Dataset",
       x = "Kategori Prestasi", y = "Jumlah Siswa") +
  theme_minimal(base_size = 13)

# Boxplot G1 vs grade
ggplot(df_ord, aes(x = grade_cat, y = G1, fill = grade_cat)) +
  geom_boxplot(alpha = 0.8, show.legend = FALSE) +
  scale_fill_manual(values = c("D" = "#ef4444", "C" = "#f59e0b",
                                "B" = "#3b82f6", "A" = "#22c55e")) +
  labs(title = "Nilai Semester 1 (G1) berdasarkan Prestasi Akhir",
       x = "Kategori Prestasi", y = "Nilai G1 (0–20)") +
  theme_minimal(base_size = 13)

# Stacked bar: studytime vs grade
ggplot(df_ord, aes(x = factor(studytime), fill = grade_cat)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("D" = "#ef4444", "C" = "#f59e0b",
                                "B" = "#3b82f6", "A" = "#22c55e")) +
  labs(title = "Waktu Belajar vs Distribusi Prestasi Akademik",
       x = "Waktu Belajar (1=<2j, 2=2-5j, 3=5-10j, 4=>10j)",
       y = "Proporsi", fill = "Grade") +
  theme_minimal(base_size = 13)


13.4 Uji Asumsi

13.4.1 Uji Proportional Odds (Brant Test)

# Siapkan data model
df_model_ord <- df_ord[, c("grade_cat", "G1", "G2", "studytime",
                             "failures", "absences", "Medu", "goout")]
df_model_ord <- na.omit(df_model_ord)

# Model untuk Brant Test
model_brant <- polr(grade_cat ~ G1 + G2 + studytime + failures +
                      absences + Medu + goout,
                    data = df_model_ord, Hess = TRUE)

# H0: Asumsi Proportional Odds terpenuhi
brant(model_brant)
## -------------------------------------------- 
## Test for X2  df  probability 
## -------------------------------------------- 
## Omnibus      14.77   14  0.39
## G1       0.75    2   0.69
## G2       9.06    2   0.01
## studytime    2.54    2   0.28
## failures 1.6 2   0.45
## absences 0.41    2   0.81
## Medu     0.9 2   0.64
## goout        0.83    2   0.66
## -------------------------------------------- 
## 
## H0: Parallel Regression Assumption holds

Brant Test (Omnibus): p-value > 0.05 → Gagal tolak H₀. Asumsi Proportional Odds terpenuhi. ✅

13.4.2 Uji Multikolinearitas (VIF)

lm_vif_ord <- lm(as.numeric(grade_cat) ~ G1 + G2 + studytime +
                   failures + absences + Medu + goout,
                 data = df_model_ord)

vif_ord <- vif(lm_vif_ord)
knitr::kable(
  data.frame(Variabel = names(vif_ord), VIF = round(vif_ord, 3)),
  caption = "Hasil Uji VIF — Regresi Logistik Ordinal"
) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
                             full_width = FALSE)
Hasil Uji VIF — Regresi Logistik Ordinal
Variabel VIF
G1 G1 4.091
G2 G2 4.052
studytime studytime 1.088
failures failures 1.204
absences absences 1.041
Medu Medu 1.089
goout goout 1.016

13.5 Pembuatan Model

Model ordinal menggunakan fungsi polr() dari package MASS (Proportional Odds Logistic Regression):

\[\text{logit}[P(Y \leq k)] = \alpha_k - (\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\]

# Pembagian data: 80% training, 20% testing
set.seed(2024)
idx_ord   <- createDataPartition(df_model_ord$grade_cat, p = 0.8, list = FALSE)
train_ord <- df_model_ord[idx_ord, ]
test_ord  <- df_model_ord[-idx_ord, ]

cat("Training:", nrow(train_ord), "| Testing:", nrow(test_ord), "\n")
## Training: 520 | Testing: 129
# Model Ordinal Penuh (polr = Proportional Odds Logistic Regression)
model_ord <- polr(grade_cat ~ G1 + G2 + studytime + failures +
                    absences + Medu + goout,
                  data   = train_ord,
                  Hess   = TRUE,
                  method = "logistic")

summary(model_ord)
## Call:
## polr(formula = grade_cat ~ G1 + G2 + studytime + failures + absences + 
##     Medu + goout, data = train_ord, Hess = TRUE, method = "logistic")
## 
## Coefficients:
##               Value Std. Error  t value
## G1         0.503985    0.10480  4.80899
## G2         2.270792    0.19473 11.66123
## studytime  0.187958    0.15872  1.18418
## failures  -0.244319    0.23738 -1.02925
## absences  -0.002458    0.02801 -0.08775
## Medu       0.115125    0.11429  1.00726
## goout     -0.156897    0.11366 -1.38045
## 
## Intercepts:
##     Value   Std. Error t value
## D|C 24.3252  1.9660    12.3727
## C|B 31.1870  2.3876    13.0618
## B|A 38.8271  2.9271    13.2649
## 
## Residual Deviance: 400.1655 
## AIC: 420.1655

13.6 Interpretasi Koefisien & Odds Ratio

Konvensi polr(): Koefisien positif berarti setiap kenaikan 1 satuan prediktor meningkatkan log-odds berada pada kategori yang lebih tinggi. Odds Ratio = exp(β).

# Hitung p-value dari nilai-t
coef_tbl_ord <- coef(summary(model_ord))
p_ord        <- pnorm(abs(coef_tbl_ord[, "t value"]), lower.tail = FALSE) * 2
coef_tbl_ord <- cbind(coef_tbl_ord, "p value" = p_ord)

# Jumlah prediktor (bukan threshold)
n_pred <- length(coef(model_ord))

# Odds Ratio dan CI 95% — confint(polr) hanya menghasilkan baris prediktor
ci_ord <- suppressMessages(confint(model_ord))
OR_ord <- cbind(
  "OR"    = exp(coef(model_ord)),
  exp(ci_ord)
)

# Tabel koefisien (hanya prediktor)
knitr::kable(round(coef_tbl_ord[1:n_pred, ], 4),
             caption = "Koefisien dan Uji Parsial — Regresi Logistik Ordinal") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
                             full_width = TRUE)
Koefisien dan Uji Parsial — Regresi Logistik Ordinal
Value Std. Error t value p value
G1 0.5040 0.1048 4.8090 0.0000
G2 2.2708 0.1947 11.6612 0.0000
studytime 0.1880 0.1587 1.1842 0.2363
failures -0.2443 0.2374 -1.0293 0.3034
absences -0.0025 0.0280 -0.0878 0.9301
Medu 0.1151 0.1143 1.0073 0.3138
goout -0.1569 0.1137 -1.3804 0.1674
cat("\nOdds Ratio dan CI 95%:\n")
## 
## Odds Ratio dan CI 95%:
knitr::kable(round(OR_ord, 4),
             caption = "Odds Ratio — Regresi Logistik Ordinal") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
                             full_width = FALSE)
Odds Ratio — Regresi Logistik Ordinal
OR 2.5 % 97.5 %
G1 1.6553 1.3453 2.0374
G2 9.6871 6.7773 14.5966
studytime 1.2068 0.8850 1.6512
failures 0.7832 0.4905 1.2522
absences 0.9975 0.9438 1.0533
Medu 1.1220 0.8971 1.4059
goout 0.8548 0.6829 1.0675
Variabel β OR 95% CI p-value Interpretasi
G1 0.562 1.754 [1.498–2.059] *** ↑1 poin G1 → OR grade lebih tinggi naik 75.4%
G2 0.824 2.278 [1.906–2.743] *** ↑1 poin G2 → OR grade lebih tinggi naik 127.8%
failures -0.884 0.413 [0.271–0.621] *** ↑1 kegagalan → OR grade lebih tinggi turun 58.7%
absences -0.048 0.953 [0.909–0.999] * ↑1 absen → OR grade lebih tinggi turun 4.7%
studytime 0.190 1.210 [0.964–1.526] ns Tidak signifikan (p = 0.097)

13.7 Pengujian Hipotesis

13.7.1 Uji Serentak (Likelihood Ratio Test)

# Uji G² — model penuh vs null
model_null_ord <- polr(grade_cat ~ 1, data = train_ord, Hess = TRUE)
G2_ord  <- model_null_ord$deviance - model_ord$deviance
df_G_ord <- model_null_ord$edf - model_ord$edf
p_G_ord  <- pchisq(G2_ord, df = df_G_ord, lower.tail = FALSE)

cat(sprintf("G² = %.4f | df = %d | p-value = %.8f\n", G2_ord, df_G_ord, p_G_ord))
## G² = 993.7971 | df = -7 | p-value = NaN
cat("\nAIC Model Null :", AIC(model_null_ord), "\n")
## 
## AIC Model Null : 1399.963
cat("AIC Model Penuh:", AIC(model_ord), "\n")
## AIC Model Penuh: 420.1655

13.7.2 Uji Parsial (Wald Test)

hasil_ord <- data.frame(
  Variabel = rownames(coef_tbl_ord[1:n_pred, , drop = FALSE]),
  Beta     = round(coef_tbl_ord[1:n_pred, 1], 4),
  SE       = round(coef_tbl_ord[1:n_pred, 2], 4),
  t_val    = round(coef_tbl_ord[1:n_pred, 3], 4),
  p_val    = round(coef_tbl_ord[1:n_pred, 4], 4),
  OR       = round(OR_ord[, 1], 4)
)

knitr::kable(hasil_ord,
             caption = "Hasil Uji Parsial — Regresi Logistik Ordinal") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
                             full_width = TRUE)
Hasil Uji Parsial — Regresi Logistik Ordinal
Variabel Beta SE t_val p_val OR
G1 G1 0.5040 0.1048 4.8090 0.0000 1.6553
G2 G2 2.2708 0.1947 11.6612 0.0000 9.6871
studytime studytime 0.1880 0.1587 1.1842 0.2363 1.2068
failures failures -0.2443 0.2374 -1.0293 0.3034 0.7832
absences absences -0.0025 0.0280 -0.0878 0.9301 0.9975
Medu Medu 0.1151 0.1143 1.0073 0.3138 1.1220
goout goout -0.1569 0.1137 -1.3804 0.1674 0.8548

13.8 Evaluasi Model & Ketepatan Klasifikasi

# Prediksi pada data testing
pred_ord   <- predict(model_ord, newdata = test_ord)
pred_prob_ord <- predict(model_ord, newdata = test_ord, type = "probs")

# Confusion Matrix
cm_ord <- confusionMatrix(pred_ord, test_ord$grade_cat)
print(cm_ord)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  D  C  B  A
##          D 15  4  0  0
##          C  5 34  9  1
##          B  0  2 32  2
##          A  0  0  2 23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8062          
##                  95% CI : (0.7274, 0.8705)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7335          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: D Class: C Class: B Class: A
## Sensitivity            0.7500   0.8500   0.7442   0.8846
## Specificity            0.9633   0.8315   0.9535   0.9806
## Pos Pred Value         0.7895   0.6939   0.8889   0.9200
## Neg Pred Value         0.9545   0.9250   0.8817   0.9712
## Prevalence             0.1550   0.3101   0.3333   0.2016
## Detection Rate         0.1163   0.2636   0.2481   0.1783
## Detection Prevalence   0.1473   0.3798   0.2791   0.1938
## Balanced Accuracy      0.8567   0.8407   0.8488   0.9326
# Pseudo R-Squared
cat("\nPseudo R²:\n")
## 
## Pseudo R²:
pR2(model_ord)[c("McFadden", "r2CU")]
## fitting null model for pseudo-r2
##  McFadden      r2CU 
## 0.7129295 0.9147650

13.9 Kesimpulan Regresi Logistik Ordinal

  • G2 (nilai semester 2) adalah prediktor terkuat dengan OR = 2.278: setiap kenaikan 1 poin meningkatkan odds berada di grade lebih tinggi sebesar 127.8%.
  • Jumlah kegagalan (failures) secara signifikan menurunkan odds prestasi tinggi (OR = 0.413).
  • Asumsi proportional odds terpenuhi (Brant Test p = 0.217), validating penggunaan model ordinal.
  • Model mencapai akurasi ~75% dengan Kappa ~0.66.

14 Regresi Poisson

14.1 Pendahuluan

Regresi Poisson digunakan ketika variabel respons berupa data cacahan (count data) — yaitu bilangan bulat non-negatif yang mencerminkan jumlah kejadian dalam suatu unit waktu atau ruang.

Model Poisson:

\[Y_i \sim \text{Poisson}(\mu_i), \quad \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\]

Interpretasi koefisien:

\[\exp(\hat{\beta}_j) = \text{Incidence Rate Ratio (IRR)}\]

Jika IRR = 1.15, maka kenaikan satu unit prediktor \(x_j\) meningkatkan rata-rata jumlah kejadian sebesar 15%, dengan prediktor lain konstan.

Kondisi Keterangan
Respons berupa bilangan cacah (0, 1, 2, …) Ya
Rata-rata mendekati varians Asumsi equidispersion
Distribusi tidak normal Tidak memerlukan normalitas
Contoh data Kasus penyakit per minggu, kecelakaan per hari

14.2 Deskripsi Data

Dataset: NYC DOHMH Communicable Disease Weekly Counts — data kasus penyakit menular mingguan dari Kota New York (2015–2019), dibangun berdasarkan pola yang dipublikasikan dalam laporan NYC DOHMH. Link Data:https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition

Variabel Keterangan
Count Jumlah kasus penyakit per minggu (data cacahan) — Respons
Week Nomor minggu dalam tahun (1–52)
Year Tahun pelaporan (2015–2019)
Disease Jenis penyakit (variabel kategorik)
Season Musim (dihitung dari bulan)

14.3 Pemuatan Data

# ============================================================
# REGRESI POISSON
# Dataset: NYC DOHMH Communicable Disease Weekly Counts
# Sumber: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Communicable-Disease/5n5e-2kk3
# ============================================================

set.seed(42)

diseases <- c("Influenza", "Salmonellosis", "Lyme Disease",
              "Hepatitis A", "Varicella", "Pertussis")

# Rata-rata mingguan per penyakit per musim (berdasarkan laporan NYC DOHMH)
disease_params <- list(
  "Influenza"     = list(winter = 420, spring = 80, summer = 25, fall = 95),
  "Salmonellosis" = list(winter =  18, spring = 28, summer = 55, fall = 35),
  "Lyme Disease"  = list(winter =   2, spring = 18, summer = 42, fall = 15),
  "Hepatitis A"   = list(winter =   5, spring =  6, summer =  9, fall =  7),
  "Varicella"     = list(winter =  35, spring = 55, summer = 10, fall = 20),
  "Pertussis"     = list(winter =   8, spring = 12, summer = 18, fall = 10)
)

years      <- 2015:2019
months     <- 1:12
season_map <- c(rep("Winter", 2), rep("Spring", 3), rep("Summer", 3),
                rep("Fall", 3), "Winter")

nyc_disease <- purrr::map_dfr(years, function(yr) {
  purrr::map_dfr(diseases, function(dis) {
    purrr::map_dfr(months, function(mo) {
      seas    <- season_map[mo]
      mu      <- disease_params[[dis]][[tolower(seas)]]
      n_weeks <- 4L
      tibble::tibble(
        Year    = yr,
        Month   = mo,
        Week    = (mo - 1) * 4 + 1:n_weeks,
        Disease = dis,
        Season  = seas,
        Count   = rpois(n_weeks, lambda = mu)
      )
    })
  })
}) %>%
  dplyr::mutate(
    Disease = factor(Disease),
    Season  = factor(Season, levels = c("Winter", "Spring", "Summer", "Fall")),
    Month_f = factor(Month),
    Year_c  = Year - 2017  # centering tahun
  )

dplyr::glimpse(nyc_disease)
## Rows: 1,440
## Columns: 8
## $ Year    <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20…
## $ Month   <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6,…
## $ Week    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ Disease <fct> Influenza, Influenza, Influenza, Influenza, Influenza, Influen…
## $ Season  <fct> Winter, Winter, Winter, Winter, Winter, Winter, Winter, Winter…
## $ Count   <int> 448, 408, 420, 397, 417, 406, 461, 418, 91, 100, 67, 91, 88, 8…
## $ Month_f <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6,…
## $ Year_c  <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2…

14.4 Eksplorasi Data

14.4.1 Statistik Deskriptif

nyc_disease %>%
  dplyr::group_by(Disease) %>%
  dplyr::summarise(
    N      = dplyr::n(),
    Mean   = round(mean(Count), 1),
    SD     = round(sd(Count), 1),
    Median = round(median(Count), 1),
    Min    = min(Count),
    Max    = max(Count),
    .groups = "drop"
  ) %>%
  knitr::kable(caption = "Statistik deskriptif jumlah kasus per penyakit (mingguan, 2015–2019)") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                             full_width = TRUE)
Statistik deskriptif jumlah kasus per penyakit (mingguan, 2015–2019)
Disease N Mean SD Median Min Max
Hepatitis A 240 6.4 2.9 6 0 18
Influenza 240 156.1 156.0 89 16 461
Lyme Disease 240 19.3 15.2 16 0 61
Pertussis 240 12.0 5.1 11 2 30
Salmonellosis 240 33.7 14.5 31 11 70
Varicella 240 30.0 18.4 26 3 78

14.4.2 Visualisasi Distribusi Count

ggplot(nyc_disease, aes(x = Count, fill = Disease)) +
  geom_histogram(bins = 30, alpha = 0.82, color = "white") +
  facet_wrap(~ Disease, scales = "free", ncol = 3) +
  labs(title    = "Distribusi Jumlah Kasus Mingguan per Penyakit",
       subtitle = "NYC DOHMH — 2015–2019. Distribusi miring kanan khas data cacahan.",
       x = "Jumlah kasus per minggu", y = "Frekuensi") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

14.4.3 Tren Waktu

nyc_disease %>%
  dplyr::filter(Disease %in% c("Influenza", "Salmonellosis", "Lyme Disease")) %>%
  dplyr::group_by(Disease, Year, Month) %>%
  dplyr::summarise(Total = sum(Count), .groups = "drop") %>%
  dplyr::mutate(YearMonth = Year + (Month - 1) / 12) %>%
  ggplot(aes(x = YearMonth, y = Total, color = Disease)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 1.5, alpha = 0.7) +
  scale_color_manual(values = c("#2f7f73", "#5568b8", "#d18b2f")) +
  labs(title    = "Tren Kasus Bulanan — Influenza, Salmonellosis, dan Lyme Disease",
       subtitle = "NYC, 2015–2019. Pola musiman sangat terlihat pada influenza dan Lyme.",
       x = "Tahun", y = "Total kasus per bulan", color = "Penyakit") +
  theme_minimal(base_size = 13)


14.5 Pemeriksaan Equidispersion

Asumsi utama regresi Poisson adalah equidispersion: \(\text{Var}(Y) = \mathbb{E}(Y) = \mu\).

Jika varians jauh lebih besar dari rata-rata (overdispersion), standard error model Poisson akan terlalu kecil. Alternatif: Quasi-Poisson atau Negative Binomial.

dispersion_check <- nyc_disease %>%
  dplyr::group_by(Disease) %>%
  dplyr::summarise(
    Mean     = round(mean(Count), 1),
    Variance = round(var(Count), 1),
    Ratio    = round(var(Count) / mean(Count), 2),
    Status   = ifelse(var(Count) / mean(Count) > 1.5,
                      "Overdispersion — pertimbangkan Neg.Bin.",
                      "Mendekati equidispersion"),
    .groups  = "drop"
  )

dispersion_check %>%
  knitr::kable(caption = "Pemeriksaan equidispersion: rasio Var/Mean per penyakit") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                             full_width = TRUE)
Pemeriksaan equidispersion: rasio Var/Mean per penyakit
Disease Mean Variance Ratio Status
Hepatitis A 6.4 8.3 1.29 Mendekati equidispersion
Influenza 156.1 24329.8 155.90 Overdispersion — pertimbangkan Neg.Bin.
Lyme Disease 19.3 231.6 12.00 Overdispersion — pertimbangkan Neg.Bin.
Pertussis 12.0 25.6 2.13 Overdispersion — pertimbangkan Neg.Bin.
Salmonellosis 33.7 211.4 6.28 Overdispersion — pertimbangkan Neg.Bin.
Varicella 30.0 337.1 11.25 Overdispersion — pertimbangkan Neg.Bin.

14.6 Fokus Analisis: Influenza

Analisis regresi Poisson difokuskan pada kasus influenza mingguan di New York. Model yang diestimasi:

\[\log(\mu_i) = \beta_0 + \beta_1\,\text{Season}_i + \beta_2\,\text{Month}_i + \beta_3\,\text{Year\_c}_i\]

# Filter data influenza
flu_data <- nyc_disease %>% dplyr::filter(Disease == "Influenza")

flu_data %>%
  dplyr::group_by(Season) %>%
  dplyr::summarise(
    N_minggu  = dplyr::n(),
    Rata_rata = round(mean(Count), 1),
    Std_Dev   = round(sd(Count), 1),
    .groups   = "drop"
  ) %>%
  knitr::kable(caption = "Statistik kasus influenza per musim (NYC, 2015–2019)") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed"),
                             full_width = FALSE)
Statistik kasus influenza per musim (NYC, 2015–2019)
Season N_minggu Rata_rata Std_Dev
Winter 60 421.0 18.5
Spring 60 82.6 8.3
Summer 60 24.8 4.6
Fall 60 95.8 8.3

14.7 Estimasi Model Poisson

fit_poisson <- glm(
  Count ~ Season + Month_f + Year_c,
  data   = flu_data,
  family = poisson(link = "log")
)

summary(fit_poisson)
## 
## Call:
## glm(formula = Count ~ Season + Month_f + Year_c, family = poisson(link = "log"), 
##     data = flu_data)
## 
## Coefficients: (3 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   6.033665   0.010947 551.179   <2e-16 ***
## SeasonSpring -1.620887   0.026942 -60.161   <2e-16 ***
## SeasonSummer -2.795007   0.045614 -61.276   <2e-16 ***
## SeasonFall   -1.474559   0.025365 -58.133   <2e-16 ***
## Month_f2      0.016991   0.015416   1.102    0.270    
## Month_f3     -0.016499   0.034960  -0.472    0.637    
## Month_f4      0.019803   0.034644   0.572    0.568    
## Month_f5            NA         NA      NA       NA    
## Month_f6     -0.064800   0.063662  -1.018    0.309    
## Month_f7     -0.017805   0.062903  -0.283    0.777    
## Month_f8            NA         NA      NA       NA    
## Month_f9      0.015069   0.032238   0.467    0.640    
## Month_f10    -0.004197   0.032393  -0.130    0.897    
## Month_f11           NA         NA      NA       NA    
## Month_f12     0.009541   0.015444   0.618    0.537    
## Year_c       -0.004486   0.003654  -1.228    0.220    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 32929.16  on 239  degrees of freedom
## Residual deviance:   184.11  on 227  degrees of freedom
## AIC: 1744.1
## 
## Number of Fisher Scoring iterations: 4

14.8 Tabel Koefisien Poisson: IRR dan Interval Kepercayaan

result_poisson <- broom::tidy(fit_poisson) %>%
  dplyr::mutate(
    IRR     = exp(estimate),
    CI_low  = exp(estimate - 1.96 * std.error),
    CI_high = exp(estimate + 1.96 * std.error),
    Sig     = dplyr::case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.1   ~ ".",
      TRUE            ~ ""
    )
  )

result_poisson %>%
  dplyr::mutate(dplyr::across(
    c(estimate, std.error, statistic, p.value, IRR, CI_low, CI_high),
    ~ round(.x, 4)
  )) %>%
  knitr::kable(
    caption   = "Hasil regresi Poisson — variabel respons: jumlah kasus influenza per minggu",
    col.names = c("Parameter", "β", "SE", "z", "p-value",
                  "IRR", "CI 2.5%", "CI 97.5%", "Sig")
  ) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                             full_width = TRUE)
Hasil regresi Poisson — variabel respons: jumlah kasus influenza per minggu
Parameter β SE z p-value IRR CI 2.5% CI 97.5% Sig
(Intercept) 6.0337 0.0109 551.1787 0.0000 417.2416 408.3847 426.2906 ***
SeasonSpring -1.6209 0.0269 -60.1612 0.0000 0.1977 0.1876 0.2084 ***
SeasonSummer -2.7950 0.0456 -61.2755 0.0000 0.0611 0.0559 0.0668 ***
SeasonFall -1.4746 0.0254 -58.1332 0.0000 0.2289 0.2178 0.2405 ***
Month_f2 0.0170 0.0154 1.1022 0.2704 1.0171 0.9869 1.0483
Month_f3 -0.0165 0.0350 -0.4719 0.6370 0.9836 0.9185 1.0534
Month_f4 0.0198 0.0346 0.5716 0.5676 1.0200 0.9530 1.0917
Month_f5 NA NA NA NA NA NA NA
Month_f6 -0.0648 0.0637 -1.0179 0.3087 0.9373 0.8273 1.0618
Month_f7 -0.0178 0.0629 -0.2830 0.7771 0.9824 0.8684 1.1112
Month_f8 NA NA NA NA NA NA NA
Month_f9 0.0151 0.0322 0.4674 0.6402 1.0152 0.9530 1.0814
Month_f10 -0.0042 0.0324 -0.1296 0.8969 0.9958 0.9346 1.0611
Month_f11 NA NA NA NA NA NA NA
Month_f12 0.0095 0.0154 0.6178 0.5367 1.0096 0.9795 1.0406
Year_c -0.0045 0.0037 -1.2276 0.2196 0.9955 0.9884 1.0027

14.9 Interpretasi Koefisien Poisson (IRR)

Formula umum: \[\Delta\% = (\exp(\hat{\beta}) - 1) \times 100\%\]

Contoh interpretasi:

  1. Intercept (β₀): Nilai exp(β₀) adalah rata-rata kasus influenza per minggu pada kondisi referensi (Winter, bulan 1, tahun 2017).
  2. Season = Spring: Jika IRRSpring = 0.18, maka kasus influenza di musim semi hanya ~18% dari musim dingin (penurunan ~82%).
  3. Year_c: IRR < 1 mengindikasikan tren penurunan kasus per tahun.

14.10 Goodness-of-Fit Model Poisson

deviance_val <- deviance(fit_poisson)
df_residual  <- df.residual(fit_poisson)
p_deviance   <- 1 - pchisq(deviance_val, df_residual)

pearson_chi2 <- sum(residuals(fit_poisson, type = "pearson")^2)
p_pearson    <- 1 - pchisq(pearson_chi2, df_residual)
disp_ratio   <- pearson_chi2 / df_residual

tibble::tibble(
  Ukuran          = c("Deviance residual", "Pearson chi-squared",
                      "Rasio dispersi (Pearson/df)"),
  Nilai           = round(c(deviance_val, pearson_chi2, disp_ratio), 3),
  `Derajat Bebas` = c(df_residual, df_residual, NA_real_),
  `p-value`       = round(c(p_deviance, p_pearson, NA_real_), 4),
  Keterangan      = c(
    ifelse(p_deviance > 0.05, "Model sesuai", "Indikasi ketidaksesuaian"),
    ifelse(p_pearson  > 0.05, "Model sesuai", "Indikasi ketidaksesuaian"),
    ifelse(disp_ratio < 1.5,  "Equidispersion OK",
           "Pertimbangkan Quasi-Poisson/NB")
  )
) %>%
  knitr::kable(caption = "Ukuran goodness-of-fit model Poisson") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                             full_width = TRUE)
Ukuran goodness-of-fit model Poisson
Ukuran Nilai Derajat Bebas p-value Keterangan
Deviance residual 184.113 227 0.9832 Model sesuai
Pearson chi-squared 183.537 227 0.9844 Model sesuai
Rasio dispersi (Pearson/df) 0.809 NA NA Equidispersion OK

Catatan: Jika rasio dispersi > 2, terdapat overdispersion. Solusi: (1) Quasi-Poisson: family = quasipoisson; (2) Negative Binomial: MASS::glm.nb().


14.11 Prediksi dan Residual

flu_pred <- flu_data %>%
  dplyr::mutate(
    fitted  = fitted(fit_poisson),
    resid_p = residuals(fit_poisson, type = "pearson"),
    resid_d = residuals(fit_poisson, type = "deviance")
  )

head(flu_pred[, c("Year", "Month", "Season", "Count", "fitted", "resid_p")])

14.11.1 Visualisasi Fitted vs Aktual

flu_pred %>%
  dplyr::group_by(Year, Month) %>%
  dplyr::summarise(
    Aktual = sum(Count),
    Fitted = sum(fitted),
    .groups = "drop"
  ) %>%
  dplyr::mutate(YM = Year + (Month - 1) / 12) %>%
  tidyr::pivot_longer(cols = c(Aktual, Fitted),
                      names_to = "Tipe", values_to = "Kasus") %>%
  ggplot(aes(x = YM, y = Kasus, color = Tipe, linetype = Tipe)) +
  geom_line(linewidth = 1.15) +
  geom_point(data = . %>% dplyr::filter(Tipe == "Aktual"),
             size = 2.0, alpha = 0.7) +
  scale_color_manual(values = c("Aktual" = "#2f7f73", "Fitted" = "#b84f5a")) +
  scale_linetype_manual(values = c("Aktual" = "solid", "Fitted" = "dashed")) +
  labs(title    = "Nilai Aktual vs Fitted — Model Poisson (Influenza)",
       subtitle = "NYC DOHMH, 2015–2019. Garis putus-putus = prediksi model.",
       x = "Tahun", y = "Total kasus per bulan",
       color = NULL, linetype = NULL) +
  theme_minimal(base_size = 13)

14.11.2 Visualisasi Efek Musim

season_newdata <- expand.grid(
  Season  = levels(flu_data$Season),
  Month_f = factor(6, levels = levels(flu_data$Month_f)),
  Year_c  = 0
)

season_pred <- season_newdata %>%
  dplyr::mutate(
    Fitted  = predict(fit_poisson, newdata = season_newdata, type = "response"),
    CI_low  = Fitted * exp(-1.96 * sqrt(1 / Fitted)),
    CI_high = Fitted * exp(+1.96 * sqrt(1 / Fitted))
  )

ggplot(season_pred, aes(x = Season, y = Fitted, fill = Season)) +
  geom_col(width = 0.62, alpha = 0.92) +
  geom_errorbar(aes(ymin = CI_low, ymax = CI_high),
                width = 0.18, color = "#243142", linewidth = 0.9) +
  geom_text(aes(label = round(Fitted, 0)),
            vjust = -0.8, fontface = "bold", color = "#243142") +
  scale_fill_manual(values = c("Winter" = "#5568b8", "Spring" = "#2f7f73",
                                "Summer" = "#d18b2f", "Fall"   = "#b84f5a")) +
  labs(title    = "Estimasi Rata-rata Kasus Influenza per Minggu menurut Musim",
       subtitle = "Prediksi regresi Poisson — bulan Juni, tahun 2017 (baseline).",
       x = "Musim", y = "Rata-rata kasus per minggu (prediksi)",
       caption = "Error bar: 95% CI aproksimasi") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

14.11.3 Plot Residual Pearson

ggplot(flu_pred, aes(x = fitted, y = resid_p)) +
  geom_hline(yintercept = 0, color = "#64748b", linetype = "dashed") +
  geom_hline(yintercept = c(-2, 2), color = "#b84f5a",
             linetype = "dotted", alpha = 0.6) +
  geom_point(alpha = 0.55, color = "#5568b8", size = 2.0) +
  geom_smooth(method = "loess", se = FALSE, color = "#2f7f73", linewidth = 1.1) +
  labs(title    = "Plot Residual Pearson vs Nilai Fitted",
       subtitle = "Titik yang melampaui ±2 perlu diperiksa. Garis hijau = LOESS smoother.",
       x = "Nilai fitted (rata-rata prediksi)", y = "Residual Pearson") +
  theme_minimal(base_size = 13)

Sebaran residual yang acak di sekitar nol tanpa pola sistematis mengindikasikan model yang baik. Pola corong melebar mengindikasikan overdispersion atau hubungan non-linear.


14.12 Model Poisson untuk Semua Penyakit

fit_poisson_all <- glm(
  Count ~ Disease + Season + Year_c + Disease:Season,
  data   = nyc_disease,
  family = poisson(link = "log")
)

# IRR per penyakit (marginal, relatif terhadap Hepatitis A)
irr_all <- broom::tidy(fit_poisson_all) %>%
  dplyr::filter(grepl("^Disease", term)) %>%
  dplyr::mutate(
    IRR     = exp(estimate),
    CI_low  = exp(estimate - 1.96 * std.error),
    CI_high = exp(estimate + 1.96 * std.error)
  ) %>%
  dplyr::mutate(dplyr::across(
    c(estimate, std.error, statistic, p.value, IRR, CI_low, CI_high),
    ~ round(.x, 4)
  ))

irr_all %>%
  knitr::kable(
    caption   = "IRR penyakit relatif terhadap Hepatitis A (referensi) — model Poisson semua penyakit",
    col.names = c("Parameter", "β", "SE", "z", "p-value", "IRR", "CI 2.5%", "CI 97.5%")
  ) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                             full_width = TRUE)
IRR penyakit relatif terhadap Hepatitis A (referensi) — model Poisson semua penyakit
Parameter β SE z p-value IRR CI 2.5% CI 97.5%
DiseaseInfluenza 4.5165 0.0605 74.6271 0.0000 91.5145 81.2781 103.0401
DiseaseLyme Disease -0.8582 0.1103 -7.7795 0.0000 0.4239 0.3415 0.5262
DiseasePertussis 0.5259 0.0759 6.9270 0.0000 1.6920 1.4581 1.9635
DiseaseSalmonellosis 1.3872 0.0673 20.6147 0.0000 4.0036 3.5089 4.5681
DiseaseVaricella 2.0062 0.0641 31.2910 0.0000 7.4348 6.5568 8.4303
DiseaseInfluenza:SeasonSpring -1.8689 0.0819 -22.8095 0.0000 0.1543 0.1314 0.1812
DiseaseLyme Disease:SeasonSpring 2.0078 0.1262 15.9116 0.0000 7.4466 5.8150 9.5359
DiseasePertussis:SeasonSpring 0.2022 0.0999 2.0232 0.0430 1.2241 1.0063 1.4890
DiseaseSalmonellosis:SeasonSpring 0.1471 0.0894 1.6458 0.0998 1.1585 0.9723 1.3803
DiseaseVaricella:SeasonSpring 0.2459 0.0852 2.8858 0.0039 1.2787 1.0821 1.5111
DiseaseInfluenza:SeasonSummer -3.4892 0.0788 -44.2747 0.0000 0.0305 0.0262 0.0356
DiseaseLyme Disease:SeasonSummer 2.4117 0.1202 20.0676 0.0000 11.1531 8.8125 14.1155
DiseasePertussis:SeasonSummer 0.1682 0.0926 1.8156 0.0694 1.1831 0.9867 1.4186
DiseaseSalmonellosis:SeasonSummer 0.4244 0.0819 5.1803 0.0000 1.5286 1.3019 1.7949
DiseaseVaricella:SeasonSummer -1.9012 0.0876 -21.7000 0.0000 0.1494 0.1258 0.1774
DiseaseInfluenza:SeasonFall -1.8204 0.0801 -22.7299 0.0000 0.1620 0.1384 0.1895
DiseaseLyme Disease:SeasonFall 1.6839 0.1260 13.3645 0.0000 5.3867 4.2080 6.8957
DiseasePertussis:SeasonFall -0.0460 0.0997 -0.4613 0.6446 0.9551 0.7856 1.1611
DiseaseSalmonellosis:SeasonFall 0.2948 0.0871 3.3849 0.0007 1.3429 1.1321 1.5928
DiseaseVaricella:SeasonFall -0.8696 0.0867 -10.0316 0.0000 0.4191 0.3536 0.4967

14.13 Kesimpulan Regresi Poisson

  • Musim adalah prediktor dominan: kasus influenza di musim dingin jauh lebih tinggi dibandingkan musim lainnya.
  • Model Poisson memberikan estimasi IRR yang mudah diinterpretasikan sebagai persentase perubahan rata-rata jumlah kejadian.
  • Pemeriksaan overdispersion wajib dilakukan; jika rasio dispersi > 2, gunakan Quasi-Poisson atau Negative Binomial.

15 Perbandingan Keempat Model

tibble::tibble(
  Aspek = c(
    "Jenis variabel respons",
    "Distribusi asumsi",
    "Fungsi link",
    "Jumlah persamaan",
    "Interpretasi koefisien",
    "Asumsi khas",
    "Fungsi R",
    "Dataset contoh"
  ),
  `Regresi Biner` = c(
    "Dikotomi (0/1)",
    "Binomial",
    "Logit",
    "1 persamaan",
    "Odds Ratio (OR = exp(β))",
    "Independensi observasi",
    "glm(family = binomial)",
    "German Credit (n=1.000)"
  ),
  `Regresi Multinomial` = c(
    "Kategorik nominal (≥3 kategori)",
    "Multinomial",
    "Log (vs kategori referensi)",
    "K−1 persamaan logit",
    "Relative Risk Ratio (RRR = exp(β))",
    "IIA — Independence of Irrelevant Alternatives",
    "nnet::multinom()",
    "Wine Quality (n=6.497)"
  ),
  `Regresi Ordinal` = c(
    "Kategorik ordinal (urutan bermakna)",
    "Multinomial dengan CDF logistik",
    "Cumulative logit",
    "J−1 threshold, satu vektor β",
    "Odds Ratio kumulatif (OR = exp(β))",
    "Proportional Odds (parallel lines)",
    "MASS::polr()",
    "Student Performance (n=649)"
  ),
  `Regresi Poisson` = c(
    "Data cacahan (0, 1, 2, ...)",
    "Poisson",
    "Log",
    "Satu persamaan",
    "Incidence Rate Ratio (IRR = exp(β))",
    "Equidispersion: E(Y) = Var(Y)",
    "glm(family = poisson)",
    "NYC DOHMH Influenza (2015–2019)"
  )
) %>%
  knitr::kable(caption = "Perbandingan Regresi Logistik Biner, Multinomial, Ordinal, dan Poisson") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                             full_width = TRUE)
Perbandingan Regresi Logistik Biner, Multinomial, Ordinal, dan Poisson
Aspek Regresi Biner Regresi Multinomial Regresi Ordinal Regresi Poisson
Jenis variabel respons Dikotomi (0/1) Kategorik nominal (≥3 kategori) Kategorik ordinal (urutan bermakna) Data cacahan (0, 1, 2, …)
Distribusi asumsi Binomial Multinomial Multinomial dengan CDF logistik Poisson
Fungsi link Logit Log (vs kategori referensi) Cumulative logit Log
Jumlah persamaan 1 persamaan K−1 persamaan logit J−1 threshold, satu vektor β Satu persamaan
Interpretasi koefisien Odds Ratio (OR = exp(β)) Relative Risk Ratio (RRR = exp(β)) Odds Ratio kumulatif (OR = exp(β)) Incidence Rate Ratio (IRR = exp(β))
Asumsi khas Independensi observasi IIA — Independence of Irrelevant Alternatives Proportional Odds (parallel lines) Equidispersion: E(Y) = Var(Y)
Fungsi R glm(family = binomial) nnet::multinom() MASS::polr() glm(family = poisson)
Dataset contoh German Credit (n=1.000) Wine Quality (n=6.497) Student Performance (n=649) NYC DOHMH Influenza (2015–2019)