Dokumen ini menerapkan tiga model untuk variabel respons kategorik dan cacah, mengikuti kerangka materi Regresi Logistik Multinomial, Ordinal, dan Regresi Poisson. Ketiganya merupakan anggota Generalized Linear Model (GLM) yang dipilih menurut tipe variabel responsnya:
| Model | Tipe respons | Fungsi tautan | Fungsi R |
|---|---|---|---|
| Logistik Multinomial | Nominal (≥3, tanpa urutan) | baseline-category logit | nnet::multinom |
| Logistik Ordinal | Ordinal (berurutan) | cumulative logit | MASS::polr |
| Poisson | Cacah (count) | log | glm(family = poisson) |
Untuk tiap model ditempuh alur yang sama seperti pada materi: penyiapan data, estimasi, ringkasan koefisien (beserta standard error, nilai \(z\), dan \(p\)-value), interpretasi melalui odds ratio/RRR/IRR, prediksi probabilitas atau laju, evaluasi, visualisasi, dan pemeriksaan asumsi.
Analisis menggunakan data kualitas kopi dari Coffee Quality Institute (CQI), lembaga sertifikasi mutu kopi internasional. Tiap baris adalah satu sampel kopi yang dinilai panel Q-grader bersertifikat melalui proses cupping, mencakup skor sensorik total, negara asal, ketinggian lahan, metode pengolahan, kelembapan, dan jumlah cacat biji. Data dikompilasi James LeDoux dari basis data resmi CQI dan dipublikasikan terbuka melalui proyek TidyTuesday.
Referensi sumber data:
URL <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv"
d <- read.csv(URL, stringsAsFactors = FALSE)
d <- subset(d, total_cup_points > 0)
d$alt100 <- ifelse(d$altitude_mean_meters < 1 | d$altitude_mean_meters > 3000,
NA, d$altitude_mean_meters) / 100 # ketinggian per 100 m
d$moist_pct <- d$moisture * 100 # kelembapan (%)
keep <- c("Washed / Wet", "Natural / Dry", "Semi-washed / Semi-pulped")
d$Proses <- factor(ifelse(d$processing_method %in% keep, d$processing_method, NA),
levels = keep, labels = c("Washed", "Natural", "Semi-washed"))
d$Grade <- ordered(cut(d$total_cup_points, c(-Inf, 82, 84, Inf),
labels = c("Standard", "Premium", "Excellent")))
cat("Baris mentah terpakai:", nrow(d))## Baris mentah terpakai: 1338
Regresi logistik multinomial memodelkan respons nominal dengan ≥3 kategori melalui pendekatan baseline-category logit. Satu kategori dijadikan acuan, lalu log-odds tiap kategori lain terhadap acuan dimodelkan sebagai fungsi linear prediktor:
\[ \log\!\left(\frac{\pi_{ij}}{\pi_{iJ}}\right) = \beta_{0j} + \beta_{1j}x_{i1} + \dots + \beta_{pj}x_{ip}, \qquad j = 1, \dots, J-1. \]
Kasus: memodelkan metode pengolahan biji kopi (Washed / Natural / Semi-washed; acuan = Washed) berdasarkan ketinggian lahan, kelembapan, dan jumlah cacat.
data_multi <- na.omit(d[, c("Proses", "alt100", "moist_pct", "category_two_defects")])
data_multi$Proses <- relevel(data_multi$Proses, ref = "Washed")
data_multi %>% count(Proses) %>%
mutate(proporsi = round(n / sum(n), 3)) %>%
kable(caption = "Distribusi kategori metode pengolahan.")| Proses | n | proporsi |
|---|---|---|
| Washed | 729 | 0.757 |
| Natural | 182 | 0.189 |
| Semi-washed | 52 | 0.054 |
fit_multi <- multinom(Proses ~ alt100 + moist_pct + category_two_defects,
data = data_multi, trace = FALSE)
summary(fit_multi)## Call:
## multinom(formula = Proses ~ alt100 + moist_pct + category_two_defects,
## data = data_multi, trace = FALSE)
##
## Coefficients:
## (Intercept) alt100 moist_pct category_two_defects
## Natural 0.7272141 -0.1184566 -0.07053556 0.005650944
## Semi-washed -1.3785958 -0.1456347 0.06094413 -0.021289559
##
## Std. Errors:
## (Intercept) alt100 moist_pct category_two_defects
## Natural 0.2977795 0.01970870 0.01800027 0.01528571
## Semi-washed 0.5727869 0.03269322 0.04300557 0.02902267
##
## Residual Deviance: 1245.198
## AIC: 1261.198
co <- summary(fit_multi)$coefficients
se <- summary(fit_multi)$standard.errors
z <- co / se
p <- 2 * (1 - pnorm(abs(z)))
multi_tab <- data.frame(
Kontras = rep(rownames(co), each = ncol(co)),
Variabel = rep(colnames(co), times = nrow(co)),
b = round(as.vector(t(co)), 4),
SE = round(as.vector(t(se)), 4),
z = round(as.vector(t(z)), 3),
p_value = round(as.vector(t(p)), 4),
RRR = round(exp(as.vector(t(co))), 3))
kable(multi_tab, caption = "Estimasi koefisien multinomial (acuan = Washed).")| Kontras | Variabel | b | SE | z | p_value | RRR |
|---|---|---|---|---|---|---|
| Natural | (Intercept) | 0.7272 | 0.2978 | 2.442 | 0.0146 | 2.069 |
| Natural | alt100 | -0.1185 | 0.0197 | -6.010 | 0.0000 | 0.888 |
| Natural | moist_pct | -0.0705 | 0.0180 | -3.919 | 0.0001 | 0.932 |
| Natural | category_two_defects | 0.0057 | 0.0153 | 0.370 | 0.7116 | 1.006 |
| Semi-washed | (Intercept) | -1.3786 | 0.5728 | -2.407 | 0.0161 | 0.252 |
| Semi-washed | alt100 | -0.1456 | 0.0327 | -4.455 | 0.0000 | 0.864 |
| Semi-washed | moist_pct | 0.0609 | 0.0430 | 1.417 | 0.1564 | 1.063 |
| Semi-washed | category_two_defects | -0.0213 | 0.0290 | -0.734 | 0.4632 | 0.979 |
Koefisien dibaca sebagai relative risk ratio (RRR = \(e^{\beta}\)) terhadap kategori acuan Washed. Tiap kenaikan ketinggian 100 m menurunkan kecenderungan biji diproses Natural (RRR 0.888) maupun Semi-washed (RRR 0.864) dibanding diproses Washed — kopi dataran tinggi cenderung diproses basah. Kelembapan lebih tinggi menaikkan kecenderungan proses kering (Natural).
pred_class <- predict(fit_multi, type = "class")
cm <- table(Aktual = data_multi$Proses, Prediksi = pred_class)
kable(cm, caption = "Confusion matrix model multinomial.")| Washed | Natural | Semi-washed | |
|---|---|---|---|
| Washed | 725 | 4 | 0 |
| Natural | 165 | 17 | 0 |
| Semi-washed | 52 | 0 | 0 |
## Akurasi klasifikasi: 77.1%
grid <- data.frame(
alt100 = seq(min(data_multi$alt100), max(data_multi$alt100), length.out = 200),
moist_pct = mean(data_multi$moist_pct),
category_two_defects = mean(data_multi$category_two_defects))
pr <- as.data.frame(predict(fit_multi, newdata = grid, type = "probs"))
plot_df <- cbind(grid["alt100"], pr) %>%
pivot_longer(-alt100, names_to = "Proses", values_to = "prob")
ggplot(plot_df, aes(alt100 * 100, prob, color = Proses)) +
geom_line(linewidth = 1.3) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = pal) +
labs(title = "Prediksi Probabilitas Metode Pengolahan",
subtitle = "Kelembapan & cacat ditahan pada rata-rata",
x = "Ketinggian lahan (m)", y = "Probabilitas") +
theme_minimal(base_size = 12)Model multinomial mengasumsikan IIA — rasio peluang antara dua kategori tak berubah oleh hadir/tidaknya kategori lain. Sebagai pemeriksaan praktis, kontribusi tiap prediktor diuji melalui likelihood-ratio test dengan membandingkan model penuh dan model tanpa prediktor tersebut.
fit_drop <- multinom(Proses ~ moist_pct + category_two_defects,
data = data_multi, trace = FALSE)
lrt <- 2 * (as.numeric(logLik(fit_multi)) - as.numeric(logLik(fit_drop)))
data.frame(Uji = "Kontribusi alt100 (LRT)",
Selisih_Devians = round(lrt, 3), df = 2,
p_value = signif(1 - pchisq(lrt, 2), 4)) %>%
kable(caption = "Uji rasio likelihood untuk kontribusi ketinggian.")| Uji | Selisih_Devians | df | p_value |
|---|---|---|---|
| Kontribusi alt100 (LRT) | 49.591 | 2 | 0 |
Regresi logistik ordinal (model proportional-odds / cumulative logit) memodelkan respons berurutan. Yang dimodelkan adalah peluang kumulatif menuju kategori lebih tinggi:
\[ \text{logit}\,[P(Y \le j)] = \alpha_j - (\beta_1 x_1 + \dots + \beta_p x_p), \qquad j = 1, \dots, J-1. \]
Satu set koefisien \(\beta\) berlaku untuk semua ambang (asumsi proportional odds).
Kasus: memodelkan grade mutu kopi (Standard < Premium < Excellent) berdasarkan metode pengolahan, ketinggian, dan kelembapan.
data_ord <- na.omit(d[, c("Grade", "Proses", "alt100", "moist_pct")])
data_ord %>% count(Grade) %>% mutate(proporsi = round(n / sum(n), 3)) %>%
kable(caption = "Distribusi grade mutu kopi.")| Grade | n | proporsi |
|---|---|---|
| Standard | 386 | 0.401 |
| Premium | 416 | 0.432 |
| Excellent | 161 | 0.167 |
fit_ord <- polr(Grade ~ Proses + alt100 + moist_pct,
data = data_ord, Hess = TRUE, method = "logistic")
summary(fit_ord)## Call:
## polr(formula = Grade ~ Proses + alt100 + moist_pct, data = data_ord,
## Hess = TRUE, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## ProsesNatural 0.6583 0.16567 3.973
## ProsesSemi-washed 0.4340 0.28683 1.513
## alt100 0.1409 0.01672 8.426
## moist_pct -0.0546 0.01437 -3.801
##
## Intercepts:
## Value Std. Error t value
## Standard|Premium 1.0208 0.2731 3.7374
## Premium|Excellent 3.1913 0.2920 10.9280
##
## Residual Deviance: 1882.969
## AIC: 1894.969
ct <- coef(summary(fit_ord))
ct <- cbind(ct,
p = round(2 * (1 - pnorm(abs(ct[, "t value"]))), 4),
OR = round(exp(ct[, "Value"]), 3))
kable(round(ct, 4), caption = "Koefisien ordinal; dua baris akhir = ambang (cutpoint).")| Value | Std. Error | t value | p | OR | |
|---|---|---|---|---|---|
| ProsesNatural | 0.6583 | 0.1657 | 3.9734 | 0.0001 | 1.931 |
| ProsesSemi-washed | 0.4340 | 0.2868 | 1.5132 | 0.1302 | 1.543 |
| alt100 | 0.1409 | 0.0167 | 8.4262 | 0.0000 | 1.151 |
| moist_pct | -0.0546 | 0.0144 | -3.8009 | 0.0001 | 0.947 |
| Standard|Premium | 1.0208 | 0.2731 | 3.7374 | 0.0002 | 2.776 |
| Premium|Excellent | 3.1913 | 0.2920 | 10.9280 | 0.0000 | 24.320 |
Koefisien positif memperbesar peluang menuju grade lebih tinggi. Tiap kenaikan ketinggian 100 m menaikkan odds naik grade ~15% (OR 1.151) — penegasan kuantitatif bahwa kopi dataran tinggi cenderung bermutu lebih tinggi. Pengolahan Natural juga berasosiasi dengan grade lebih tinggi dibanding Washed.
pred_ord <- predict(fit_ord, type = "class")
cm_ord <- table(Aktual = data_ord$Grade, Prediksi = pred_ord)
kable(cm_ord, caption = "Confusion matrix model ordinal.")| Standard | Premium | Excellent | |
|---|---|---|---|
| Standard | 203 | 181 | 2 |
| Premium | 134 | 278 | 4 |
| Excellent | 30 | 126 | 5 |
## Akurasi klasifikasi: 50.5%
grid2 <- data.frame(
alt100 = seq(min(data_ord$alt100), max(data_ord$alt100), length.out = 200),
Proses = factor("Washed", levels = levels(data_ord$Proses)),
moist_pct = mean(data_ord$moist_pct))
pr2 <- as.data.frame(predict(fit_ord, newdata = grid2, type = "probs"))
plot2 <- cbind(grid2["alt100"], pr2) %>%
pivot_longer(-alt100, names_to = "Grade", values_to = "prob") %>%
mutate(Grade = factor(Grade, levels = c("Standard", "Premium", "Excellent")))
ggplot(plot2, aes(alt100 * 100, prob, color = Grade)) +
geom_line(linewidth = 1.3) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = pal) +
labs(title = "Prediksi Probabilitas Grade Mutu menurut Ketinggian",
subtitle = "Proses = Washed; kelembapan pada rata-rata",
x = "Ketinggian lahan (m)", y = "Probabilitas") +
theme_minimal(base_size = 12)Asumsi proportional odds diuji dengan
ordinal::nominal_test: bila suatu prediktor signifikan,
efeknya tidak proporsional di semua ambang sehingga
asumsi dilanggar untuk prediktor tersebut.
fit_clm <- clm(Grade ~ Proses + alt100 + moist_pct, data = data_ord, link = "logit")
nominal_test(fit_clm)Hasil menunjukkan ketinggian (alt100) melanggar asumsi
proportional-odds (\(p < 0{,}05\)),
sehingga model partial-proportional-odds atau multinomial dapat
dipertimbangkan sebagai alternatif jika diperlukan akurasi lebih
tinggi.
Regresi Poisson memodelkan variabel respons berupa cacahan (banyaknya kejadian). Dengan fungsi tautan log:
\[ \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \qquad Y_i \sim \text{Poisson}(\mu_i). \]
Koefisien dieksponenkan menjadi Incidence Rate Ratio (IRR). Asumsi utama adalah ekuidispersi (rata-rata = varians).
Kasus: memodelkan jumlah cacat biji
(category_two_defects) berdasarkan metode pengolahan,
ketinggian, dan kelembapan.
data_pois <- na.omit(d[, c("category_two_defects", "Proses", "alt100", "moist_pct")])
ggplot(data_pois, aes(category_two_defects)) +
geom_histogram(binwidth = 1, fill = pal[1], color = "white") +
labs(title = "Distribusi Jumlah Cacat Biji (Category Two Defects)",
x = "Jumlah cacat", y = "Frekuensi") +
theme_minimal(base_size = 12)cat(sprintf("mean = %.2f ; varians = %.2f",
mean(data_pois$category_two_defects), var(data_pois$category_two_defects)))## mean = 3.72 ; varians = 29.73
fit_pois <- glm(category_two_defects ~ Proses + alt100 + moist_pct,
data = data_pois, family = poisson(link = "log"))
summary(fit_pois)##
## Call:
## glm(formula = category_two_defects ~ Proses + alt100 + moist_pct,
## family = poisson(link = "log"), data = data_pois)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.77817 0.07908 9.840 <2e-16 ***
## ProsesNatural 0.04380 0.04419 0.991 0.3216
## ProsesSemi-washed -0.17088 0.07883 -2.168 0.0302 *
## alt100 -0.00721 0.00420 -1.717 0.0860 .
## moist_pct 0.06272 0.00490 12.800 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5329.8 on 962 degrees of freedom
## Residual deviance: 5131.5 on 958 degrees of freedom
## AIC: 7343.8
##
## Number of Fisher Scoring iterations: 6
sp <- summary(fit_pois)$coefficients
kable(round(cbind(sp, IRR = exp(sp[, "Estimate"])), 4),
caption = "Koefisien Poisson dan Incidence Rate Ratio (IRR).")| Estimate | Std. Error | z value | Pr(>|z|) | IRR | |
|---|---|---|---|---|---|
| (Intercept) | 0.7782 | 0.0791 | 9.8400 | 0.0000 | 2.1775 |
| ProsesNatural | 0.0438 | 0.0442 | 0.9911 | 0.3216 | 1.0448 |
| ProsesSemi-washed | -0.1709 | 0.0788 | -2.1677 | 0.0302 | 0.8429 |
| alt100 | -0.0072 | 0.0042 | -1.7167 | 0.0860 | 0.9928 |
| moist_pct | 0.0627 | 0.0049 | 12.8003 | 0.0000 | 1.0647 |
Interpretasi. Pengolahan Natural menaikkan laju cacat ~4% (IRR 1.045) dibanding Washed — masuk akal karena pengeringan dengan kulit buah lebih rawan cacat. Kelembapan lebih tinggi juga menaikkan laju cacat.
grid3 <- expand.grid(
alt100 = seq(min(data_pois$alt100), max(data_pois$alt100), length.out = 200),
Proses = factor(c("Washed", "Natural"), levels = levels(data_pois$Proses)),
moist_pct = mean(data_pois$moist_pct))
grid3$pred <- predict(fit_pois, newdata = grid3, type = "response")
ggplot(grid3, aes(alt100 * 100, pred, color = Proses)) +
geom_line(linewidth = 1.3) +
scale_color_manual(values = pal) +
labs(title = "Prediksi Laju Jumlah Cacat",
x = "Ketinggian lahan (m)", y = "Perkiraan jumlah cacat") +
theme_minimal(base_size = 12)##
## Overdispersion test
##
## data: fit_pois
## z = 7.0688, p-value = 7.812e-13
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 7.396174
Statistik dispersi jauh di atas 1 dengan \(p\)-value sangat kecil ⇒ terjadi overdispersi (varians ≫ rata-rata), sehingga galat baku Poisson terlalu optimis. Solusi yang lazim adalah Negative Binomial:
fit_nb <- glm.nb(category_two_defects ~ Proses + alt100 + moist_pct, data = data_pois)
data.frame(Model = c("Poisson", "Negative Binomial"),
AIC = c(round(AIC(fit_pois), 1), round(AIC(fit_nb), 1))) %>%
kable(caption = "Perbandingan AIC: Negative Binomial jauh lebih baik.")| Model | AIC |
|---|---|
| Poisson | 7343.8 |
| Negative Binomial | 4639.9 |
data.frame(
Tipe_Respons = c("Nominal (≥3 kategori)", "Ordinal (berurutan)", "Cacah / count"),
Model = c("Logistik Multinomial", "Logistik Ordinal", "Poisson / Negative Binomial"),
Fungsi_R = c("nnet::multinom", "MASS::polr", "glm(poisson) / MASS::glm.nb"),
Asumsi_Kunci = c("IIA", "Proportional odds", "Ekuidispersi (cek overdispersi)")
) %>% kable(caption = "Pedoman pemilihan model menurut tipe respons.")| Tipe_Respons | Model | Fungsi_R | Asumsi_Kunci |
|---|---|---|---|
| Nominal (≥3 kategori) | Logistik Multinomial | nnet::multinom | IIA |
| Ordinal (berurutan) | Logistik Ordinal | MASS::polr | Proportional odds |
| Cacah / count | Poisson / Negative Binomial | glm(poisson) / MASS::glm.nb | Ekuidispersi (cek overdispersi) |