Inferensi pada tabel kontingensi dua arah merupakan salah satu metode statistik yang digunakan untuk menguji dan mengukur hubungan (asosiasi) antara dua variabel kategori. Dalam laporan ini, dibahas dua kasus utama:
# Load semua paket yang dibutuhkan
library(knitr) # tabel rapi
library(kableExtra) # styling tabel tambahan
library(epitools) # RR, OR + CI
library(vcd) # mosaic plot, residual
library(ggplot2) # visualisasi
library(scales) # format axis
library(dplyr) # manipulasi data📌 Definisi Formal
Tabel kontingensi (contingency table) adalah suatu tabel klasifikasi silang (cross-tabulation) dari dua atau lebih variabel yang bersifat kategori. Tabel ini menampilkan frekuensi kemunculan setiap kombinasi kategori sehingga memungkinkan peneliti untuk melihat pola hubungan secara ringkas dan sistematis.
Penamaan tabel kontingensi didasarkan pada jumlah variabel dan kategori: 2 variabel → tabel dua arah; jika variabel X memiliki \(I\) kategori dan Y memiliki \(J\) kategori → Tabel I × J. Dalam laporan ini mencakup tabel 2×2 (Kasus 1) dan 2×3 (Kasus 2).
Untuk tabel kontingensi 2×2 dengan variabel \(X\) (baris, 2 kategori) dan \(Y\) (kolom, 2 kategori):
\[ \begin{array}{c|cc|c} X \backslash Y & y_1 & y_2 & \text{Total} \\ \hline x_1 & n_{11} & n_{12} & n_{1.} \\ x_2 & n_{21} & n_{22} & n_{2.} \\ \hline \text{Total} & n_{.1} & n_{.2} & n_{..} \end{array} \]
| Notasi | Keterangan |
|---|---|
| \(n_{ij}\) | Frekuensi sel baris ke-\(i\), kolom ke-\(j\) |
| \(n_{i.} = \sum_j n_{ij}\) | Total baris ke-\(i\) |
| \(n_{.j} = \sum_i n_{ij}\) | Total kolom ke-\(j\) |
| \(n_{..} = \sum_i \sum_j n_{ij}\) | Total keseluruhan |
| \(\pi_{ij} = n_{ij}/n_{..}\) | Peluang bersama (joint probability) |
| \(\pi_{j \mid i} = n_{ij}/n_{i.}\) | Peluang bersyarat \(Y=j\) given \(X=i\) |
Rumus Distribusi Peluang
Bersama (Joint): \(\pi_{ij} = P(X=i, Y=j) = \dfrac{n_{ij}}{n_{..}}\)
Marginal: \(\pi_{i.} = \dfrac{n_{i.}}{n_{..}}\), \(\quad \pi_{.j} = \dfrac{n_{.j}}{n_{..}}\)
Bersyarat (Conditional): \(\pi_{j \mid i} = P(Y=j \mid X=i) = \dfrac{n_{ij}}{n_{i.}}\)
Independensi: \(X \perp Y \iff \pi_{ij} = \pi_{i.} \times \pi_{.j}\)
Rumus Risk Difference
\[RD = \pi_{1|1} - \pi_{1|2} = \frac{n_{11}}{n_{1.}} - \frac{n_{21}}{n_{2.}}\]
Interval Kepercayaan \((1-\alpha)\)%: \[RD \pm z_{\alpha/2} \sqrt{\frac{\hat{\pi}_{1|1}(1-\hat{\pi}_{1|1})}{n_{1.}} + \frac{\hat{\pi}_{1|2}(1-\hat{\pi}_{1|2})}{n_{2.}}}\]
Rumus Relative Risk
\[RR = \frac{\pi_{1|1}}{\pi_{1|2}} = \frac{n_{11}/n_{1.}}{n_{21}/n_{2.}}\]
Interval Kepercayaan (via log-transformasi): \[\exp\!\left(\ln RR \pm z_{\alpha/2} \cdot SE_{\ln RR}\right), \quad SE_{\ln RR} = \sqrt{\frac{1-\hat{\pi}_{1|1}}{n_{11}} + \frac{1-\hat{\pi}_{1|2}}{n_{21}}}\]
Rumus Odds Ratio
\[OR = \frac{n_{11} \times n_{22}}{n_{12} \times n_{21}}\]
Interval Kepercayaan (metode Woolf): \[\exp\!\left(\ln OR \pm z_{\alpha/2} \cdot SE_{\ln OR}\right), \quad SE_{\ln OR} = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}\]
📋 Soal 1 — Penyusunan Tabel
Susun kembali tabel kontingensi 2×2 tentang hubungan kebiasaan merokok dan kanker paru.
tabel1 <- matrix(
c(688, 650,
21, 59),
nrow = 2,
byrow = TRUE
)
rownames(tabel1) <- c("Smoker", "Non-Smoker")
colnames(tabel1) <- c("Cancer (+)", "Control (-)")
addmargins(tabel1)#> Cancer (+) Control (-) Sum
#> Smoker 688 650 1338
#> Non-Smoker 21 59 80
#> Sum 709 709 1418
| Status Merokok | Cancer (+) | Control (-) | Total |
|---|---|---|---|
| Smoker | 688 | 650 | 1338 |
| Non-Smoker | 21 | 59 | 80 |
| Total | 709 | 709 | 1418 |
📋 Soal 2 — Estimasi Titik
Hitung estimasi titik proporsi kejadian kanker paru pada kelompok Smoker dan Non-Smoker.
📌 Konsep
Estimasi titik proporsi merupakan proporsi sampel \(\hat{\pi}_{1|i}\) yang menjadi penduga tak bias untuk parameter populasi \(\pi_{1|i}\): \[\hat{\pi}_{1|1} = \frac{n_{11}}{n_{1.}}, \qquad \hat{\pi}_{1|2} = \frac{n_{21}}{n_{2.}}\]
n1. <- sum(tabel1[1, ])
n2. <- sum(tabel1[2, ])
n_total <- sum(tabel1)
pi_smoker <- tabel1[1, 1] / n1.
pi_nonsmoker <- tabel1[2, 1] / n2.
cat("Smoker :", tabel1[1,1], "/", n1., "=", round(pi_smoker, 4),
"(", round(pi_smoker*100, 2), "%)\n")#> Smoker : 688 / 1338 = 0.5142 ( 51.42 %)
cat("Non-Smoker :", tabel1[2,1], "/", n2., "=", round(pi_nonsmoker, 4),
"(", round(pi_nonsmoker*100, 2), "%)\n")#> Non-Smoker : 21 / 80 = 0.2625 ( 26.25 %)
✅ Interpretasi
Proporsi kejadian kanker paru pada kelompok Smoker sebesar \(\hat{\pi}_{1|1} = 0.5142\) (51.42%), sedangkan pada kelompok Non-Smoker sebesar \(\hat{\pi}_{1|2} = 0.2625\) (26.25%). Terdapat perbedaan proporsi yang cukup besar antara kedua kelompok.
📋 Soal 3 — Interval Kepercayaan
Hitung interval kepercayaan 95% untuk proporsi masing-masing kelompok, RD, RR, dan OR.
z <- qnorm(0.975)
se_smoker <- sqrt(pi_smoker * (1 - pi_smoker) / n1.)
ci_smoker_lo <- pi_smoker - z * se_smoker
ci_smoker_hi <- pi_smoker + z * se_smoker
se_nonsmoker <- sqrt(pi_nonsmoker * (1 - pi_nonsmoker) / n2.)
ci_nonsmoker_lo <- pi_nonsmoker - z * se_nonsmoker
ci_nonsmoker_hi <- pi_nonsmoker + z * se_nonsmoker
cat("CI 95% Proporsi:\n")#> CI 95% Proporsi:
#> Smoker : [0.4874, 0.541]
cat("Non-Smoker :", paste0("[", round(ci_nonsmoker_lo, 4), ", ", round(ci_nonsmoker_hi, 4), "]"), "\n")#> Non-Smoker : [0.1661, 0.3589]
RD <- pi_smoker - pi_nonsmoker
se_RD <- sqrt(
pi_smoker * (1 - pi_smoker) / n1. +
pi_nonsmoker * (1 - pi_nonsmoker) / n2.
)
ci_RD_lo <- RD - z * se_RD
ci_RD_hi <- RD + z * se_RD
hasil_rr <- riskratio(tabel1, method = "wald", rev = "neither")
RR <- hasil_rr$measure[2, 1]
ci_RR_lo <- hasil_rr$measure[2, 2]
ci_RR_hi <- hasil_rr$measure[2, 3]
hasil_or <- fisher.test(tabel1)
OR <- hasil_or$estimate
ci_OR_lo <- hasil_or$conf.int[1]
ci_OR_hi <- hasil_or$conf.int[2]
df_ci <- data.frame(
Ukuran = c("RD (Risk Difference)", "RR (Relative Risk)", "OR (Odds Ratio)"),
Estimasi = round(c(RD, RR, OR), 4),
`CI 95% Bawah` = round(c(ci_RD_lo, ci_RR_lo, ci_OR_lo), 4),
`CI 95% Atas` = round(c(ci_RD_hi, ci_RR_hi, ci_OR_hi), 4),
check.names = FALSE
)
kable(df_ci,
caption = "Tabel 1.2 — Ringkasan Estimasi Titik dan Interval Kepercayaan 95%",
align = "c"
)| Ukuran | Estimasi | CI 95% Bawah | CI 95% Atas |
|---|---|---|---|
| RD (Risk Difference) | 0.2517 | 0.1516 | 0.3518 |
| RR (Relative Risk) | 1.5181 | 1.3173 | 1.7495 |
| OR (Odds Ratio) | 2.9716 | 1.7556 | 5.2107 |
✅ Interpretasi
📋 Soal 4 — Uji Dua Proporsi
Lakukan uji dua proporsi untuk menguji apakah proporsi kanker paru berbeda antara Smoker dan Non-Smoker.
Hipotesis dan Statistik Uji
Hipotesis: \[H_0: \pi_{1|1} = \pi_{1|2} \quad \text{vs} \quad H_1: \pi_{1|1} \neq \pi_{1|2}\]
Statistik Uji (proporsi gabungan): \[\hat{\pi}_p = \frac{n_{11} + n_{21}}{n_{..}}, \quad z = \frac{\hat{\pi}_{1|1} - \hat{\pi}_{1|2}}{\sqrt{\hat{\pi}_p(1-\hat{\pi}_p)\left(\frac{1}{n_{1.}} + \frac{1}{n_{2.}}\right)}} \sim N(0,1)\]
hasil_prop <- prop.test(
x = c(tabel1[1,1], tabel1[2,1]),
n = c(n1., n2.),
correct = FALSE,
alternative = "two.sided"
)
cat("Proporsi Smoker :", round(pi_smoker, 4), "\n")#> Proporsi Smoker : 0.5142
#> Proporsi Non-Smoker : 0.2625
#> Statistik chi2 (=z2): 19.1292
#> Nilai z : 4.3737
#> p-value : 1.2e-05
#> Keputusan : TOLAK H0
✅ Interpretasi
Dengan \(p\text{-value} \approx 1.22e-05\) yang jauh lebih kecil dari \(\alpha = 0{,}05\), kita menolak \(H_0\). Terdapat cukup bukti statistik bahwa proporsi kejadian kanker paru pada kelompok Smoker berbeda secara signifikan dari kelompok Non-Smoker.
📋 Soal 5 — Uji Chi-Square
Lakukan uji chi-square independensi untuk menguji apakah terdapat asosiasi antara status merokok dan kanker paru.
Hipotesis dan Statistik Uji
Hipotesis: \[H_0: \pi_{ij} = \pi_{i.} \times \pi_{.j} \text{ (independen)} \quad \text{vs} \quad H_1: \text{tidak independen}\]
Frekuensi harapan: \(\hat{\mu}_{ij} = \dfrac{n_{i.} \times n_{.j}}{n_{..}}\)
Statistik uji Pearson: \[X^2 = \sum_{i}\sum_{j} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}} \sim \chi^2_{(I-1)(J-1)}\]
#> Frekuensi Harapan:
#> Cancer (+) Control (-)
#> Smoker 669 669
#> Non-Smoker 40 40
kontribusi <- (tabel1 - hasil_chisq$expected)^2 / hasil_chisq$expected
cat("\nKontribusi per Sel terhadap X2:\n")#>
#> Kontribusi per Sel terhadap X2:
#> Cancer (+) Control (-)
#> Smoker 0.5396 0.5396
#> Non-Smoker 9.0250 9.0250
#>
#> X2 hitung : 19.1292
#> df : 1
#> p-value : 1.2e-05
#> Keputusan : TOLAK H0
✅ Interpretasi
Statistik uji \(X^2 = 19.1292\) dengan \(df = 1\) menghasilkan \(p\text{-value} \approx 1.22e-05\). Karena \(p < 0{,}05\), kita menolak \(H_0\). Terdapat asosiasi yang signifikan antara status merokok dan kejadian kanker paru.
📋 Soal 6 — Uji Likelihood Ratio
Lakukan uji likelihood ratio (\(G^2\)) sebagai alternatif uji chi-square.
Statistik Uji G²
\[G^2 = 2 \sum_{i}\sum_{j} n_{ij} \ln\!\left(\frac{n_{ij}}{\hat{\mu}_{ij}}\right) \sim \chi^2_{(I-1)(J-1)}\]
\(G^2\) dan \(X^2\) memiliki distribusi asimtotik yang sama, namun \(G^2\) didasarkan pada prinsip maximum likelihood.
mu_hat <- hasil_chisq$expected
G2 <- 2 * sum(tabel1 * log(tabel1 / mu_hat))
df_lr <- (nrow(tabel1) - 1) * (ncol(tabel1) - 1)
pval_lr <- pchisq(G2, df = df_lr, lower.tail = FALSE)
cat("G2 hitung :", round(G2, 4), "\n")#> G2 hitung : 19.878
#> df : 1
#> p-value : 8.3e-06
#> Keputusan : TOLAK H0
#>
#> Perbandingan: X2 = 19.1292 , G2 = 19.878
✅ Interpretasi
\(G^2 = 19.878\) dengan \(df = 1\) menghasilkan \(p\text{-value} \approx 8.25e-06\). Keputusan konsisten dengan uji chi-square: tolak \(H_0\). Nilai \(G^2\) (19.878) sangat dekat dengan \(X^2\) (19.1292), sesuai teori asimtotik yang menyatakan kedua statistik setara untuk sampel besar.
📋 Soal 7 — Fisher Exact Test
Lakukan Fisher Exact Test dan bandingkan hasilnya dengan uji sebelumnya.
Ide Fisher Exact Test
Fisher Exact Test menghitung probabilitas tepat (exact p-value) berdasarkan distribusi hipergeometrik, tanpa asumsi sampel besar. Dengan baris dan kolom marginal tetap:
\[P(\text{tabel} \mid \text{marginal tetap}) = \frac{\binom{n_{1.}}{n_{11}} \binom{n_{2.}}{n_{21}}}{\binom{n_{..}}{n_{.1}}}\]
p-value adalah jumlah peluang semua tabel yang sama ekstrem atau lebih ekstrem dari yang diamati.
hasil_fisher <- fisher.test(tabel1, alternative = "two.sided")
cat("OR estimasi :", round(hasil_fisher$estimate, 4), "\n")#> OR estimasi : 2.9716
cat("95% CI OR :", paste0("[", round(hasil_fisher$conf.int[1], 4), ", ", round(hasil_fisher$conf.int[2], 4), "]"), "\n")#> 95% CI OR : [1.7556, 5.2107]
#> p-value : 1.5e-05
#> Keputusan : TOLAK H0
✅ Interpretasi
Fisher Exact Test memberikan \(p\text{-value} = 1.48e-05\) dan OR = 2.9716 dengan 95% CI = [1.7556, 5.2107]. Karena CI tidak mencakup 1 dan \(p < 0{,}05\), kita menolak \(H_0\).
📋 Soal 8 — Perbandingan Keempat Uji
Bandingkan hasil uji dua proporsi, chi-square, likelihood ratio, dan Fisher exact test dari sisi hipotesis, statistik uji, p-value, keputusan, dan interpretasi substantif.
| Aspek | Uji 2 Proporsi | Chi-Square | Likelihood Ratio | Fisher Exact |
|---|---|---|---|---|
| Hipotesis H0 | pi1|1 = pi1|2 | Independensi | Independensi | OR = 1 (independen) |
| Statistik Uji | z (atau chi2 = z2) | X2 Pearson | G2 | Distribusi Hipergeometrik |
| Nilai Statistik | chi2 = 19.1292 | 19.1292 | 19.878 | OR = 2.9716 |
| Distribusi | N(0,1) atau chi2(1) | chi2(1) | chi2(1) | Hipergeometrik (exact) |
| p-value | 1.22e-05 | 1.22e-05 | 8.25e-06 | 1.48e-05 |
| Keputusan (a=0,05) | Tolak H0 | Tolak H0 | Tolak H0 | Tolak H0 |
💡 Diskusi Perbandingan Uji
Keempat uji memberikan kesimpulan yang konsisten: terdapat asosiasi signifikan antara merokok dan kanker paru. Perbedaannya terletak pada:
• Uji 2 Proporsi dan Chi-Square secara matematis ekuivalen (\(X^2 = z^2\)) — keduanya menggunakan pendekatan frekuensi harapan yang sama.
• Likelihood Ratio (\(G^2\)) memberikan nilai yang sangat dekat dengan \(X^2\) (19.878 vs 19.1292) karena sampel besar (\(n=1418\)) — kedua statistik konvergen secara asimtotik.
• Fisher Exact Test tidak memerlukan asumsi sampel besar dan memberikan nilai exact. Untuk data besar ini hasilnya konsisten dengan ketiga uji lainnya.
par(mfrow = c(1, 3), mar = c(4, 4, 3, 1), family = "sans")
barplot(
height = c(pi_smoker, pi_nonsmoker),
names.arg = c("Smoker\n(51.42%)", "Non-Smoker\n(26.25%)"),
col = c("#f43f5e", "#0d9488"),
ylim = c(0, 0.75),
main = "Proporsi Kanker Paru",
ylab = "P(Cancer | Status Merokok)",
border = NA, las = 1, cex.main = 1
)
abline(h = 0, col = "gray70")
mosaicplot(
tabel1,
color = c("#f43f5e", "#0d9488"),
main = "Mosaic Plot",
xlab = "Status Merokok",
ylab = "Diagnosis",
cex.axis = 0.85,
border = "white"
)
plot(
x = OR, y = 1,
pch = 15, col = "#7c3aed", cex = 2.5,
xlim = c(0, 8), ylim = c(0.5, 1.5),
yaxt = "n",
main = sprintf("Odds Ratio + 95%% CI\nOR = %.4f", OR),
xlab = "Odds Ratio", ylab = "",
las = 1, cex.main = 1
)
arrows(ci_OR_lo, 1, ci_OR_hi, 1,
angle = 90, code = 3, length = 0.1,
col = "#7c3aed", lwd = 2.5)
abline(v = 1, col = "#f43f5e", lty = 2, lwd = 1.5)
text(1.3, 1.35, "H0: OR=1", col = "#f43f5e", cex = 0.75)
text(OR, 0.7,
sprintf("[%.2f, %.2f]", ci_OR_lo, ci_OR_hi),
col = "#7c3aed", cex = 0.8)Visualisasi Kasus 1: Proporsi, Mosaic Plot, dan Ukuran Asosiasi
📋 Soal 9 — Kesimpulan Akhir
Buat kesimpulan akhir tentang hubungan antara merokok dan kanker paru.
🎯 Kesimpulan Kasus 1
Berdasarkan seluruh analisis pada data 1418 subjek (709 kasus kanker, 709 kontrol), diperoleh kesimpulan:
1. Terdapat asosiasi positif yang kuat dan signifikan antara kebiasaan merokok dan kejadian kanker paru. Keempat metode uji hipotesis (uji dua proporsi, chi-square, likelihood ratio, dan Fisher exact test) secara konsisten menolak \(H_0\) independensi dengan \(p\text{-value}\) yang sangat kecil.
2. Besaran asosiasi: RD = 0.2517 (Smoker 25,17% lebih berisiko secara absolut), RR = 1.5181 (Smoker 1.52 kali lebih berisiko), OR = 2.9716 (odds Smoker 2.97 kali lebih besar). Ketiga ukuran konsisten menunjukkan asosiasi positif kuat.
3. Secara substantif: Temuan ini konsisten dengan bukti epidemiologi global bahwa merokok merupakan faktor risiko utama kanker paru. Perokok memiliki risiko kanker paru hampir 2 kali lipat dibanding bukan perokok dalam studi case-control ini.
📋 Soal 1 — Penyusunan Tabel
Susun kembali tabel kontingensi 2×3 tentang hubungan gender dan identifikasi partai politik.
tabel2 <- matrix(
c(495, 272, 590,
330, 265, 498),
nrow = 2,
byrow = TRUE
)
rownames(tabel2) <- c("Female", "Male")
colnames(tabel2) <- c("Democrat", "Republican", "Independent")
addmargins(tabel2)#> Democrat Republican Independent Sum
#> Female 495 272 590 1357
#> Male 330 265 498 1093
#> Sum 825 537 1088 2450
| Gender | Democrat | Republican | Independent | Total |
|---|---|---|---|---|
| Female | 495 | 272 | 590 | 1357 |
| Male | 330 | 265 | 498 | 1093 |
| Total | 825 | 537 | 1088 | 2450 |
📋 Soal 2 — Frekuensi Harapan
Hitung frekuensi harapan untuk setiap sel.
Rumus Frekuensi Harapan
\[\hat{\mu}_{ij} = \frac{n_{i.} \times n_{.j}}{n_{..}}\]
Frekuensi harapan merepresentasikan nilai yang diharapkan di setiap sel jika \(H_0\) (independensi) benar.
chisq_k2 <- chisq.test(tabel2, correct = FALSE)
mu_hat_k2 <- chisq_k2$expected
cat("Frekuensi Harapan:\n")#> Frekuensi Harapan:
#> Democrat Republican Independent
#> Female 456.949 297.4322 602.6188
#> Male 368.051 239.5678 485.3812
cat("\nMin mu_hat =", round(min(mu_hat_k2), 4), "->",
ifelse(min(mu_hat_k2) >= 5, "Syarat Cochran terpenuhi", "Gunakan Fisher/simulasi"), "\n")#>
#> Min mu_hat = 239.5678 -> Syarat Cochran terpenuhi
| Gender | Democrat | Republican | Independent | |
|---|---|---|---|---|
| Female | Female | 456.949 | 297.4322 | 602.6188 |
| Male | Male | 368.051 | 239.5678 | 485.3812 |
✅ Interpretasi
Seluruh frekuensi harapan \(\hat{\mu}_{ij} \geq 5\) (minimum = 239.57), sehingga syarat Cochran terpenuhi dan uji chi-square asimtotik dapat digunakan dengan valid.
📋 Soal 3 — Uji Chi-Square Keseluruhan
Lakukan uji chi-square independensi untuk tabel keseluruhan.
#> X2 hitung : 12.5693
#> df : 2 [(2-1)x(3-1)]
#> p-value : 0.0019
#> Keputusan : TOLAK H0
✅ Interpretasi
\(X^2 = 12.5693\) dengan \(df = 2\) menghasilkan \(p\text{-value} = 0.0019\). Karena \(p < 0{,}05\), tolak \(H_0\) — terdapat asosiasi signifikan antara gender dan identifikasi partai politik.
📋 Soal 4 — Residual Pearson
Hitung dan interpretasikan residual Pearson atau standardized residual untuk mengidentifikasi sel yang paling berkontribusi.
Rumus Residual
Residual Pearson: \(r_{ij} = \dfrac{n_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}\)
Standardized Residual (Adjusted): \(d_{ij} = \dfrac{n_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}(1 - \hat{p}_{i.})(1 - \hat{p}_{.j})}}\)
Aturan: \(|d_{ij}| > 2\) → sel berkontribusi signifikan; \(|d_{ij}| > 3\) → sangat signifikan.
#> Residual Pearson (rij):
#> Democrat Republican Independent
#> Female 1.7801 -1.4747 -0.5140
#> Male -1.9834 1.6431 0.5728
#>
#> Standardized Residual (dij):
#> Democrat Republican Independent
#> Female 3.2724 -2.4986 -1.0322
#> Male -3.2724 2.4986 1.0322
#>
#> Kontribusi Sel terhadap X2:
#> Democrat Republican Independent
#> Female 3.1686 2.1746 0.2642
#> Male 3.9339 2.6999 0.3281
#>
#> Total X2 = 12.5693 (verifikasi: 12.5693 )
res_df <- as.data.frame(as.table(round(res_std, 3)))
names(res_df) <- c("Gender", "Partai", "Residual")
ggplot(res_df, aes(x = Partai, y = Gender, fill = Residual)) +
geom_tile(color = "white", linewidth = 1.5) +
geom_text(aes(label = sprintf("%.3f", Residual)),
size = 5, fontface = "bold",
color = ifelse(abs(res_df$Residual) > 1.5, "white", "#1e293b")) +
scale_fill_gradient2(
low = "#f43f5e",
mid = "#f8fafc",
high = "#0d9488",
midpoint = 0,
limits = c(-4, 4),
name = "Std. Residual"
) +
labs(
title = "Standardized Residual per Sel",
subtitle = "Merah = lebih rendah dari harapan | Hijau = lebih tinggi dari harapan",
x = "Identifikasi Partai Politik",
y = "Gender"
) +
theme_minimal(base_family = "sans") +
theme(
plot.title = element_text(face = "bold", size = 14, color = "#1a2744"),
plot.subtitle = element_text(size = 11, color = "#64748b"),
axis.text = element_text(size = 12, color = "#334155"),
panel.grid = element_blank()
)Heatmap Standardized Residual — Kasus 2
✅ Interpretasi Residual
Dari standardized residual: - Female–Democrat (\(d = 3.272\)): Female lebih banyak dari yang diharapkan mengidentifikasi sebagai Democrat (signifikan, \(|d|>2\)). - Female–Republican (\(d = -2.499\)): Female lebih sedikit dari yang diharapkan mengidentifikasi sebagai Republican (signifikan). - Male–Democrat (\(d = -3.272\)): Male lebih sedikit dari yang diharapkan mengidentifikasi sebagai Democrat. - Male–Republican (\(d = 2.499\)): Male lebih banyak dari yang diharapkan mengidentifikasi sebagai Republican. - Independent relatif lebih kecil residualnya, menandakan distribusinya lebih seimbang antar gender.
📋 Soal 5 — Partisi Chi-Square
Lakukan partisi chi-square untuk (a) Democrat vs Republican, dan (b) (Democrat + Republican) vs Independent.
Konsep Partisi Chi-Square
Untuk tabel \(I \times J\), chi-square total dengan \(df = (I-1)(J-1)\) dapat dipartisi menjadi komponen-komponen yang aditif: \[X^2_{\text{total}} = X^2_{\text{partisi 1}} + X^2_{\text{partisi 2}} + \cdots\] \[df_{\text{total}} = df_1 + df_2 + \cdots\]
Dengan \(df_1 + df_2 = (I-1)(J-1) = 2\) untuk tabel 2×3.
tabel_DR <- tabel2[, 1:2]
chisq_DR <- chisq.test(tabel_DR, correct = FALSE)
cat("PARTISI 1: Democrat vs Republican\n")#> PARTISI 1: Democrat vs Republican
#> Democrat Republican Sum
#> Female 495 272 767
#> Male 330 265 595
#> Sum 825 537 1362
cat("\nX2_1 =", round(chisq_DR$statistic, 4), ", df =", chisq_DR$parameter,
", p-value =", round(chisq_DR$p.value, 4), "\n")#>
#> X2_1 = 11.5545 , df = 1 , p-value = 7e-04
tabel_DRI <- cbind(rowSums(tabel2[, 1:2]), tabel2[, 3])
rownames(tabel_DRI) <- rownames(tabel2)
colnames(tabel_DRI) <- c("Dem+Rep", "Independent")
chisq_DRI <- chisq.test(tabel_DRI, correct = FALSE)
cat("\nPARTISI 2: (Dem+Rep) vs Independent\n")#>
#> PARTISI 2: (Dem+Rep) vs Independent
#> Dem+Rep Independent Sum
#> Female 767 590 1357
#> Male 595 498 1093
#> Sum 1362 1088 2450
cat("\nX2_2 =", round(chisq_DRI$statistic, 4), ", df =", chisq_DRI$parameter,
", p-value =", round(chisq_DRI$p.value, 4), "\n")#>
#> X2_2 = 1.0654 , df = 1 , p-value = 0.302
#>
#> Ringkasan Partisi:
#> X2_1 (Dem vs Rep) : 11.5545 (df = 1 )
#> X2_2 (Dem+Rep vs Ind) : 1.0654 (df = 1 )
cat("Total Partisi :", round(chisq_DR$statistic + chisq_DRI$statistic, 4),
" (df =", chisq_DR$parameter + chisq_DRI$parameter, ")\n")#> Total Partisi : 12.62 (df = 2 )
#> X2 Keseluruhan : 12.5693 (df = 2 )
📋 Soal 6 — Perbandingan Partisi vs Keseluruhan
Bandingkan hasil partisi dengan hasil uji chi-square keseluruhan.
| Perbandingan | X2 | df | p-value | Keputusan |
|---|---|---|---|---|
| Partisi 1: Democrat vs Republican | 11.5545 | 1 | 0.0007 | Tolak H0 |
| Partisi 2: (Dem+Rep) vs Independent | 1.0654 | 1 | 0.3020 | Gagal Tolak |
| Total Partisi | 12.6200 | 2 | NA | - |
| Chi-Square Keseluruhan (2x3) | 12.5693 | 2 | 0.0019 | Tolak H0 |
✅ Interpretasi Perbandingan
📋 Soal 7 — Identifikasi Kategori Paling Berkontribusi
Jelaskan kategori mana yang paling berkontribusi terhadap hubungan antara Gender dan Identifikasi Partai Politik.
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
contrib_vec <- as.vector(t(contrib))
names(contrib_vec) <- paste0(
rep(c("F-", "M-"), each = 3),
rep(c("Dem", "Rep", "Ind"), 2)
)
warna_contrib <- ifelse(contrib_vec == max(contrib_vec), "#f43f5e",
ifelse(contrib_vec >= sort(contrib_vec, decreasing=TRUE)[2],
"#f59e0b", "#0d9488"))
barplot(
contrib_vec,
col = warna_contrib,
main = "Kontribusi Sel\nterhadap X2",
ylab = "Kontribusi (r2ij)",
las = 2, border = NA, cex.names = 0.85, cex.main = 1
)
abline(h = 0, col = "gray70")
mosaicplot(
tabel2,
color = c("#db2777", "#7c3aed", "#0d9488"),
main = "Mosaic Plot\nGender x Partai",
xlab = "Gender",
shade = TRUE,
border = "white", cex.axis = 0.85
)Kontribusi Sel terhadap Chi-Square dan Mosaic Plot — Kasus 2
🎯 Kategori Paling Berkontribusi
Berdasarkan standardized residual dan kontribusi terhadap \(X^2\):
1. Democrat adalah partai yang paling menunjukkan perbedaan gender. Female secara signifikan over-represented di Democrat (\(d_{F,Dem} = 3.272\), \(|d|>2\)) dan Male under-represented (\(d_{M,Dem} = -3.272\)).
2. Republican juga menunjukkan pola sebaliknya: Female under-represented (\(d_{F,Rep} = -2.499\)) dan Male over-represented (\(d_{M,Rep} = 2.499\)).
3. Independent memiliki residual yang paling kecil, artinya distribusi gender di kelompok Independent relatif paling proporsional dan berkontribusi paling sedikit terhadap ketidakindependensian.
Secara substantif: perbedaan gender dalam afiliasi partai paling kentara pada Democrat dan Republican, di mana wanita cenderung lebih condong ke Democrat sedangkan pria lebih condong ke Republican.
🎯 Kesimpulan Laporan
Kasus 1 (Merokok & Kanker Paru): Terdapat asosiasi positif yang kuat dan signifikan antara kebiasaan merokok dan kejadian kanker paru (\(X^2 = 19.1292\), \(p \approx 0\)). Perokok memiliki RR = 1.5181 (hampir 2x lebih berisiko) dan OR = 2.9716. Keempat uji hipotesis memberikan keputusan yang konsisten.
Kasus 2 (Gender & Partai Politik): Terdapat asosiasi signifikan antara gender dan identifikasi partai politik (\(X^2 = 12.5693\), \(p = 0.0019\)). Partisi chi-square menunjukkan bahwa asosiasi ini terutama didorong oleh perbedaan preferensi Democrat vs Republican antar gender, bukan oleh kategori Independent.
Dataset yang digunakan adalah TitanicSurvival
(package carData), berisi data penumpang kapal Titanic. Variabel respons
adalah survived (biner: yes/no) dengan referensi
no, dan prediktor yang digunakan adalah
age (umur), sex (jenis kelamin), dan
passengerClass (kelas tiket).
library(carData)
data(TitanicSurvival)
data_biner <- TitanicSurvival %>%
dplyr::select(survived, age, sex, passengerClass) %>%
na.omit()
data_biner <- data_biner %>%
mutate(
survived = factor(survived, levels = c("no", "yes")),
sex = factor(sex, levels = c("male", "female")),
passengerClass = factor(passengerClass, levels = c("3rd", "2nd", "1st"))
)
dplyr::glimpse(data_biner)#> Rows: 1,046
#> Columns: 4
#> $ survived <fct> yes, yes, no, no, no, yes, yes, no, yes, no, no, yes, y…
#> $ age <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 63.…
#> $ sex <fct> female, male, female, male, female, male, female, male,…
#> $ passengerClass <fct> 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, …
data_biner %>%
count(survived) %>%
mutate(proporsi = n / sum(n)) %>%
kable(digits = 3, caption = "Distribusi status keselamatan penumpang Titanic") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE)| survived | n | proporsi |
|---|---|---|
| no | 619 | 0.592 |
| yes | 427 | 0.408 |
data_biner %>%
count(survived) %>%
mutate(proporsi = n / sum(n)) %>%
ggplot(aes(x = survived, y = proporsi, fill = survived)) +
geom_col(width = 0.65, alpha = 0.94) +
geom_text(aes(label = percent(proporsi, accuracy = 0.1)),
vjust = -0.4, fontface = "bold", color = "#243142") +
scale_y_continuous(labels = percent_format(), limits = c(0, 0.8)) +
labs(
title = "Distribusi Status Keselamatan Penumpang Titanic",
subtitle = "Respons biner: selamat (yes) atau tidak (no).",
x = NULL,
y = "Proporsi"
) +
theme_minimal() +
theme(legend.position = "none")fit_biner <- glm(
survived ~ age + sex + passengerClass,
data = data_biner,
family = binomial(link = "logit")
)
summary(fit_biner)#>
#> Call:
#> glm(formula = survived ~ age + sex + passengerClass, family = binomial(link = "logit"),
#> data = data_biner)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.265431 0.202868 -6.238 4.44e-10 ***
#> age -0.034393 0.006331 -5.433 5.56e-08 ***
#> sexfemale 2.497845 0.166037 15.044 < 2e-16 ***
#> passengerClass2nd 1.009091 0.198363 5.087 3.64e-07 ***
#> passengerClass1st 2.289661 0.225802 10.140 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1414.62 on 1045 degrees of freedom
#> Residual deviance: 982.45 on 1041 degrees of freedom
#> AIC: 992.45
#>
#> Number of Fisher Scoring iterations: 4
biner_coef <- as.data.frame(coef(summary(fit_biner))) %>%
tibble::rownames_to_column("parameter") %>%
dplyr::rename(
estimate = Estimate,
std_error = `Std. Error`,
z_value = `z value`,
p_value = `Pr(>|z|)`
) %>%
mutate(
OR = exp(estimate),
CI_low = exp(estimate - 1.96 * std_error),
CI_high = exp(estimate + 1.96 * std_error),
perubahan_persen = 100 * (OR - 1)
)
biner_coef %>%
mutate(across(c(estimate, std_error, z_value, p_value,
OR, CI_low, CI_high, perubahan_persen), ~ round(.x, 4))) %>%
kable(caption = "Ringkasan hasil regresi logistik biner (referensi: tidak selamat)",
col.names = c("Parameter", "Estimate", "SE", "z-value", "p-value",
"OR", "CI 2.5%", "CI 97.5%", "Perubahan (%)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE)| Parameter | Estimate | SE | z-value | p-value | OR | CI 2.5% | CI 97.5% | Perubahan (%) |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -1.2654 | 0.2029 | -6.2377 | 0 | 0.2821 | 0.1896 | 0.4199 | -71.7882 |
| age | -0.0344 | 0.0063 | -5.4325 | 0 | 0.9662 | 0.9543 | 0.9783 | -3.3809 |
| sexfemale | 2.4978 | 0.1660 | 15.0439 | 0 | 12.1563 | 8.7794 | 16.8319 | 1115.6265 |
| passengerClass2nd | 1.0091 | 0.1984 | 5.0871 | 0 | 2.7431 | 1.8595 | 4.0466 | 174.3106 |
| passengerClass1st | 2.2897 | 0.2258 | 10.1401 | 0 | 9.8716 | 6.3413 | 15.3672 | 887.1586 |
✅ Interpretasi Tabel Koefisien
Berikut interpretasi prediktor yang signifikan (p-value < 0,05):
pred_prob_biner <- predict(fit_biner, type = "response")
data_biner_pred <- data_biner %>%
mutate(
prob_selamat = pred_prob_biner,
prediksi = ifelse(prob_selamat >= 0.5, "yes", "no"),
prediksi = factor(prediksi, levels = c("no", "yes"))
)
head(data_biner_pred)| survived | age | sex | passengerClass | prob_selamat | prediksi | |
|---|---|---|---|---|---|---|
| Allen, Miss. Elisabeth Walton | yes | 29.0000 | female | 1st | 0.9258533 | yes |
| Allison, Master. Hudson Trevor | yes | 0.9167 | male | 1st | 0.7296211 | yes |
| Allison, Miss. Helen Loraine | no | 2.0000 | female | 1st | 0.9693290 | yes |
| Allison, Mr. Hudson Joshua Crei | no | 30.0000 | male | 1st | 0.4981081 | no |
| Allison, Mrs. Hudson J C (Bessi | no | 25.0000 | female | 1st | 0.9347616 | yes |
| Anderson, Mr. Harry | yes | 48.0000 | male | 1st | 0.3482715 | no |
conf_biner <- table(
Aktual = data_biner_pred$survived,
Prediksi = data_biner_pred$prediksi
)
conf_biner#> Prediksi
#> Aktual no yes
#> no 520 99
#> yes 126 301
#> Akurasi: 0.7849
grid_biner <- expand.grid(
age = seq(min(data_biner$age), max(data_biner$age), length.out = 100),
sex = factor("female", levels = c("male", "female")),
passengerClass = factor("1st", levels = c("3rd", "2nd", "1st"))
)
pred_biner <- predict(fit_biner, newdata = grid_biner, type = "link", se.fit = TRUE)
grid_biner <- grid_biner %>%
mutate(
prob_selamat = plogis(pred_biner$fit),
CI_low = plogis(pred_biner$fit - 1.96 * pred_biner$se.fit),
CI_high = plogis(pred_biner$fit + 1.96 * pred_biner$se.fit)
)
ggplot(grid_biner, aes(x = age)) +
geom_ribbon(aes(ymin = CI_low, ymax = CI_high), fill = "#5568b8", alpha = 0.2) +
geom_line(aes(y = prob_selamat), linewidth = 1.35, color = "#5568b8") +
scale_y_continuous(labels = percent_format()) +
labs(
title = "Prediksi Probabilitas Selamat Penumpang Titanic",
subtitle = "sex = female, passengerClass = 1st (ditahan pada nilai rata-rata).",
x = "Umur (age)",
y = "Probabilitas selamat"
)Dataset yang digunakan adalah TravelMode (package
AER), berisi data pilihan moda transportasi 210 individu dalam
perjalanan antara Sydney dan Melbourne, Australia. Variabel respons
adalah mode (nominal: car, air, train, bus) dengan kategori
referensi car, dan prediktor yang digunakan adalah
income (pendapatan rumah tangga) dan size
(ukuran rombongan).
library(tidyverse)
library(nnet)
library(knitr)
library(kableExtra)
library(scales)
library(AER)
data(TravelMode)
# Filter hanya baris choice == "yes" (pilihan aktual tiap individu)
# lalu pilih kolom yang digunakan saja
data_multi <- TravelMode[TravelMode$choice == "yes", c("mode", "income", "size")]
data_multi <- data_multi %>%
mutate(mode = factor(mode, levels = c("car", "air", "train", "bus")))
dplyr::glimpse(data_multi)#> Rows: 210
#> Columns: 3
#> $ mode <fct> car, car, car, car, car, train, air, car, car, car, car, car, c…
#> $ income <int> 35, 30, 40, 70, 45, 20, 45, 12, 40, 70, 15, 35, 50, 40, 26, 26,…
#> $ size <int> 1, 2, 1, 3, 2, 1, 1, 1, 1, 2, 2, 2, 4, 1, 4, 1, 1, 1, 1, 2, 1, …
data_multi %>%
count(mode) %>%
mutate(proporsi = n / sum(n)) %>%
kable(digits = 3, caption = "Distribusi pilihan moda transportasi") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE)| mode | n | proporsi |
|---|---|---|
| car | 59 | 0.281 |
| air | 58 | 0.276 |
| train | 63 | 0.300 |
| bus | 30 | 0.143 |
data_multi %>%
count(mode) %>%
mutate(proporsi = n / sum(n)) %>%
ggplot(aes(x = mode, y = proporsi, fill = mode)) +
geom_col(width = 0.65, alpha = 0.94) +
geom_text(aes(label = percent(proporsi, accuracy = 0.1)),
vjust = -0.4, fontface = "bold", color = "#243142") +
scale_y_continuous(labels = percent_format(), limits = c(0, 0.65)) +
labs(
title = "Distribusi Pilihan Moda Transportasi",
subtitle = "Respons multinomial: kategori tidak memiliki urutan alami.",
x = NULL,
y = "Proporsi"
) +
theme_minimal() +
theme(legend.position = "none")fit_multi <- nnet::multinom(
mode ~ income + size,
data = data_multi,
trace = FALSE
)
summary(fit_multi)#> Call:
#> nnet::multinom(formula = mode ~ income + size, data = data_multi,
#> trace = FALSE)
#>
#> Coefficients:
#> (Intercept) income size
#> air 0.9434269 0.003544083 -0.6005292
#> train 2.4938521 -0.057308055 -0.3098200
#> bus 1.9779430 -0.030324700 -0.9404211
#>
#> Std. Errors:
#> (Intercept) income size
#> air 0.5498451 0.01030471 0.1991981
#> train 0.5357206 0.01184160 0.1955605
#> bus 0.6717164 0.01322282 0.3244556
#>
#> Residual Deviance: 506.6817
#> AIC: 524.6817
multi_sum <- summary(fit_multi)
coef_multi <- as.data.frame(multi_sum$coefficients)
se_multi <- as.data.frame(multi_sum$standard.errors)
coef_long <- coef_multi %>%
tibble::rownames_to_column("kategori") %>%
tidyr::pivot_longer(cols = -kategori, names_to = "variabel", values_to = "estimate")
se_long <- se_multi %>%
tibble::rownames_to_column("kategori") %>%
tidyr::pivot_longer(cols = -kategori, names_to = "variabel", values_to = "std_error")
result_multi <- coef_long %>%
left_join(se_long, by = c("kategori", "variabel")) %>%
mutate(
z_value = estimate / std_error,
p_value = 2 * (1 - pnorm(abs(z_value))),
RRR = exp(estimate),
CI_low = exp(estimate - 1.96 * std_error),
CI_high = exp(estimate + 1.96 * std_error)
)
result_multi %>%
mutate(across(c(estimate, std_error, z_value, p_value, RRR, CI_low, CI_high), ~ round(.x, 4))) %>%
kable(caption = "Ringkasan hasil regresi logistik multinomial (referensi: car)",
col.names = c("Kategori", "Variabel", "Estimate", "SE", "z-value",
"p-value", "RRR", "CI 2.5%", "CI 97.5%")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE)| Kategori | Variabel | Estimate | SE | z-value | p-value | RRR | CI 2.5% | CI 97.5% |
|---|---|---|---|---|---|---|---|---|
| air | (Intercept) | 0.9434 | 0.5498 | 1.7158 | 0.0862 | 2.5688 | 0.8744 | 7.5468 |
| air | income | 0.0035 | 0.0103 | 0.3439 | 0.7309 | 1.0036 | 0.9835 | 1.0240 |
| air | size | -0.6005 | 0.1992 | -3.0147 | 0.0026 | 0.5485 | 0.3712 | 0.8105 |
| train | (Intercept) | 2.4939 | 0.5357 | 4.6551 | 0.0000 | 12.1078 | 4.2369 | 34.6004 |
| train | income | -0.0573 | 0.0118 | -4.8396 | 0.0000 | 0.9443 | 0.9226 | 0.9665 |
| train | size | -0.3098 | 0.1956 | -1.5843 | 0.1131 | 0.7336 | 0.5000 | 1.0762 |
| bus | (Intercept) | 1.9779 | 0.6717 | 2.9446 | 0.0032 | 7.2279 | 1.9375 | 26.9641 |
| bus | income | -0.0303 | 0.0132 | -2.2934 | 0.0218 | 0.9701 | 0.9453 | 0.9956 |
| bus | size | -0.9404 | 0.3245 | -2.8985 | 0.0038 | 0.3905 | 0.2067 | 0.7375 |
✅ Interpretasi Tabel Koefisien
Kategori referensi adalah car. Berikut interpretasi prediktor yang signifikan (p-value < 0,05):
pred_prob_multi <- predict(fit_multi, type = "probs")
data_multi_pred <- data_multi %>%
bind_cols(as.data.frame(pred_prob_multi)) %>%
mutate(prediksi = predict(fit_multi, type = "class"))
head(data_multi_pred)| mode | income | size | car | air | train | bus | prediksi | |
|---|---|---|---|---|---|---|---|---|
| 4 | car | 35 | 1 | 0.2097901 | 0.3346375 | 0.2507252 | 0.2048473 | air |
| 8 | car | 30 | 2 | 0.2881105 | 0.2476546 | 0.3364045 | 0.1278303 | train |
| 12 | car | 40 | 1 | 0.2293548 | 0.3723860 | 0.2058154 | 0.1924438 | air |
| 16 | car | 70 | 3 | 0.5947585 | 0.3231384 | 0.0514692 | 0.0306339 | car |
| 20 | car | 45 | 2 | 0.3728105 | 0.3379583 | 0.1842733 | 0.1049579 | car |
| 22 | train | 20 | 1 | 0.1454638 | 0.2200174 | 0.4106734 | 0.2238453 | train |
conf_multi <- table(
Aktual = data_multi_pred$mode,
Prediksi = data_multi_pred$prediksi
)
conf_multi#> Prediksi
#> Aktual car air train bus
#> car 29 12 18 0
#> air 16 23 19 0
#> train 12 5 46 0
#> bus 3 11 16 0
#> Akurasi: 0.4667
grid_multi <- expand.grid(
income = seq(min(data_multi$income), max(data_multi$income), length.out = 100),
size = mean(data_multi$size)
)
grid_prob_multi <- predict(fit_multi, newdata = grid_multi, type = "probs")
grid_multi_plot <- grid_multi %>%
bind_cols(as.data.frame(grid_prob_multi)) %>%
tidyr::pivot_longer(
cols = c("car", "air", "train", "bus"),
names_to = "mode",
values_to = "probabilitas"
)
ggplot(grid_multi_plot, aes(x = income, y = probabilitas, color = mode)) +
geom_line(linewidth = 1.35) +
scale_y_continuous(labels = percent_format()) +
labs(
title = "Prediksi Probabilitas Pilihan Moda Transportasi",
subtitle = "Ukuran rombongan (size) ditahan pada nilai rata-rata.",
x = "Pendapatan rumah tangga (income)",
y = "Probabilitas prediksi",
color = "Moda transportasi"
)✅ Interpretasi Grafik
Grafik di atas membantu membaca hasil model secara lebih substantif.
Dataset yang digunakan adalah politicalInformation
(package pscl), berisi data 1807 responden survei pemilu Amerika tahun
2000. Variabel respons adalah y (tingkat pengetahuan
politik: Very Low < Fairly Low < Average < Fairly High <
Very High) dan prediktor yang digunakan adalah age (umur),
collegeDegree (gelar sarjana), dan female
(jenis kelamin).
library(MASS)
library(pscl)
data(politicalInformation)
# Hapus baris dengan missing value di kolom y (7 observasi)
data_ord <- politicalInformation[politicalInformation$y != "" &
!is.na(politicalInformation$y),
c("y", "collegeDegree", "female", "age")]
data_ord <- data_ord %>%
mutate(
y = ordered(y, levels = c("Very Low", "Fairly Low", "Average",
"Fairly High", "Very High")),
collegeDegree = factor(collegeDegree, levels = c("No", "Yes")),
female = factor(female, levels = c("No", "Yes"))
) %>%
na.omit()
dplyr::glimpse(data_ord)#> Rows: 1,792
#> Columns: 4
#> $ y <ord> Fairly High, Average, Very High, Average, Fairly High, A…
#> $ collegeDegree <fct> Yes, No, No, No, Yes, No, No, Yes, Yes, No, No, No, No, …
#> $ female <fct> No, Yes, Yes, No, Yes, No, No, Yes, Yes, Yes, Yes, No, N…
#> $ age <dbl> 49, 35, 57, 63, 40, 77, 43, 47, 26, 48, 41, 41, 37, 18, …
data_ord %>%
count(y) %>%
mutate(proporsi = n / sum(n)) %>%
kable(digits = 3, caption = "Distribusi tingkat pengetahuan politik") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE)| y | n | proporsi |
|---|---|---|
| Very Low | 105 | 0.059 |
| Fairly Low | 332 | 0.185 |
| Average | 585 | 0.326 |
| Fairly High | 447 | 0.249 |
| Very High | 323 | 0.180 |
fit_ord <- MASS::polr(
y ~ age + collegeDegree + female,
data = data_ord,
Hess = TRUE,
method = "logistic"
)
summary(fit_ord)#> Call:
#> MASS::polr(formula = y ~ age + collegeDegree + female, data = data_ord,
#> Hess = TRUE, method = "logistic")
#>
#> Coefficients:
#> Value Std. Error t value
#> age 0.01579 0.002548 6.198
#> collegeDegreeYes 1.54808 0.094011 16.467
#> femaleYes -0.64757 0.087733 -7.381
#>
#> Intercepts:
#> Value Std. Error t value
#> Very Low|Fairly Low -2.0677 0.1644 -12.5809
#> Fairly Low|Average -0.3103 0.1451 -2.1382
#> Average|Fairly High 1.3350 0.1493 8.9392
#> Fairly High|Very High 2.7582 0.1594 17.3066
#>
#> Residual Deviance: 5004.976
#> AIC: 5018.976
coef_full <- as.data.frame(coef(summary(fit_ord))) %>%
tibble::rownames_to_column("variabel") %>%
rename(estimate_polr = Value, std_error = `Std. Error`, z_value = `t value`) %>%
mutate(
p_value = 2 * (1 - pnorm(abs(z_value))),
jenis = ifelse(variabel %in% names(fit_ord$zeta), "Cutpoint", "Koefisien"),
estimate_slide = ifelse(jenis == "Koefisien", -estimate_polr, estimate_polr),
OR_polr = ifelse(jenis == "Koefisien", exp(estimate_polr), NA),
OR_slide = ifelse(jenis == "Koefisien", exp(estimate_slide), NA),
CI_polr_low = ifelse(jenis == "Koefisien", exp(estimate_polr - 1.96 * std_error), NA),
CI_polr_high = ifelse(jenis == "Koefisien", exp(estimate_polr + 1.96 * std_error), NA)
)
coef_full %>%
mutate(across(c(estimate_polr, std_error, z_value, p_value,
estimate_slide, OR_polr, OR_slide, CI_polr_low, CI_polr_high),
~ round(.x, 4))) %>%
kable(caption = "Ringkasan hasil regresi logistik ordinal",
col.names = c("Parameter", "Estimate polr", "SE", "z/t-value", "p-value",
"Jenis", "Estimate slide", "OR polr", "OR slide",
"CI polr 2.5%", "CI polr 97.5%")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE)| Parameter | Estimate polr | SE | z/t-value | p-value | Jenis | Estimate slide | OR polr | OR slide | CI polr 2.5% | CI polr 97.5% |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0.0158 | 0.0025 | 6.1978 | 0.0000 | Koefisien | -0.0158 | 1.0159 | 0.9843 | 1.0109 | 1.0210 |
| collegeDegreeYes | 1.5481 | 0.0940 | 16.4671 | 0.0000 | Koefisien | -1.5481 | 4.7024 | 0.2127 | 3.9111 | 5.6539 |
| femaleYes | -0.6476 | 0.0877 | -7.3812 | 0.0000 | Koefisien | 0.6476 | 0.5233 | 1.9109 | 0.4406 | 0.6215 |
| Very Low|Fairly Low | -2.0677 | 0.1644 | -12.5809 | 0.0000 | Cutpoint | -2.0677 | NA | NA | NA | NA |
| Fairly Low|Average | -0.3103 | 0.1451 | -2.1382 | 0.0325 | Cutpoint | -0.3103 | NA | NA | NA | NA |
| Average|Fairly High | 1.3350 | 0.1493 | 8.9392 | 0.0000 | Cutpoint | 1.3350 | NA | NA | NA | NA |
| Fairly High|Very High | 2.7582 | 0.1594 | 17.3066 | 0.0000 | Cutpoint | 2.7582 | NA | NA | NA | NA |
✅ Interpretasi Tabel Koefisien
Berikut interpretasi prediktor yang signifikan (p-value < 0,05):
pred_prob_ord <- predict(fit_ord, type = "probs")
data_ord_pred <- data_ord %>%
bind_cols(as.data.frame(pred_prob_ord)) %>%
mutate(prediksi = predict(fit_ord, type = "class"))
head(data_ord_pred)| y | collegeDegree | female | age | Very Low | Fairly Low | Average | Fairly High | Very High | prediksi |
|---|---|---|---|---|---|---|---|---|---|
| Fairly High | Yes | No | 49 | 0.0122544 | 0.0548414 | 0.2044293 | 0.3358621 | 0.3926127 | Very High |
| Average | No | Yes | 35 | 0.1220870 | 0.3242567 | 0.3605308 | 0.1386026 | 0.0545229 | Average |
| Very High | No | Yes | 57 | 0.0894619 | 0.2734231 | 0.3840664 | 0.1775863 | 0.0754624 | Average |
| Average | No | No | 63 | 0.0446792 | 0.1686151 | 0.3709208 | 0.2694127 | 0.1463722 | Average |
| Fairly High | Yes | Yes | 40 | 0.0266009 | 0.1101562 | 0.3141003 | 0.3222663 | 0.2268763 | Fairly High |
| Average | No | No | 77 | 0.0361374 | 0.1424041 | 0.3511790 | 0.2940731 | 0.1762065 | Average |
#> Prediksi
#> Aktual Very Low Fairly Low Average Fairly High Very High
#> Very Low 0 10 87 6 2
#> Fairly Low 0 39 258 22 13
#> Average 0 28 424 86 47
#> Fairly High 0 12 219 109 107
#> Very High 0 2 125 91 105
#> Akurasi: 0.3778
grid_ord <- expand.grid(
age = seq(min(data_ord$age, na.rm = TRUE),
max(data_ord$age, na.rm = TRUE), length.out = 100),
collegeDegree = factor("Yes", levels = c("No", "Yes")),
female = factor("Yes", levels = c("No", "Yes"))
)
grid_prob_ord <- predict(fit_ord, newdata = grid_ord, type = "probs")
grid_ord_plot <- grid_ord %>%
bind_cols(as.data.frame(grid_prob_ord)) %>%
tidyr::pivot_longer(
cols = c("Very Low", "Fairly Low", "Average", "Fairly High", "Very High"),
names_to = "kategori",
values_to = "probabilitas"
) %>%
mutate(kategori = ordered(kategori, levels = c("Very Low", "Fairly Low", "Average",
"Fairly High", "Very High")))
ggplot(grid_ord_plot, aes(x = age, y = probabilitas, color = kategori)) +
geom_line(linewidth = 1.35) +
scale_y_continuous(labels = percent_format()) +
labs(
title = "Prediksi Probabilitas Tingkat Pengetahuan Politik",
subtitle = "collegeDegree = Yes, female = Yes (Variabel lain ditahan pada nilai rata-rata).",
x = "Umur (age)",
y = "Probabilitas prediksi",
color = "Tingkat pengetahuan"
)✅ Interpretasi Grafik
Grafik di atas membantu membaca hasil model secara lebih substantif (collegeDegree = Yes, female = Yes, age divariasikan).
Dataset yang digunakan adalah Affairs (package AER),
berisi data 601 responden tentang jumlah perselingkuhan dalam setahun
terakhir. Variabel respons adalah affairs (count: 0, 1, 2,
…) dan prediktor yang digunakan adalah age (umur),
yearsmarried (lama menikah), dan religiousness
(tingkat religiusitas).
library(AER)
data(Affairs)
data_pois <- Affairs %>%
dplyr::select(affairs, age, yearsmarried, religiousness)
dplyr::glimpse(data_pois)#> Rows: 601
#> Columns: 4
#> $ affairs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ age <dbl> 37, 27, 32, 57, 22, 32, 22, 57, 32, 22, 37, 27, 47, 22, …
#> $ yearsmarried <dbl> 10.00, 4.00, 15.00, 15.00, 0.75, 1.50, 0.75, 15.00, 15.0…
#> $ religiousness <int> 3, 4, 1, 5, 2, 2, 2, 2, 4, 4, 2, 4, 5, 2, 4, 1, 2, 3, 2,…
data_pois %>%
count(affairs) %>%
mutate(proporsi = n / sum(n)) %>%
kable(digits = 3, caption = "Distribusi jumlah perselingkuhan") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE)| affairs | n | proporsi |
|---|---|---|
| 0 | 451 | 0.750 |
| 1 | 34 | 0.057 |
| 2 | 17 | 0.028 |
| 3 | 19 | 0.032 |
| 7 | 42 | 0.070 |
| 12 | 38 | 0.063 |
data_pois %>%
count(affairs) %>%
ggplot(aes(x = affairs, y = n)) +
geom_col(width = 0.8, fill = "#2f7f73", alpha = 0.92) +
scale_x_continuous(breaks = scales::pretty_breaks()) +
labs(
title = "Distribusi Jumlah Perselingkuhan",
subtitle = "Respons Poisson berupa hitungan bilangan bulat nonnegatif.",
x = "Jumlah perselingkuhan",
y = "Frekuensi"
)fit_pois <- glm(
affairs ~ age + yearsmarried + religiousness,
data = data_pois,
family = poisson(link = "log")
)
summary(fit_pois)#>
#> Call:
#> glm(formula = affairs ~ age + yearsmarried + religiousness, family = poisson(link = "log"),
#> data = data_pois)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.130828 0.157088 7.199 6.08e-13 ***
#> age -0.026825 0.005672 -4.729 2.25e-06 ***
#> yearsmarried 0.129700 0.009671 13.412 < 2e-16 ***
#> religiousness -0.366042 0.029512 -12.403 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 2925.5 on 600 degrees of freedom
#> Residual deviance: 2584.1 on 597 degrees of freedom
#> AIC: 3086.1
#>
#> Number of Fisher Scoring iterations: 6
pois_coef <- as.data.frame(coef(summary(fit_pois))) %>%
tibble::rownames_to_column("parameter") %>%
dplyr::rename(
estimate = Estimate,
std_error = `Std. Error`,
z_value = `z value`,
p_value = `Pr(>|z|)`
) %>%
mutate(
IRR = exp(estimate),
CI_low = exp(estimate - 1.96 * std_error),
CI_high = exp(estimate + 1.96 * std_error),
perubahan_persen = 100 * (IRR - 1)
)
pois_coef %>%
mutate(across(c(estimate, std_error, z_value, p_value,
IRR, CI_low, CI_high, perubahan_persen), ~ round(.x, 4))) %>%
kable(caption = "Ringkasan hasil regresi Poisson",
col.names = c("Parameter", "Estimate", "SE", "z-value", "p-value",
"IRR", "CI 2.5%", "CI 97.5%", "Perubahan (%)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE)| Parameter | Estimate | SE | z-value | p-value | IRR | CI 2.5% | CI 97.5% | Perubahan (%) |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 1.1308 | 0.1571 | 7.1987 | 0 | 3.0982 | 2.2772 | 4.2153 | 209.8222 |
| age | -0.0268 | 0.0057 | -4.7294 | 0 | 0.9735 | 0.9628 | 0.9844 | -2.6468 |
| yearsmarried | 0.1297 | 0.0097 | 13.4117 | 0 | 1.1385 | 1.1171 | 1.1603 | 13.8487 |
| religiousness | -0.3660 | 0.0295 | -12.4030 | 0 | 0.6935 | 0.6545 | 0.7348 | -30.6527 |
✅ Interpretasi Tabel Koefisien
Berikut interpretasi prediktor yang signifikan (p-value < 0,05):
grid_pois <- expand.grid(
religiousness = seq(min(data_pois$religiousness),
max(data_pois$religiousness), length.out = 100),
age = mean(data_pois$age),
yearsmarried = mean(data_pois$yearsmarried)
)
pred <- predict(fit_pois, newdata = grid_pois, type = "link", se.fit = TRUE)
grid_pois <- grid_pois %>%
mutate(
prediksi_rate = exp(pred$fit),
CI_low = exp(pred$fit - 1.96 * pred$se.fit),
CI_high = exp(pred$fit + 1.96 * pred$se.fit)
)
ggplot(grid_pois, aes(x = religiousness)) +
geom_ribbon(aes(ymin = CI_low, ymax = CI_high), fill = "#2f7f73", alpha = 0.2) +
geom_line(aes(y = prediksi_rate), linewidth = 1.35, color = "#2f7f73") +
labs(
title = "Prediksi Rata-rata Jumlah Perselingkuhan",
subtitle = "age dan yearsmarried ditahan pada nilai rata-rata. Area = CI 95%.",
x = "Tingkat religiusitas",
y = "Prediksi rata-rata jumlah perselingkuhan"
)