<h2>Bagian I β Regresi Logistik Multinomial</h2>
<p>Dataset: Iris | Respons: Species (3 kelas nominal)</p>
π Konsep Dasar
Regresi logistik multinomial digunakan saat variabel respons bersifat nominal dengan tiga atau lebih kategori tanpa urutan alami. Model menggunakan satu kategori sebagai baseline (referensi) dan memodelkan log-odds masing-masing kategori lainnya terhadap baseline:
log(P(Y=k) / P(Y=ref)) = Ξ²ββ + Ξ²ββXβ + β¦ + Ξ²ββXβ
Koefisien diinterpretasikan sebagai perubahan log-odds
memilih kategori k dibanding referensi, atau dalam bentuk
Relative Risk Ratio (RRR) = exp(Ξ²β±Όβ).
Model difit dengan paket nnet::multinom().
# ββ SESUAIKAN: ganti nama file dan nama kolom jika berbeda ββββββββββββββββββββ
data_multi <- read.csv("data_multinomial.csv", sep = ";")
var_y_multi <- "Species"
var_x_multi <- setdiff(colnames(data_multi), c("Id", "Species"))
data_multi <- data_multi %>%
mutate(
Species = factor(Species,
levels = c("Iris-setosa", "Iris-versicolor", "Iris-virginica"))
)
dplyr::glimpse(data_multi)## Rows: 150
## Columns: 6
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1β¦
## $ SepalLengthCm <int> 51, 49, 47, 46, 50, 54, 46, 50, 44, 49, 54, 48, 48, 43, β¦
## $ SepalWidthCm <int> 35, 30, 32, 31, 36, 39, 34, 34, 29, 31, 37, 34, 30, 30, β¦
## $ PetalLengthCm <int> 14, 14, 13, 15, 14, 17, 14, 15, 14, 15, 15, 16, 14, 11, β¦
## $ PetalWidthCm <int> 2, 2, 2, 2, 2, 4, 3, 2, 2, 1, 2, 2, 1, 1, 2, 4, 4, 3, 3,β¦
## $ Species <fct> Iris-setosa, Iris-setosa, Iris-setosa, Iris-setosa, Irisβ¦
data_multi %>%
count(Species) %>%
mutate(Proporsi = scales::percent(n / sum(n), accuracy = 0.1)) %>%
knitr::kable(caption = "Distribusi Spesies Iris") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)| Species | n | Proporsi |
|---|---|---|
| Iris-setosa | 50 | 33.3% |
| Iris-versicolor | 50 | 33.3% |
| Iris-virginica | 50 | 33.3% |
data_multi %>%
count(Species) %>%
mutate(proporsi = n / sum(n)) %>%
ggplot(aes(x = Species, y = proporsi, fill = Species)) +
geom_col(width = 0.65, alpha = 0.94) +
geom_text(aes(label = scales::percent(proporsi, accuracy = 0.1)),
vjust = -0.4, fontface = "bold", color = "#243142") +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.65)) +
scale_fill_manual(values = c("#2f7f73","#5568b8","#d18b2f")) +
labs(title = "Distribusi Spesies Iris",
subtitle = "Respons multinomial β 3 kategori tanpa urutan",
x = NULL, y = "Proporsi") +
theme_minimal() + theme(legend.position = "none")formula_multi <- as.formula(paste(var_y_multi, "~", paste(var_x_multi, collapse = " + ")))
cat("Formula:", deparse(formula_multi), "\n")## Formula: Species ~ SepalLengthCm + SepalWidthCm + PetalLengthCm + PetalWidthCm
## Call:
## nnet::multinom(formula = formula_multi, data = data_multi, trace = FALSE)
##
## Coefficients:
## (Intercept) SepalLengthCm SepalWidthCm PetalLengthCm
## Iris-versicolor 22.24756 -0.8914206 -2.814642 4.180154
## Iris-virginica -20.39006 -1.1379418 -3.482729 5.123089
## PetalWidthCm
## Iris-versicolor 1.351080
## Iris-virginica 3.179689
##
## Std. Errors:
## (Intercept) SepalLengthCm SepalWidthCm PetalLengthCm
## Iris-versicolor 12.85375 0.1197144 0.2239780 0.2368581
## Iris-virginica 12.85375 0.1197150 0.2239769 0.2368591
## PetalWidthCm
## Iris-versicolor 0.4871285
## Iris-virginica 0.4871288
##
## Residual Deviance: 11.89855
## AIC: 31.89855
multi_sum <- summary(fit_multi)
coef_long <- as.data.frame(multi_sum$coefficients) %>%
tibble::rownames_to_column("kategori") %>%
tidyr::pivot_longer(-kategori, names_to = "variabel", values_to = "estimate")
se_long <- as.data.frame(multi_sum$standard.errors) %>%
tibble::rownames_to_column("kategori") %>%
tidyr::pivot_longer(-kategori, names_to = "variabel", values_to = "std_error")
result_multi <- coef_long %>%
left_join(se_long, by = c("kategori","variabel")) %>%
mutate(
z_value = estimate / std_error,
p_value = 2 * (1 - pnorm(abs(z_value))),
RRR = exp(estimate),
CI_low = exp(estimate - 1.96 * std_error),
CI_high = exp(estimate + 1.96 * std_error),
Signifikansi = case_when(
p_value < 0.001 ~ "***", p_value < 0.01 ~ "**",
p_value < 0.05 ~ "*", p_value < 0.1 ~ ".",
TRUE ~ ""
)
) %>%
mutate(across(c(estimate,std_error,z_value,p_value,RRR,CI_low,CI_high), ~round(.x,4)))
result_multi %>%
knitr::kable(
caption = "Ringkasan Model Multinomial (Base: Iris-setosa)",
col.names = c("Kategori","Variabel","Estimate","SE","z-value","p-value","RRR","CI 2.5%","CI 97.5%","Sig.")
) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"), full_width = TRUE)| Kategori | Variabel | Estimate | SE | z-value | p-value | RRR | CI 2.5% | CI 97.5% | Sig. |
|---|---|---|---|---|---|---|---|---|---|
| Iris-versicolor | (Intercept) | 22.2476 | 12.8537 | 1.7308 | 0.0835 | 4.591905e+09 | 0.0526 | 4.011665e+20 | . |
| Iris-versicolor | SepalLengthCm | -0.8914 | 0.1197 | -7.4462 | 0.0000 | 4.101000e-01 | 0.3243 | 5.185000e-01 | *** |
| Iris-versicolor | SepalWidthCm | -2.8146 | 0.2240 | -12.5666 | 0.0000 | 5.990000e-02 | 0.0386 | 9.300000e-02 | *** |
| Iris-versicolor | PetalLengthCm | 4.1802 | 0.2369 | 17.6484 | 0.0000 | 6.537590e+01 | 41.0961 | 1.040005e+02 | *** |
| Iris-versicolor | PetalWidthCm | 1.3511 | 0.4871 | 2.7736 | 0.0055 | 3.861600e+00 | 1.4863 | 1.003270e+01 | ** |
| Iris-virginica | (Intercept) | -20.3901 | 12.8537 | -1.5863 | 0.1127 | 0.000000e+00 | 0.0000 | 1.219101e+02 | |
| Iris-virginica | SepalLengthCm | -1.1379 | 0.1197 | -9.5054 | 0.0000 | 3.205000e-01 | 0.2535 | 4.052000e-01 | *** |
| Iris-virginica | SepalWidthCm | -3.4827 | 0.2240 | -15.5495 | 0.0000 | 3.070000e-02 | 0.0198 | 4.770000e-02 | *** |
| Iris-virginica | PetalLengthCm | 5.1231 | 0.2369 | 21.6293 | 0.0000 | 1.678530e+02 | 105.5141 | 2.670224e+02 | *** |
| Iris-virginica | PetalWidthCm | 3.1797 | 0.4871 | 6.5274 | 0.0000 | 2.403930e+01 | 9.2527 | 6.245590e+01 | *** |
data_multi_pred <- data_multi %>%
bind_cols(as.data.frame(predict(fit_multi, type = "probs"))) %>%
mutate(prediksi = predict(fit_multi, type = "class"))
conf_multi <- table(Aktual = data_multi_pred$Species, Prediksi = data_multi_pred$prediksi)
knitr::kable(conf_multi, caption = "Confusion Matrix β Model Multinomial") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Iris-setosa | Iris-versicolor | Iris-virginica | |
|---|---|---|---|
| Iris-setosa | 50 | 0 | 0 |
| Iris-versicolor | 0 | 49 | 1 |
| Iris-virginica | 0 | 1 | 49 |
accuracy_multi <- sum(diag(conf_multi)) / sum(conf_multi)
cat("Akurasi Model:", scales::percent(accuracy_multi, accuracy = 0.01), "\n")## Akurasi Model: 98.67%
grid_multi <- expand.grid(
SepalLengthCm = seq(min(data_multi$SepalLengthCm), max(data_multi$SepalLengthCm), length.out = 100),
SepalWidthCm = mean(data_multi$SepalWidthCm),
PetalLengthCm = mean(data_multi$PetalLengthCm),
PetalWidthCm = mean(data_multi$PetalWidthCm)
)
grid_multi %>%
bind_cols(as.data.frame(predict(fit_multi, newdata = grid_multi, type = "probs"))) %>%
tidyr::pivot_longer(cols = c("Iris-setosa","Iris-versicolor","Iris-virginica"),
names_to = "Species", values_to = "probabilitas") %>%
ggplot(aes(x = SepalLengthCm, y = probabilitas, color = Species)) +
geom_line(linewidth = 1.35) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_manual(values = c("#2f7f73","#5568b8","#d18b2f")) +
labs(title = "Prediksi Probabilitas Spesies Iris",
subtitle = "Variabel lain ditahan pada nilai rata-rata",
x = "Sepal Length (Cm)", y = "Probabilitas Prediksi", color = "Spesies") +
theme_minimal()π‘ Interpretasi β Logistik Multinomial
Sesuaikan teks berikut dengan hasil RRR dari output kamu:
β’
Kategori referensi: Iris-setosa. Semua
interpretasi dibandingkan terhadap kategori ini.
β’ RRR
(Relative Risk Ratio): Jika RRR untuk variabel X pada kategori
Iris-versicolor = 2.5, artinya kenaikan 1 satuan X meningkatkan
risiko relatif menjadi versicolor (vs setosa) sebesar 2.5 kali.
β’
Signifikansi (p-value < 0.05): Hanya variabel dengan
tanda *** atau ** yang secara statistik signifikan membedakan antar
kelas.
β’ Grafik probabilitas: Menunjukkan
bagaimana peluang masing-masing spesies berubah seiring perubahan Sepal
Length, saat variabel lain dipegang tetap.
β’
Akurasi: (lihat output confusion matrix) β
akurasi tinggi menunjukkan model dapat membedakan ketiga spesies dengan
baik.
<h2>Bagian II β Regresi Logistik Ordinal</h2>
<p>Dataset: Student Performance | Respons: Grade (Rendah < Sedang < Tinggi)</p>
π Konsep Dasar
Regresi logistik ordinal (Proportional Odds Model) digunakan saat variabel respons bersifat ordinal β ada urutan alami antar kategori. Model memodelkan log-odds kumulatif:
logit(P(Y β€ j)) = Ξ±β±Ό β (Ξ²βXβ + β¦ + Ξ²βXβ)
Asumsi utama: Proportional Odds β efek setiap
prediktor dianggap sama untuk semua batas kumulatif (parallel lines).
Koefisien diinterpretasikan sebagai log-odds kumulatif, dengan
exp(Ξ²) sebagai odds ratio (polaritas tergantung konvensi
software). Difit dengan MASS::polr().
# ββ SESUAIKAN: ganti nama file dan nama kolom jika berbeda ββββββββββββββββββββ
data_ord <- read.csv("data_ordinal.csv", sep = ";")
data_ord <- data_ord %>%
mutate(
Grade = factor(Grade, levels = c("Rendah","Sedang","Tinggi"), ordered = TRUE),
Gender = as.factor(Gender)
)
dplyr::glimpse(data_ord)## Rows: 6,607
## Columns: 21
## $ Hours_Studied <int> 23, 19, 24, 29, 19, 19, 29, 25, 17, 23, 17,β¦
## $ Attendance <int> 84, 64, 98, 89, 92, 88, 84, 78, 94, 98, 80,β¦
## $ Parental_Involvement <chr> "Low", "Low", "Medium", "Low", "Medium", "Mβ¦
## $ Access_to_Resources <chr> "High", "Medium", "Medium", "Medium", "Mediβ¦
## $ Extracurricular_Activities <chr> "No", "No", "Yes", "Yes", "Yes", "Yes", "Yeβ¦
## $ Sleep_Hours <int> 7, 8, 7, 8, 6, 8, 7, 6, 6, 8, 8, 6, 8, 8, 8β¦
## $ Previous_Scores <int> 73, 59, 91, 98, 65, 89, 68, 50, 80, 71, 88,β¦
## $ Motivation_Level <chr> "Low", "Low", "Medium", "Medium", "Medium",β¦
## $ Internet_Access <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "β¦
## $ Tutoring_Sessions <int> 0, 2, 2, 1, 3, 3, 1, 1, 0, 0, 4, 2, 2, 2, 1β¦
## $ Family_Income <chr> "Low", "Medium", "Medium", "Medium", "Mediuβ¦
## $ Teacher_Quality <chr> "Medium", "Medium", "Medium", "Medium", "Hiβ¦
## $ School_Type <chr> "Public", "Public", "Public", "Public", "Puβ¦
## $ Peer_Influence <chr> "Positive", "Negative", "Neutral", "Negativβ¦
## $ Physical_Activity <int> 3, 4, 4, 4, 4, 3, 2, 2, 1, 5, 4, 2, 4, 3, 4β¦
## $ Learning_Disabilities <chr> "No", "No", "No", "No", "No", "No", "No", "β¦
## $ Parental_Education_Level <chr> "High School", "College", "Postgraduate", "β¦
## $ Distance_from_Home <chr> "Near", "Moderate", "Near", "Moderate", "Neβ¦
## $ Gender <fct> Male, Female, Male, Male, Female, Male, Malβ¦
## $ Exam_Score <int> 67, 61, 74, 71, 70, 71, 67, 66, 69, 72, 68,β¦
## $ Grade <ord> Sedang, Rendah, Sedang, Sedang, Sedang, Sedβ¦
data_ord %>%
count(Grade) %>%
mutate(Proporsi = scales::percent(n / sum(n), accuracy = 0.1)) %>%
knitr::kable(caption = "Distribusi Tingkat Grade Mahasiswa") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)| Grade | n | Proporsi |
|---|---|---|
| Rendah | 316 | 4.8% |
| Sedang | 6167 | 93.3% |
| Tinggi | 124 | 1.9% |
data_ord %>%
count(Grade) %>%
mutate(proporsi = n / sum(n)) %>%
ggplot(aes(x = Grade, y = proporsi, fill = Grade)) +
geom_col(width = 0.65, alpha = 0.94) +
geom_text(aes(label = scales::percent(proporsi, accuracy = 0.1)),
vjust = -0.4, fontface = "bold", color = "#243142") +
scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.18))) +
scale_fill_manual(values = c("#2f7f73","#d18b2f","#5568b8")) +
labs(title = "Distribusi Tingkat Grade",
subtitle = "Respons ordinal: Rendah < Sedang < Tinggi",
x = "Grade", y = "Proporsi") +
theme_minimal() + theme(legend.position = "none")fit_ord <- MASS::polr(
Grade ~ Hours_Studied + Attendance + Previous_Scores + Sleep_Hours + Gender,
data = data_ord,
method = "logistic",
Hess = TRUE
)
summary(fit_ord)## Call:
## MASS::polr(formula = Grade ~ Hours_Studied + Attendance + Previous_Scores +
## Sleep_Hours + Gender, data = data_ord, Hess = TRUE, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## Hours_Studied 0.26628 0.011780 22.6048
## Attendance 0.16474 0.008164 20.1781
## Previous_Scores 0.03508 0.004047 8.6675
## Sleep_Hours -0.02866 0.038071 -0.7529
## GenderMale -0.06097 0.114288 -0.5334
##
## Intercepts:
## Value Std. Error t value
## Rendah|Sedang 15.7674 0.7880 20.0105
## Sedang|Tinggi 27.3238 1.0385 26.3097
##
## Residual Deviance: 2328.836
## AIC: 2342.836
ord_coef <- coef(summary(fit_ord))
result_ord <- as.data.frame(ord_coef) %>%
tibble::rownames_to_column("parameter") %>%
dplyr::rename(estimate = Value, std_error = `Std. Error`, t_value = `t value`) %>%
mutate(
p_value = 2 * (1 - pnorm(abs(t_value))),
jenis = ifelse(grepl("\\|", parameter), "Cutpoint", "Koefisien"),
OR_polr = ifelse(jenis == "Koefisien", exp(estimate), NA_real_),
CI_low_polr = ifelse(jenis == "Koefisien", exp(estimate - 1.96 * std_error), NA_real_),
CI_high_polr = ifelse(jenis == "Koefisien", exp(estimate + 1.96 * std_error), NA_real_),
Signifikansi = case_when(
p_value < 0.001 ~ "***", p_value < 0.01 ~ "**",
p_value < 0.05 ~ "*", p_value < 0.1 ~ ".",
TRUE ~ ""
)
) %>%
mutate(across(c(estimate,std_error,t_value,p_value,OR_polr,CI_low_polr,CI_high_polr), ~round(.x,4)))
result_ord %>%
knitr::kable(
caption = "Ringkasan Model Ordinal (Proportional Odds)",
col.names = c("Parameter","Estimate","SE","t-value","p-value","Jenis","OR polr","CI 2.5%","CI 97.5%","Sig.")
) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"), full_width = TRUE) %>%
row_spec(which(result_ord$jenis == "Cutpoint"), background = "#f0f9f8", bold = TRUE)| Parameter | Estimate | SE | t-value | p-value | Jenis | OR polr | CI 2.5% | CI 97.5% | Sig. |
|---|---|---|---|---|---|---|---|---|---|
| Hours_Studied | 0.2663 | 0.0118 | 22.6048 | 0.0000 | Koefisien | 1.3051 | 1.2753 | 1.3356 | *** |
| Attendance | 0.1647 | 0.0082 | 20.1781 | 0.0000 | Koefisien | 1.1791 | 1.1604 | 1.1981 | *** |
| Previous_Scores | 0.0351 | 0.0040 | 8.6675 | 0.0000 | Koefisien | 1.0357 | 1.0275 | 1.0439 | *** |
| Sleep_Hours | -0.0287 | 0.0381 | -0.7529 | 0.4515 | Koefisien | 0.9717 | 0.9019 | 1.0470 | |
| GenderMale | -0.0610 | 0.1143 | -0.5334 | 0.5937 | Koefisien | 0.9409 | 0.7520 | 1.1771 | |
| Rendah|Sedang | 15.7674 | 0.7880 | 20.0105 | 0.0000 | Cutpoint | NA | NA | NA | *** |
| Sedang|Tinggi | 27.3238 | 1.0385 | 26.3097 | 0.0000 | Cutpoint | NA | NA | NA | *** |
data_ord_pred <- data_ord %>%
bind_cols(as.data.frame(predict(fit_ord, type = "probs"))) %>%
mutate(prediksi = predict(fit_ord, type = "class"))
conf_ord <- table(Aktual = data_ord_pred$Grade, Prediksi = data_ord_pred$prediksi)
knitr::kable(conf_ord, caption = "Confusion Matrix β Model Ordinal") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Rendah | Sedang | Tinggi | |
|---|---|---|---|
| Rendah | 87 | 229 | 0 |
| Sedang | 23 | 6138 | 6 |
| Tinggi | 4 | 104 | 16 |
accuracy_ord <- sum(diag(conf_ord)) / sum(conf_ord)
cat("Akurasi Model:", scales::percent(accuracy_ord, accuracy = 0.01), "\n")## Akurasi Model: 94.46%
grid_ord <- expand.grid(
Hours_Studied = seq(min(data_ord$Hours_Studied, na.rm=TRUE),
max(data_ord$Hours_Studied, na.rm=TRUE), length.out = 120),
Attendance = mean(data_ord$Attendance, na.rm=TRUE),
Previous_Scores = mean(data_ord$Previous_Scores, na.rm=TRUE),
Sleep_Hours = mean(data_ord$Sleep_Hours, na.rm=TRUE),
Gender = "Female"
)
grid_ord %>%
bind_cols(as.data.frame(predict(fit_ord, newdata = grid_ord, type = "probs"))) %>%
tidyr::pivot_longer(cols = c("Rendah","Sedang","Tinggi"),
names_to = "Grade", values_to = "probabilitas") %>%
mutate(Grade = factor(Grade, levels = c("Rendah","Sedang","Tinggi"))) %>%
ggplot(aes(x = Hours_Studied, y = probabilitas, color = Grade)) +
geom_line(linewidth = 1.25) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_manual(values = c("#2f7f73","#d18b2f","#5568b8")) +
labs(title = "Prediksi Probabilitas Grade berdasarkan Jam Belajar",
subtitle = "Variabel lain ditahan pada rata-rata; Gender = Female",
x = "Jam Belajar (Hours Studied)", y = "Probabilitas Prediksi", color = "Grade") +
theme_minimal()fit_clm <- ordinal::clm(
Grade ~ Hours_Studied + Attendance + Previous_Scores + Sleep_Hours + Gender,
data = data_ord, link = "logit"
)
summary(fit_clm)## formula:
## Grade ~ Hours_Studied + Attendance + Previous_Scores + Sleep_Hours + Gender
## data: data_ord
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 6607 -1164.42 2342.84 10(0) 3.34e-11 6.3e+06
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## Hours_Studied 0.266278 0.011779 22.606 <2e-16 ***
## Attendance 0.164743 0.008172 20.161 <2e-16 ***
## Previous_Scores 0.035076 0.004049 8.664 <2e-16 ***
## Sleep_Hours -0.028664 0.038071 -0.753 0.451
## GenderMale -0.060965 0.114288 -0.533 0.594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Rendah|Sedang 15.7674 0.7884 20.0
## Sedang|Tinggi 27.3238 1.0390 26.3
π‘ Interpretasi β Logistik Ordinal
Sesuaikan teks berikut dengan output model kamu:
β’
Koefisien positif (polr): Variabel X yang memiliki
koefisien positif artinya peningkatan X menurunkan peluang berada di
kategori lebih rendah β ekuivalen dengan meningkatkan peluang Grade
lebih tinggi. Perhatikan: MASS::polr menggunakan konvensi tanda
negatif untuk arah βnaikβ.
β’ Cutpoints (Ξ±):
Threshold pemisah antar kategori kumulatif, misalnya batas antara
Rendah|Sedang dan Sedang|Tinggi.
β’ Asumsi Parallel
Lines: Jika nominal_test menghasilkan p-value > 0.05, asumsi
proportional odds tidak ditolak β model valid digunakan.
β’
Akurasi: (lihat output confusion matrix) β
model ordinal umumnya lebih akurat dari tebakan acak jika struktur
ordinal dimanfaatkan dengan baik.
<h2>Bagian III β Regresi Poisson</h2>
<p>Dataset: Traffic Accidents | Respons: injuries (count)</p>
π Konsep Dasar
Regresi Poisson digunakan untuk memodelkan variabel respons berupa hitungan (count) β bilangan bulat non-negatif (0, 1, 2, β¦). Model menggunakan fungsi link log:
log(E[Y|X]) = Ξ²β + Ξ²βXβ + β¦ + Ξ²βXβ
Koefisien diinterpretasikan sebagai Incidence Rate Ratio
(IRR) = exp(Ξ²): kenaikan 1 satuan Xβ±Ό mengalikan
rata-rata hitungan Y sebesar exp(Ξ²β±Ό) kali. Asumsi penting:
mean = variance. Jika variance > mean (overdispersion),
pertimbangkan model Negative Binomial.
# ββ SESUAIKAN: ganti nama file jika berbeda ββββββββββββββββββββββββββββββββββββ
data_poi <- read.csv("data_poisson1.csv", sep = ";")
var_y_poi <- "injuries"
var_x_poi <- setdiff(colnames(data_poi), c("id","date","time","latitude","longitude","injuries"))
data_poi <- data_poi %>% mutate(across(where(is.character), as.factor))
dplyr::glimpse(data_poi)## Rows: 99
## Columns: 13
## $ time <fct> "0,398611111", "0,64375", "0,576388889", "0,00416666β¦
## $ latitude <int> 102909, 365251, 264679, 285199, 222705, 174806, 2519β¦
## $ longitude <int> 94733, 921448, 753151, 774946, 879632, 969596, 84997β¦
## $ severity <fct> Critical, Low, Medium, Medium, Critical, Low, Low, Mβ¦
## $ road_condition <fct> Potholed, Construction, Dry, Potholed, Muddy, Floodiβ¦
## $ weather <fct> Cloudy, Cloudy, Cloudy, Clear, Dust Storm, Heavy Raiβ¦
## $ vehicles_involved <int> 1, 4, 4, 2, 1, 2, 2, 3, 2, 3, 3, 2, 1, 4, 4, 3, 4, 2β¦
## $ injuries <int> 8, 0, 0, 2, 4, 0, 0, 2, 0, 5, 0, 0, 0, 2, 2, 3, 2, 1β¦
## $ fatalities <int> 3, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 1, 1β¦
## $ accident_cause <fct> Poor Road, Poor Road, Weather, Poor Road, Weather, Wβ¦
## $ traffic_density <fct> Moderate, Heavy, Light, Heavy, Light, Light, Moderatβ¦
## $ lane_utilization <fct> Overtaking, Congested Multi-Lane, Single Lane, Congeβ¦
## $ nearby_accidents <int> 5, 1, 8, 44, 1, 3, 22, 1, 19, 3, 6, 8, 5, 10, 5, 48,β¦
data_poi %>%
count(.data[[var_y_poi]], name = "n") %>%
mutate(Proporsi = scales::percent(n / sum(n), accuracy = 0.1)) %>%
knitr::kable(caption = "Distribusi Jumlah Cedera (Injuries)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)| injuries | n | Proporsi |
|---|---|---|
| 0 | 36 | 36.4% |
| 1 | 11 | 11.1% |
| 2 | 24 | 24.2% |
| 3 | 8 | 8.1% |
| 4 | 11 | 11.1% |
| 5 | 2 | 2.0% |
| 6 | 2 | 2.0% |
| 7 | 3 | 3.0% |
| 8 | 2 | 2.0% |
data_poi %>%
count(.data[[var_y_poi]], name = "n") %>%
ggplot(aes(x = .data[[var_y_poi]], y = n)) +
geom_col(width = 0.8, fill = "#2f7f73", alpha = 0.92) +
scale_x_continuous(breaks = scales::pretty_breaks()) +
labs(title = "Distribusi Jumlah Cedera",
subtitle = "Respons Poisson: bilangan bulat non-negatif",
x = "Jumlah Cedera (Injuries)", y = "Frekuensi") +
theme_minimal()# Cek zero-inflation
zero_share <- mean(data_poi[[var_y_poi]] == 0)
cat("Proporsi Y = 0:", scales::percent(zero_share, accuracy = 0.1), "\n")## Proporsi Y = 0: 36.4%
formula_poi <- as.formula(paste(var_y_poi, "~", paste(var_x_poi, collapse = " + ")))
cat("Formula:", deparse(formula_poi), "\n")## Formula: injuries ~ severity + road_condition + weather + vehicles_involved + fatalities + accident_cause + traffic_density + lane_utilization + nearby_accidents
##
## Call:
## glm(formula = formula_poi, family = poisson(link = "log"), data = data_poi)
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.789e+00 1.084e+00 1.650 0.0989 .
## severityHigh -5.697e-01 2.531e-01 -2.251 0.0244 *
## severityLow -2.134e+01 2.785e+03 -0.008 0.9939
## severityMedium -8.193e-01 4.183e-01 -1.959 0.0501 .
## road_conditionDry 2.834e-01 5.420e-01 0.523 0.6011
## road_conditionFlooding 1.385e-01 5.141e-01 0.269 0.7876
## road_conditionMuddy 3.317e-01 5.954e-01 0.557 0.5774
## road_conditionPotholed 7.888e-02 3.976e-01 0.198 0.8428
## road_conditionWet 8.822e-02 5.407e-01 0.163 0.8704
## weatherCloudy -2.006e-02 2.725e-01 -0.074 0.9413
## weatherDust Storm 1.736e-01 8.193e-01 0.212 0.8322
## weatherFog 8.000e-01 7.516e-01 1.064 0.2872
## weatherHeavy Rain 4.969e-01 7.122e-01 0.698 0.4854
## weatherRain NA NA NA NA
## vehicles_involved -1.804e-02 7.637e-02 -0.236 0.8133
## fatalities 9.633e-02 1.239e-01 0.777 0.4370
## accident_causeHuman Error -6.921e-01 8.132e-01 -0.851 0.3947
## accident_causeMechanical Failure -8.933e-01 8.756e-01 -1.020 0.3076
## accident_causePoor Road -8.269e-01 8.960e-01 -0.923 0.3561
## accident_causeSignal Violation -7.722e-01 8.510e-01 -0.907 0.3642
## accident_causeWeather -1.091e+00 9.897e-01 -1.102 0.2705
## traffic_densityLight -4.919e-01 5.360e-01 -0.918 0.3588
## traffic_densityModerate -3.719e-01 5.026e-01 -0.740 0.4593
## lane_utilizationLane Change 6.845e-01 4.176e-01 1.639 0.1012
## lane_utilizationOvertaking 8.180e-01 4.737e-01 1.727 0.0842 .
## lane_utilizationSingle Lane NA NA NA NA
## nearby_accidents 3.621e-03 1.015e-02 0.357 0.7212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 230.14 on 98 degrees of freedom
## Residual deviance: 30.19 on 74 degrees of freedom
## AIC: 258.35
##
## Number of Fisher Scoring iterations: 18
pois_coef <- as.data.frame(coef(summary(fit_pois))) %>%
tibble::rownames_to_column("parameter") %>%
dplyr::rename(estimate = Estimate, std_error = `Std. Error`, z_value = `z value`, p_value = `Pr(>|z|)`) %>%
mutate(
IRR = exp(estimate),
CI_low = exp(estimate - 1.96 * std_error),
CI_high = exp(estimate + 1.96 * std_error),
perubahan_persen = 100 * (IRR - 1),
Signifikansi = case_when(
p_value < 0.001 ~ "***", p_value < 0.01 ~ "**",
p_value < 0.05 ~ "*", p_value < 0.1 ~ ".",
TRUE ~ ""
)
) %>%
mutate(across(c(estimate,std_error,z_value,p_value,IRR,CI_low,CI_high,perubahan_persen), ~round(.x,4)))
pois_coef %>%
knitr::kable(
caption = "Ringkasan Hasil Regresi Poisson",
col.names = c("Parameter","Estimate","SE","z-value","p-value","IRR","CI 2.5%","CI 97.5%","Perubahan (%)","Sig.")
) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"), full_width = TRUE)| Parameter | Estimate | SE | z-value | p-value | IRR | CI 2.5% | CI 97.5% | Perubahan (%) | Sig. |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 1.7888 | 1.0839 | 1.6504 | 0.0989 | 5.9825 | 0.7149 | 50.0605 | 498.2490 | . |
| severityHigh | -0.5697 | 0.2531 | -2.2507 | 0.0244 | 0.5657 | 0.3444 | 0.9291 | -43.4308 |
|
| severityLow | -21.3401 | 2784.9698 | -0.0077 | 0.9939 | 0.0000 | 0.0000 | Inf | -100.0000 | |
| severityMedium | -0.8193 | 0.4183 | -1.9587 | 0.0501 | 0.4407 | 0.1942 | 1.0005 | -55.9258 | . |
| road_conditionDry | 0.2834 | 0.5420 | 0.5229 | 0.6011 | 1.3276 | 0.4589 | 3.8406 | 32.7615 | |
| road_conditionFlooding | 0.1385 | 0.5141 | 0.2694 | 0.7876 | 1.1486 | 0.4193 | 3.1459 | 14.8559 | |
| road_conditionMuddy | 0.3317 | 0.5954 | 0.5572 | 0.5774 | 1.3934 | 0.4338 | 4.4757 | 39.3396 | |
| road_conditionPotholed | 0.0789 | 0.3976 | 0.1984 | 0.8428 | 1.0821 | 0.4964 | 2.3589 | 8.2073 | |
| road_conditionWet | 0.0882 | 0.5407 | 0.1632 | 0.8704 | 1.0922 | 0.3785 | 3.1519 | 9.2232 | |
| weatherCloudy | -0.0201 | 0.2725 | -0.0736 | 0.9413 | 0.9801 | 0.5745 | 1.6721 | -1.9862 | |
| weatherDust Storm | 0.1736 | 0.8193 | 0.2119 | 0.8322 | 1.1896 | 0.2388 | 5.9261 | 18.9600 | |
| weatherFog | 0.8000 | 0.7516 | 1.0643 | 0.2872 | 2.2256 | 0.5101 | 9.7108 | 122.5566 | |
| weatherHeavy Rain | 0.4969 | 0.7122 | 0.6977 | 0.4854 | 1.6436 | 0.4070 | 6.6375 | 64.3563 | |
| vehicles_involved | -0.0180 | 0.0764 | -0.2362 | 0.8133 | 0.9821 | 0.8456 | 1.1407 | -1.7879 | |
| fatalities | 0.0963 | 0.1239 | 0.7773 | 0.4370 | 1.1011 | 0.8637 | 1.4039 | 10.1122 | |
| accident_causeHuman Error | -0.6921 | 0.8132 | -0.8511 | 0.3947 | 0.5005 | 0.1017 | 2.4640 | -49.9463 | |
| accident_causeMechanical Failure | -0.8933 | 0.8756 | -1.0202 | 0.3076 | 0.4093 | 0.0736 | 2.2770 | -59.0696 | |
| accident_causePoor Road | -0.8269 | 0.8960 | -0.9229 | 0.3561 | 0.4374 | 0.0755 | 2.5325 | -56.2593 | |
| accident_causeSignal Violation | -0.7722 | 0.8510 | -0.9073 | 0.3642 | 0.4620 | 0.0871 | 2.4494 | -53.7983 | |
| accident_causeWeather | -1.0906 | 0.9897 | -1.1020 | 0.2705 | 0.3360 | 0.0483 | 2.3375 | -66.3992 | |
| traffic_densityLight | -0.4919 | 0.5360 | -0.9176 | 0.3588 | 0.6115 | 0.2139 | 1.7485 | -38.8514 | |
| traffic_densityModerate | -0.3719 | 0.5026 | -0.7399 | 0.4593 | 0.6894 | 0.2574 | 1.8464 | -31.0591 | |
| lane_utilizationLane Change | 0.6845 | 0.4176 | 1.6393 | 0.1012 | 1.9829 | 0.8747 | 4.4953 | 98.2877 | |
| lane_utilizationOvertaking | 0.8180 | 0.4737 | 1.7268 | 0.0842 | 2.2659 | 0.8954 | 5.7338 | 126.5885 | . |
| nearby_accidents | 0.0036 | 0.0101 | 0.3568 | 0.7212 | 1.0036 | 0.9839 | 1.0238 | 0.3627 |
dispersion_pois <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
tibble::tibble(
`Dispersion Pearson` = round(dispersion_pois, 4),
Interpretasi = dplyr::case_when(
dispersion_pois < 1.5 ~ "β
Tidak ada indikasi overdispersion berat",
dispersion_pois < 2.5 ~ "β οΈ Indikasi overdispersion sedang β pertimbangkan Negative Binomial",
TRUE ~ "π¨ Overdispersion kuat β sangat disarankan model Negative Binomial"
)
) %>%
knitr::kable(caption = "Uji Overdispersion (Rasio Dispersion Pearson)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)| Dispersion Pearson | Interpretasi |
|---|---|
| 0.3242 | β Tidak ada indikasi overdispersion berat | |
get_typical <- function(x) {
if (is.numeric(x)) return(mean(x, na.rm = TRUE))
t <- table(x); return(names(t)[which.max(t)])
}
baseline_data <- as.data.frame(lapply(data_poi[var_x_poi], get_typical))
grid_pois <- baseline_data[rep(1, 100), ]
grid_pois$vehicles_involved <- seq(
min(data_poi$vehicles_involved, na.rm = TRUE),
max(data_poi$vehicles_involved, na.rm = TRUE),
length.out = 100
)
pred_pois <- predict(fit_pois, newdata = grid_pois, type = "link", se.fit = TRUE)
grid_pois %>%
mutate(
rate = exp(pred_pois$fit),
rate_low = exp(pred_pois$fit - 1.96 * pred_pois$se.fit),
rate_high = exp(pred_pois$fit + 1.96 * pred_pois$se.fit)
) %>%
ggplot(aes(x = vehicles_involved, y = rate)) +
geom_ribbon(aes(ymin = rate_low, ymax = rate_high), fill = "#2f7f73", alpha = 0.16) +
geom_line(color = "#2f7f73", linewidth = 1.35) +
labs(title = "Prediksi Jumlah Cedera vs Jumlah Kendaraan",
subtitle = "Variabel lain ditahan pada nilai rata-rata/modus",
x = "Jumlah Kendaraan Terlibat", y = "Prediksi Cedera (Injuries)") +
theme_minimal()# OLS hanya pada observasi Y > 0
data_poi_pos <- data_poi %>% filter(.data[[var_y_poi]] > 0)
formula_log_ols <- as.formula(paste("log(", var_y_poi, ") ~", paste(var_x_poi, collapse = " + ")))
fit_log_ols <- lm(formula_log_ols, data = data_poi_pos)
tibble::tibble(
Aspek = c("Jenis respons","Nilai nol","Target model","Fungsi link","Interpretasi koef.","Prediksi","Kapan lebih tepat?"),
`Regresi Poisson` = c(
"Hitungan non-negatif","Dapat langsung menangani Y = 0",
"E(Y|X)","log{E(Y|X)} = XΞ²","IRR = exp(Ξ²)",
"Selalu non-negatif","Count data, banyak nol, rate kejadian"
),
`OLS pada log(Y)` = c(
"Kontinu positif atau hitungan besar tanpa nol","Tidak dapat memakai Y = 0 tanpa modifikasi",
"E(log Y|X)","log(Y) = XΞ² + Ξ΅","Multiplier pada geometric mean",
"Perlu retransformation","Y positif, log-scale residual baik"
)
) %>%
knitr::kable(caption = "Perbandingan Regresi Poisson vs OLS pada log(Y)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"), full_width = TRUE)| Aspek | Regresi Poisson | OLS pada log(Y) |
|---|---|---|
| Jenis respons | Hitungan non-negatif | Kontinu positif atau hitungan besar tanpa nol |
| Nilai nol | Dapat langsung menangani Y = 0 | Tidak dapat memakai Y = 0 tanpa modifikasi |
| Target model | E(Y|X) | E(log Y|X) |
| Fungsi link | log{E(Y|X)} = XΞ² | log(Y) = XΞ² + Ξ΅ |
| Interpretasi koef. | IRR = exp(Ξ²) | Multiplier pada geometric mean |
| Prediksi | Selalu non-negatif | Perlu retransformation |
| Kapan lebih tepat? | Count data, banyak nol, rate kejadian | Y positif, log-scale residual baik |
π‘ Interpretasi β Regresi Poisson
Sesuaikan teks berikut dengan IRR dari output kamu:
β’
IRR (Incidence Rate Ratio): Jika IRR untuk
vehicles_involved = 1.35, artinya setiap penambahan 1 kendaraan
yang terlibat meningkatkan rata-rata jumlah cedera sebesar 35%.
β’
IRR < 1: Variabel tersebut berkaitan dengan
penurunan rata-rata hitungan.
β’ IRR = 1: Variabel
tidak berpengaruh pada rata-rata hitungan.
β’
Overdispersion: Jika Dispersion Pearson jauh di atas 1
(misal > 2.5), model Negative Binomial (MASS::glm.nb())
lebih tepat.
β’ Zero-inflation: Jika proporsi Y = 0
sangat tinggi (> 30%), pertimbangkan model Zero-Inflated Poisson
(ZIP).
tibble::tibble(
`Model` = c("Logistik Multinomial","Logistik Ordinal","Poisson"),
`Dataset` = c("Iris","Student Grade","Traffic Accidents"),
`Jenis Respons` = c("Nominal β₯ 3 kat.","Ordinal β₯ 3 kat.","Count (hitungan)"),
`Fungsi Link` = c("logit (baseline)","logit (kumulatif)","log"),
`Fungsi R` = c("nnet::multinom()","MASS::polr()","glm(..., poisson)"),
`Interpretasi Ξ²` = c("Rel. Risk Ratio (RRR)","Cumul. Odds Ratio","Inc. Rate Ratio (IRR)"),
`Asumsi Utama` = c(
"IIA (Independence of Irrelevant Alternatives)",
"Proportional odds",
"Mean = Variance (equidispersion)"
)
) %>%
knitr::kable(caption = "Perbandingan Ketiga Model Regresi Logistik") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"), full_width = TRUE)| Model | Dataset | Jenis Respons | Fungsi Link | Fungsi R | Interpretasi Ξ² | Asumsi Utama |
|---|---|---|---|---|---|---|
| Logistik Multinomial | Iris | Nominal β₯ 3 kat. | logit (baseline) | nnet::multinom() | Rel. Risk Ratio (RRR) | IIA (Independence of Irrelevant Alternatives) |
| Logistik Ordinal | Student Grade | Ordinal β₯ 3 kat. | logit (kumulatif) | MASS::polr() | Cumul. Odds Ratio | Proportional odds |
| Poisson | Traffic Accidents | Count (hitungan) | log | glm(β¦, poisson) | Inc.Β Rate Ratio (IRR) | Mean = Variance (equidispersion) |
π Kesimpulan Umum
Pemilihan model regresi yang tepat bergantung pada skala
pengukuran variabel respons:
β’ Gunakan
Logistik Multinomial saat Y memiliki 3+ kategori
tanpa urutan alami.
β’ Gunakan Logistik
Ordinal saat Y memiliki 3+ kategori dengan urutan
alami (Rendah < Sedang < Tinggi).
β’ Gunakan
Regresi Poisson saat Y berupa hitungan non-negatif (0,
1, 2, β¦).
Pelanggaran asumsi perlu dideteksi dan ditangani:
overdispersion pada Poisson β Negative Binomial; pelanggaran parallel
lines pada Ordinal β Partial Proportional Odds.