Dokumen ini menyajikan tiga studi kasus analisis regresi yang
diterapkan pada satu dataset yang sama, yaitu data
prestasi siswa matematika (student-mat.csv). Dataset ini
berisi informasi demografis, sosial, dan akademik dari 395 siswa sekolah
menengah di Portugal.
Ketiga model dipilih berdasarkan jenis variabel respons yang dianalisis:
| Studi Kasus | Model | Variabel Y | Jenis Y |
|---|---|---|---|
| 1 | Regresi Logistik Multinomial | Pekerjaan ibu (Mjob) |
Nominal (5 kategori) |
| 2 | Regresi Logistik Ordinal | Tingkat prestasi (prestasi) |
Ordinal (3 kategori) |
| 3 | Regresi Poisson | Jumlah ketidakhadiran (absences) |
Count (bilangan cacah) |
| school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 3 | yes | no | yes | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | yes | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | yes | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
| GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | reputation | mother | 1 | 2 | 0 | no | yes | yes | yes | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 10 | 15 | 15 | 15 |
## 'data.frame': 395 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## [1] 395 33
Variabel kategorikal dikonversi menjadi factor agar
dapat diproses dengan benar oleh fungsi-fungsi pemodelan.
d1$school <- as.factor(d1$school)
d1$sex <- as.factor(d1$sex)
d1$address <- as.factor(d1$address)
d1$famsize <- as.factor(d1$famsize)
d1$Pstatus <- as.factor(d1$Pstatus)
d1$Mjob <- as.factor(d1$Mjob)
d1$Fjob <- as.factor(d1$Fjob)
d1$reason <- as.factor(d1$reason)
d1$guardian <- as.factor(d1$guardian)
d1$schoolsup <- as.factor(d1$schoolsup)
d1$famsup <- as.factor(d1$famsup)
d1$paid <- as.factor(d1$paid)
d1$activities <- as.factor(d1$activities)
d1$nursery <- as.factor(d1$nursery)
d1$higher <- as.factor(d1$higher)
d1$internet <- as.factor(d1$internet)
d1$romantic <- as.factor(d1$romantic)
str(d1)## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## school sex age address famsize Pstatus Medu
## GP:349 F:208 Min. :15.0 R: 88 GT3:281 A: 41 Min. :0.000
## MS: 46 M:187 1st Qu.:16.0 U:307 LE3:114 T:354 1st Qu.:2.000
## Median :17.0 Median :3.000
## Mean :16.7 Mean :2.749
## 3rd Qu.:18.0 3rd Qu.:4.000
## Max. :22.0 Max. :4.000
## Fedu Mjob Fjob reason guardian
## Min. :0.000 at_home : 59 at_home : 20 course :145 father: 90
## 1st Qu.:2.000 health : 34 health : 18 home :109 mother:273
## Median :2.000 other :141 other :217 other : 36 other : 32
## Mean :2.522 services:103 services:111 reputation:105
## 3rd Qu.:3.000 teacher : 58 teacher : 29
## Max. :4.000
## traveltime studytime failures schoolsup famsup paid
## Min. :1.000 Min. :1.000 Min. :0.0000 no :344 no :153 no :214
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:242 yes:181
## Median :1.000 Median :2.000 Median :0.0000
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## activities nursery higher internet romantic famrel
## no :194 no : 81 no : 20 no : 66 no :263 Min. :1.000
## yes:201 yes:314 yes:375 yes:329 yes:132 1st Qu.:4.000
## Median :4.000
## Mean :3.944
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## health absences G1 G2
## Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
## Median :4.000 Median : 4.000 Median :11.00 Median :11.00
## Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
## 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
## G3
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.42
## 3rd Qu.:14.00
## Max. :20.00
## school sex age address famsize Pstatus Medu
## 0 0 0 0 0 0 0
## Fedu Mjob Fjob reason guardian traveltime studytime
## 0 0 0 0 0 0 0
## failures schoolsup famsup paid activities nursery higher
## 0 0 0 0 0 0 0
## internet romantic famrel freetime goout Dalc Walc
## 0 0 0 0 0 0 0
## health absences G1 G2 G3
## 0 0 0 0 0
Tidak ditemukan nilai hilang (missing value) pada dataset ini. Total observasi: 395 siswa, 33 variabel.
Pertanyaan penelitian: Apakah pendidikan ayah dan ibu, pekerjaan ayah, jenis alamat, dan asal sekolah memengaruhi probabilitas kategori pekerjaan ibu siswa?
Variabel yang digunakan:
| Peran | Variabel | Keterangan |
|---|---|---|
| Respons (Y) | Mjob |
Pekerjaan ibu: at_home, health, other, services, teacher |
| Prediktor (X) | Medu |
Tingkat pendidikan ibu (0–4) |
Fedu |
Tingkat pendidikan ayah (0–4) | |
Fjob |
Pekerjaan ayah | |
address |
Jenis alamat (U = Urban, R = Rural) | |
school |
Asal sekolah (GP / MS) |
Alasan pemilihan model: Variabel Mjob
bersifat nominal — tidak ada urutan alami antar
kategorinya — sehingga tepat dimodelkan dengan regresi logistik
multinomial. Model menghasilkan log-odds tiap kategori relatif terhadap
baseline (at_home).
d1 %>%
count(Mjob) %>%
mutate(proporsi = n / sum(n)) %>%
kable(digits = 3, caption = "Distribusi pekerjaan ibu siswa (Mjob)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = FALSE)| Mjob | n | proporsi |
|---|---|---|
| at_home | 59 | 0.149 |
| health | 34 | 0.086 |
| other | 141 | 0.357 |
| services | 103 | 0.261 |
| teacher | 58 | 0.147 |
d1 %>%
count(Mjob) %>%
mutate(proporsi = n / sum(n)) %>%
ggplot(aes(x = Mjob, y = proporsi, fill = Mjob)) +
geom_col(width = 0.65, alpha = 0.94) +
geom_text(aes(label = percent(proporsi, accuracy = 0.1)),
vjust = -0.4, fontface = "bold") +
scale_y_continuous(labels = percent_format(), limits = c(0, 0.50)) +
scale_fill_manual(values = c("#3d6cb5","#5568b8","#7b8fc4","#a0aed4","#c5cfe7")) +
labs(title = "Distribusi Pekerjaan Ibu (Mjob)",
subtitle = "Respons multinomial: kategori tidak memiliki urutan alami.",
x = NULL, y = "Proporsi") +
theme_minimal() +
theme(legend.position = "none")Kategori terbanyak adalah other (35.7%), diikuti services (26.1%). Kategori health paling sedikit (8.6%). Ketidakseimbangan ini perlu diperhatikan dalam interpretasi.
fit_multi1 <- nnet::multinom(
Mjob ~ Medu + Fedu + Fjob + address + school,
data = d1,
trace = FALSE
)
summary(fit_multi1)## Call:
## nnet::multinom(formula = Mjob ~ Medu + Fedu + Fjob + address +
## school, data = d1, trace = FALSE)
##
## Coefficients:
## (Intercept) Medu Fedu Fjobhealth Fjobother Fjobservices
## health -23.625990 3.292148 -0.7507459 17.0765553 14.4795881 15.475192
## other -2.259777 1.076036 -0.4156804 0.7646605 1.4982127 1.043269
## services -3.497796 1.687781 -0.5296715 1.3216088 0.3314372 1.602217
## teacher -16.744749 5.251466 -0.7996462 3.4310301 1.2423027 2.823853
## Fjobteacher addressU schoolMS
## health 12.7754395 1.3007337 0.02253365
## other 0.9040003 0.7277724 0.44136542
## services 0.4615745 0.9633206 -0.19141750
## teacher 1.6020656 0.2982862 0.22807732
##
## Std. Errors:
## (Intercept) Medu Fedu Fjobhealth Fjobother Fjobservices
## health 1.2084714 0.4514986 0.3182512 0.940695 0.5388472 0.5788467
## other 0.8587155 0.2435597 0.2096895 1.237330 0.6775827 0.7267809
## services 0.9407783 0.2726429 0.2321159 1.192889 0.7164090 0.7498780
## teacher 2.7393358 0.7088261 0.3232948 1.603415 1.0966165 1.1392554
## Fjobteacher addressU schoolMS
## health 1.178502 0.7527853 0.9558460
## other 1.099339 0.3780216 0.5007841
## services 1.129457 0.4318906 0.5938807
## teacher 1.378014 0.6275073 0.8189497
##
## Residual Deviance: 883.4601
## AIC: 955.4601
RRR (Relative Risk Ratio) = exp(koefisien), mengukur berapa kali lebih mungkin suatu kategori dibanding baseline (at_home) untuk setiap kenaikan 1 satuan prediktor.
multi_sum1 <- summary(fit_multi1)
coef_multi1 <- as.data.frame(multi_sum1$coefficients)
se_multi1 <- as.data.frame(multi_sum1$standard.errors)
coef_long1 <- coef_multi1 %>%
tibble::rownames_to_column("kategori") %>%
tidyr::pivot_longer(cols = -kategori, names_to = "variabel", values_to = "estimate")
se_long1 <- se_multi1 %>%
tibble::rownames_to_column("kategori") %>%
tidyr::pivot_longer(cols = -kategori, names_to = "variabel", values_to = "std_error")
result_multi1 <- coef_long1 %>%
left_join(se_long1, by = c("kategori","variabel")) %>%
mutate(
z_value = estimate / std_error,
p_value = 2 * (1 - pnorm(abs(z_value))),
RRR = exp(estimate),
CI_low = exp(estimate - 1.96 * std_error),
CI_high = exp(estimate + 1.96 * std_error)
)
result_multi1 %>%
mutate(across(c(estimate, std_error, z_value, p_value, RRR, CI_low, CI_high),
~ round(.x, 4))) %>%
kable(caption = "Ringkasan hasil regresi logistik multinomial (Baseline: at_home)",
col.names = c("Kategori","Variabel","Estimate","SE",
"z-value","p-value","RRR","CI 2.5%","CI 97.5%")) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Kategori | Variabel | Estimate | SE | z-value | p-value | RRR | CI 2.5% | CI 97.5% |
|---|---|---|---|---|---|---|---|---|
| health | (Intercept) | -23.6260 | 1.2085 | -19.5503 | 0.0000 | 0.000000e+00 | 0.0000 | 0.000000e+00 |
| health | Medu | 3.2921 | 0.4515 | 7.2916 | 0.0000 | 2.690060e+01 | 11.1030 | 6.517560e+01 |
| health | Fedu | -0.7507 | 0.3183 | -2.3590 | 0.0183 | 4.720000e-01 | 0.2530 | 8.808000e-01 |
| health | Fjobhealth | 17.0766 | 0.9407 | 18.1531 | 0.0000 | 2.607677e+07 | 4125892.8743 | 1.648123e+08 |
| health | Fjobother | 14.4796 | 0.5388 | 26.8714 | 0.0000 | 1.942698e+06 | 675661.5187 | 5.585746e+06 |
| health | Fjobservices | 15.4752 | 0.5788 | 26.7345 | 0.0000 | 5.257638e+06 | 1690699.3888 | 1.634989e+07 |
| health | Fjobteacher | 12.7754 | 1.1785 | 10.8404 | 0.0000 | 3.534296e+05 | 35086.6167 | 3.560117e+06 |
| health | addressU | 1.3007 | 0.7528 | 1.7279 | 0.0840 | 3.672000e+00 | 0.8397 | 1.605780e+01 |
| health | schoolMS | 0.0225 | 0.9558 | 0.0236 | 0.9812 | 1.022800e+00 | 0.1571 | 6.659100e+00 |
| other | (Intercept) | -2.2598 | 0.8587 | -2.6316 | 0.0085 | 1.044000e-01 | 0.0194 | 5.618000e-01 |
| other | Medu | 1.0760 | 0.2436 | 4.4180 | 0.0000 | 2.933000e+00 | 1.8197 | 4.727600e+00 |
| other | Fedu | -0.4157 | 0.2097 | -1.9824 | 0.0474 | 6.599000e-01 | 0.4375 | 9.953000e-01 |
| other | Fjobhealth | 0.7647 | 1.2373 | 0.6180 | 0.5366 | 2.148300e+00 | 0.1900 | 2.428420e+01 |
| other | Fjobother | 1.4982 | 0.6776 | 2.2111 | 0.0270 | 4.473700e+00 | 1.1855 | 1.688250e+01 |
| other | Fjobservices | 1.0433 | 0.7268 | 1.4355 | 0.1512 | 2.838500e+00 | 0.6830 | 1.179600e+01 |
| other | Fjobteacher | 0.9040 | 1.0993 | 0.8223 | 0.4109 | 2.469500e+00 | 0.2863 | 2.130000e+01 |
| other | addressU | 0.7278 | 0.3780 | 1.9252 | 0.0542 | 2.070500e+00 | 0.9869 | 4.343600e+00 |
| other | schoolMS | 0.4414 | 0.5008 | 0.8813 | 0.3781 | 1.554800e+00 | 0.5826 | 4.149100e+00 |
| services | (Intercept) | -3.4978 | 0.9408 | -3.7180 | 0.0002 | 3.030000e-02 | 0.0048 | 1.913000e-01 |
| services | Medu | 1.6878 | 0.2726 | 6.1904 | 0.0000 | 5.407500e+00 | 3.1690 | 9.227200e+00 |
| services | Fedu | -0.5297 | 0.2321 | -2.2819 | 0.0225 | 5.888000e-01 | 0.3736 | 9.280000e-01 |
| services | Fjobhealth | 1.3216 | 1.1929 | 1.1079 | 0.2679 | 3.749400e+00 | 0.3619 | 3.884850e+01 |
| services | Fjobother | 0.3314 | 0.7164 | 0.4626 | 0.6436 | 1.393000e+00 | 0.3421 | 5.672300e+00 |
| services | Fjobservices | 1.6022 | 0.7499 | 2.1366 | 0.0326 | 4.964000e+00 | 1.1416 | 2.158460e+01 |
| services | Fjobteacher | 0.4616 | 1.1295 | 0.4087 | 0.6828 | 1.586600e+00 | 0.1734 | 1.451690e+01 |
| services | addressU | 0.9633 | 0.4319 | 2.2305 | 0.0257 | 2.620400e+00 | 1.1239 | 6.109400e+00 |
| services | schoolMS | -0.1914 | 0.5939 | -0.3223 | 0.7472 | 8.258000e-01 | 0.2578 | 2.644800e+00 |
| teacher | (Intercept) | -16.7447 | 2.7393 | -6.1127 | 0.0000 | 0.000000e+00 | 0.0000 | 0.000000e+00 |
| teacher | Medu | 5.2515 | 0.7088 | 7.4087 | 0.0000 | 1.908458e+02 | 47.5683 | 7.656806e+02 |
| teacher | Fedu | -0.7996 | 0.3233 | -2.4734 | 0.0134 | 4.495000e-01 | 0.2385 | 8.471000e-01 |
| teacher | Fjobhealth | 3.4310 | 1.6034 | 2.1398 | 0.0324 | 3.090850e+01 | 1.3342 | 7.160310e+02 |
| teacher | Fjobother | 1.2423 | 1.0966 | 1.1329 | 0.2573 | 3.463600e+00 | 0.4037 | 2.971560e+01 |
| teacher | Fjobservices | 2.8239 | 1.1393 | 2.4787 | 0.0132 | 1.684160e+01 | 1.8056 | 1.570861e+02 |
| teacher | Fjobteacher | 1.6021 | 1.3780 | 1.1626 | 0.2450 | 4.963300e+00 | 0.3333 | 7.391920e+01 |
| teacher | addressU | 0.2983 | 0.6275 | 0.4754 | 0.6345 | 1.347500e+00 | 0.3939 | 4.609900e+00 |
| teacher | schoolMS | 0.2281 | 0.8189 | 0.2785 | 0.7806 | 1.256200e+00 | 0.2523 | 6.254000e+00 |
## at_home health other services teacher
## 1 0.023461254 0.043429695 0.17569802 0.3033497 4.540613e-01
## 2 0.309801087 0.001539819 0.57969019 0.1089623 6.628279e-06
## 3 0.309801087 0.001539819 0.57969019 0.1089623 6.628279e-06
## 4 0.001706192 0.210882586 0.03372738 0.1990956 5.545882e-01
## 5 0.079784341 0.063934974 0.55925243 0.2844669 1.256137e-02
## 6 0.010817472 0.233189202 0.22239874 0.2085613 3.250333e-01
d1_pred1 <- d1 %>%
bind_cols(as.data.frame(pred_prob_multi1)) %>%
mutate(prediksi = predict(fit_multi1, type = "class"))
head(d1_pred1)| school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health…29 | absences | G1 | G2 | G3 | at_home | health…35 | other | services | teacher | prediksi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 | 0.0234613 | 0.0434297 | 0.1756980 | 0.3033497 | 0.4540613 | teacher |
| GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 | 0.3098011 | 0.0015398 | 0.5796902 | 0.1089623 | 0.0000066 | other |
| GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 3 | yes | no | yes | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 | 0.3098011 | 0.0015398 | 0.5796902 | 0.1089623 | 0.0000066 | other |
| GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | yes | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 | 0.0017062 | 0.2108826 | 0.0337274 | 0.1990956 | 0.5545882 | teacher |
| GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | yes | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 | 0.0797843 | 0.0639350 | 0.5592524 | 0.2844669 | 0.0125614 | other |
| GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | reputation | mother | 1 | 2 | 0 | no | yes | yes | yes | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 10 | 15 | 15 | 15 | 0.0108175 | 0.2331892 | 0.2223987 | 0.2085613 | 0.3250333 | teacher |
## Prediksi
## Aktual at_home health other services teacher
## at_home 22 0 28 8 1
## health 0 6 9 3 16
## other 15 1 90 16 19
## services 4 1 38 38 22
## teacher 0 4 4 2 48
## [1] 0.5164557
grid_multi1 <- expand.grid(
Medu = seq(min(d1$Medu), max(d1$Medu), length.out = 100),
Fedu = median(d1$Fedu),
Fjob = "other",
address = "U",
school = "GP"
)
grid_prob_multi1 <- predict(fit_multi1, newdata = grid_multi1, type = "probs")
grid_multi1_plot <- grid_multi1 %>%
bind_cols(as.data.frame(grid_prob_multi1)) %>%
tidyr::pivot_longer(cols = c("at_home","health","other","services","teacher"),
names_to = "Mjob", values_to = "probabilitas")
ggplot(grid_multi1_plot, aes(x = Medu, y = probabilitas, color = Mjob)) +
geom_line(linewidth = 1.35) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = c("#2f7f73","#5568b8","#d18b2f","#c0392b","#8e44ad")) +
labs(title = "Prediksi Probabilitas Pekerjaan Ibu (Mjob)",
subtitle = "Variabel lain ditahan pada nilai modus; Fjob = other, address = U, school = GP.",
x = "Pendidikan ibu (Medu)", y = "Probabilitas prediksi", color = "Pekerjaan Ibu") +
theme_minimal()Multikolinearitas diperiksa menggunakan VIF (Variance
Inflation Factor). Karena nnet::multinom() tidak
memiliki fungsi VIF langsung, digunakan model regresi linear bantu pada
tiap prediktor numerik.
# Model bantu menggunakan lm() untuk menghitung VIF prediktor numerik
vif_model <- lm(Medu ~ Fedu + address + school, data = d1)
car::vif(vif_model)## Fedu address school
## 1.009148 1.087887 1.089157
Panduan interpretasi VIF:
Medu dan Fedu berpotensi berkorelasi karena
sama-sama mengukur tingkat pendidikan orang tua. Jika VIF tinggi,
pertimbangkan untuk hanya menggunakan salah satu variabel.
IIA (Independence of Irrelevant Alternatives) adalah asumsi utama model multinomial logit: peluang relatif antara dua kategori tidak berubah meskipun kategori lain ditambahkan atau dihilangkan.
Implikasi pada studi kasus ini:
Model mengasumsikan bahwa peluang relatif ibu bekerja sebagai
health vs at_home tidak terpengaruh oleh ada
atau tidaknya kategori teacher. Asumsi ini bisa bermasalah
jika kategori-kategori pekerjaan saling mensubstitusi — misalnya
services dan other yang mungkin tumpang tindih
secara konseptual.
Cara deteksi: Uji Hausman-McFadden (tersedia di
paket mlogit). Untuk keperluan studi kasus ini, IIA
diasumsikan terpenuhi namun perlu diingat sebagai keterbatasan.
Medu (pendidikan ibu) — semakin tinggi
pendidikan ibu, semakin besar kemungkinan ia bekerja sebagai
teacher atau health dibanding
at_home.Fjob (pekerjaan ayah) — pekerjaan ayah
berkorelasi dengan pekerjaan ibu di beberapa kategori.health cenderung sulit diprediksi. RRR yang sangat ekstrem
(sangat besar atau sangat kecil) menandakan kemungkinan adanya
perfect separation atau ketidakseimbangan kelas yang kuat.⚠️ Catatan: Model multinomial dapat menghasilkan RRR
yang sangat ekstrem jika ada kategori dengan frekuensi sangat kecil
(seperti health = 8.6%). Interpretasi koefisien untuk
kategori tersebut harus dilakukan dengan hati-hati.
Kesimpulan: Model regresi logistik multinomial
berhasil diestimasi dengan lima kategori respons. Variabel pendidikan
ibu (Medu) dan pekerjaan ayah (Fjob) tampak
menjadi prediktor paling informatif dalam membedakan antar kategori
pekerjaan ibu.
Keterbatasan:
health hanya 8.6% dari data, sehingga prediksi untuk
kategori ini cenderung kurang akurat dan koefisiennya tidak stabil.Medu dan
Fedu berpotensi berkorelasi tinggi, sehingga interpretasi
koefisien masing-masing perlu dilakukan dengan hati-hati.Pertanyaan penelitian: Apakah waktu belajar, riwayat kegagalan, jumlah ketidakhadiran, dan akses internet memengaruhi probabilitas tingkat prestasi akademik siswa?
Variabel yang digunakan:
| Peran | Variabel | Keterangan |
|---|---|---|
| Respons (Y) | prestasi |
Dibuat dari G3: Rendah (0–9), Sedang (10–14), Tinggi
(15–20) |
| Prediktor (X) | studytime |
Waktu belajar per minggu (1–4) |
failures |
Jumlah kegagalan kelas sebelumnya (0–3) | |
absences |
Jumlah ketidakhadiran | |
internet |
Akses internet di rumah (yes/no) |
Alasan pemilihan model: Variabel
prestasi memiliki urutan alami (Rendah
< Sedang < Tinggi), sehingga tepat dimodelkan dengan regresi
logistik ordinal menggunakan MASS::polr(). Model ini
mengasumsikan proportional odds — efek prediktor sama di semua
batas kumulatif.
d1 <- d1 %>%
mutate(
prestasi = cut(G3,
breaks = c(-Inf, 9, 14, Inf),
labels = c("Rendah","Sedang","Tinggi"),
ordered_result = TRUE)
)
dplyr::glimpse(d1)## Rows: 395
## Columns: 34
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <fct> at_home, at_home, at_home, health, other, services, other, …
## $ Fjob <fct> teacher, other, other, services, other, other, other, teach…
## $ reason <fct> course, course, other, home, home, reputation, home, home, …
## $ guardian <fct> mother, father, mother, mother, father, mother, mother, mot…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…
## $ prestasi <ord> Rendah, Rendah, Sedang, Tinggi, Sedang, Tinggi, Sedang, Ren…
d1 %>%
count(prestasi) %>%
mutate(proporsi = n / sum(n)) %>%
kable(digits = 3, caption = "Distribusi tingkat prestasi akademik siswa") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = FALSE)| prestasi | n | proporsi |
|---|---|---|
| Rendah | 130 | 0.329 |
| Sedang | 192 | 0.486 |
| Tinggi | 73 | 0.185 |
d1 %>%
count(prestasi) %>%
mutate(proporsi = n / sum(n)) %>%
ggplot(aes(x = prestasi, y = proporsi, fill = prestasi)) +
geom_col(width = 0.65, alpha = 0.94) +
geom_text(aes(label = percent(proporsi, accuracy = 0.1)),
vjust = -0.4, fontface = "bold", color = "#243142") +
scale_y_continuous(labels = percent_format(), limits = c(0, 0.55)) +
scale_fill_manual(values = c("#2f7f73","#5568b8","#d18b2f")) +
labs(title = "Distribusi Tingkat Prestasi Akademik",
subtitle = "Respons ordinal: kategori memiliki urutan alami.",
x = NULL, y = "Proporsi") +
theme_minimal() +
theme(legend.position = "none")fit_ord <- MASS::polr(
prestasi ~ studytime + failures + absences + internet,
data = d1,
method = "logistic",
Hess = TRUE
)
summary(fit_ord)## Call:
## MASS::polr(formula = prestasi ~ studytime + failures + absences +
## internet, data = d1, Hess = TRUE, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## studytime 0.07819 0.11916 0.6561
## failures -0.98287 0.16796 -5.8520
## absences -0.02845 0.01283 -2.2173
## internetyes 0.51706 0.26453 1.9546
##
## Intercepts:
## Value Std. Error t value
## Rendah|Sedang -0.6725 0.3484 -1.9303
## Sedang|Tinggi 1.7450 0.3608 4.8365
##
## Residual Deviance: 753.7981
## AIC: 765.7981
Perhatikan dua konvensi tanda pada output polr():
ord_coef <- coef(summary(fit_ord))
result_ord <- as.data.frame(ord_coef) %>%
tibble::rownames_to_column("parameter") %>%
dplyr::rename(estimate = Value, std_error = `Std. Error`, t_value = `t value`) %>%
mutate(
p_value = 2 * (1 - pnorm(abs(t_value))),
jenis = ifelse(grepl("\\|", parameter), "Cutpoint", "Koefisien"),
estimate_slide = ifelse(jenis == "Koefisien", -estimate, estimate),
OR_polr = ifelse(jenis == "Koefisien", exp(estimate), NA_real_),
OR_slide = ifelse(jenis == "Koefisien", exp(estimate_slide), NA_real_),
CI_low_polr = ifelse(jenis == "Koefisien", exp(estimate - 1.96 * std_error), NA_real_),
CI_high_polr = ifelse(jenis == "Koefisien", exp(estimate + 1.96 * std_error), NA_real_)
)
result_ord %>%
mutate(across(c(estimate, estimate_slide, std_error, t_value, p_value,
OR_polr, OR_slide, CI_low_polr, CI_high_polr),
~ round(.x, 4))) %>%
kable(caption = "Ringkasan hasil regresi logistik ordinal",
col.names = c("Parameter","Estimate polr","SE","z/t-value","p-value",
"Jenis","Estimate slide","OR polr","OR slide",
"CI polr 2.5%","CI polr 97.5%")) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Parameter | Estimate polr | SE | z/t-value | p-value | Jenis | Estimate slide | OR polr | OR slide | CI polr 2.5% | CI polr 97.5% |
|---|---|---|---|---|---|---|---|---|---|---|
| studytime | 0.0782 | 0.1192 | 0.6561 | 0.5117 | Koefisien | -0.0782 | 1.0813 | 0.9248 | 0.8561 | 1.3658 |
| failures | -0.9829 | 0.1680 | -5.8520 | 0.0000 | Koefisien | 0.9829 | 0.3742 | 2.6721 | 0.2693 | 0.5201 |
| absences | -0.0284 | 0.0128 | -2.2173 | 0.0266 | Koefisien | 0.0284 | 0.9720 | 1.0289 | 0.9478 | 0.9967 |
| internetyes | 0.5171 | 0.2645 | 1.9546 | 0.0506 | Koefisien | -0.5171 | 1.6771 | 0.5963 | 0.9986 | 2.8166 |
| Rendah|Sedang | -0.6725 | 0.3484 | -1.9303 | 0.0536 | Cutpoint | -0.6725 | NA | NA | NA | NA |
| Sedang|Tinggi | 1.7450 | 0.3608 | 4.8365 | 0.0000 | Cutpoint | 1.7450 | NA | NA | NA | NA |
## Rendah Sedang Tinggi
## 1 0.3411401 0.5119855 0.14687441
## 2 0.2258012 0.5401113 0.23408758
## 3 0.8684273 0.1182474 0.01332537
## 4 0.2030637 0.5377672 0.25916918
## 5 0.3284704 0.5173821 0.15414751
## 6 0.2570244 0.5380933 0.20488230
d1_pred_ord <- d1 %>%
bind_cols(as.data.frame(pred_prob_ord)) %>%
mutate(prediksi = predict(fit_ord, type = "class"))
head(d1_pred_ord)| school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | prestasi | Rendah | Sedang | Tinggi | prediksi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 | Rendah | 0.3411401 | 0.5119855 | 0.1468744 | Sedang |
| GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 | Rendah | 0.2258012 | 0.5401113 | 0.2340876 | Sedang |
| GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 3 | yes | no | yes | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 | Sedang | 0.8684273 | 0.1182474 | 0.0133254 | Rendah |
| GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | yes | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 | Tinggi | 0.2030637 | 0.5377672 | 0.2591692 | Sedang |
| GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | yes | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 | Sedang | 0.3284704 | 0.5173821 | 0.1541475 | Sedang |
| GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | reputation | mother | 1 | 2 | 0 | no | yes | yes | yes | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 10 | 15 | 15 | 15 | Tinggi | 0.2570244 | 0.5380933 | 0.2048823 | Sedang |
## Prediksi
## Aktual Rendah Sedang Tinggi
## Rendah 46 84 0
## Sedang 25 167 0
## Tinggi 2 71 0
## [1] 0.5392405
grid_ord <- expand.grid(
studytime = seq(min(d1$studytime), max(d1$studytime), length.out = 120),
failures = mean(d1$failures),
absences = mean(d1$absences),
internet = "yes"
)
grid_prob_ord <- predict(fit_ord, newdata = grid_ord, type = "probs")
grid_ord_plot <- grid_ord %>%
bind_cols(as.data.frame(grid_prob_ord)) %>%
tidyr::pivot_longer(cols = c("Rendah","Sedang","Tinggi"),
names_to = "prestasi", values_to = "probabilitas")
ggplot(grid_ord_plot, aes(x = studytime, y = probabilitas, color = prestasi)) +
geom_line(linewidth = 1.25) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = c("#2f7f73","#5568b8","#d18b2f")) +
labs(title = "Prediksi Probabilitas Tingkat Prestasi Akademik",
subtitle = "Variabel lain ditahan pada nilai rata-rata; internet = yes.",
x = "Waktu belajar (studytime)", y = "Probabilitas prediksi", color = "Prestasi") +
theme_minimal()Plot ini digunakan untuk memverifikasi asumsi proportional odds secara visual: ketiga garis kumulatif seharusnya sejajar satu sama lain.
eps <- 1e-6
grid_ord_cumlogit <- grid_ord %>%
bind_cols(as.data.frame(grid_prob_ord)) %>%
mutate(
`P(Y <= Rendah)` = Rendah,
`P(Y <= Sedang)` = Rendah + Sedang,
`P(Y <= Tinggi)` = Rendah + Sedang + Tinggi
) %>%
tidyr::pivot_longer(cols = c(`P(Y <= Rendah)`, `P(Y <= Sedang)`, `P(Y <= Tinggi)`),
names_to = "batas_kumulatif",
values_to = "prob_kumulatif") %>%
mutate(prob_kumulatif = pmin(pmax(prob_kumulatif, eps), 1 - eps),
cumulative_logit = qlogis(prob_kumulatif))
ggplot(grid_ord_cumlogit,
aes(x = studytime, y = cumulative_logit, color = batas_kumulatif)) +
geom_line(linewidth = 1.25) +
scale_color_manual(values = c("P(Y <= Rendah)" = "#2f7f73",
"P(Y <= Sedang)" = "#d18b2f",
"P(Y <= Tinggi)" = "#5568b8")) +
labs(title = "Parallel Lines pada Model Ordinal",
subtitle = "Garis yang sejajar mengindikasikan asumsi proportional odds terpenuhi.",
x = "Waktu belajar (studytime)", y = "Logit peluang kumulatif",
color = "Batas kumulatif") +
theme_minimal()Brant Test menguji apakah asumsi proportional odds terpenuhi secara statistik. Jika p-value < 0.05 untuk suatu variabel, asumsi dilanggar untuk variabel tersebut.
Interpretasi Brant Test:
Uji formal menggunakan ordinal::nominal_test() —
membandingkan model dengan dan tanpa asumsi proportional odds untuk tiap
prediktor.
fit_clm <- ordinal::clm(
prestasi ~ studytime + failures + absences + internet,
data = d1,
link = "logit"
)
summary(fit_clm)## formula: prestasi ~ studytime + failures + absences + internet
## data: d1
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 395 -376.90 765.80 5(0) 5.05e-11 2.7e+03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## studytime 0.07819 0.11916 0.656 0.5117
## failures -0.98288 0.16796 -5.852 4.86e-09 ***
## absences -0.02845 0.01283 -2.217 0.0266 *
## internetyes 0.51706 0.26453 1.955 0.0506 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Rendah|Sedang -0.6725 0.3484 -1.930
## Sedang|Tinggi 1.7450 0.3608 4.836
| Df | logLik | AIC | LRT | Pr(>Chi) | |
|---|---|---|---|---|---|
| NA | -376.8990 | 765.7981 | NA | NA | |
| studytime | 1 | -376.5763 | 767.1527 | 0.6454078 | 0.4217593 |
| failures | 1 | -376.1940 | 766.3880 | 1.4100916 | 0.2350412 |
| absences | 1 | -375.5798 | 765.1596 | 2.6385036 | 0.1043017 |
| internet | 1 | -375.4776 | 764.9553 | 2.8427806 | 0.0917853 |
Interpretasi: Jika p-value nominal_test
untuk semua variabel > 0.05, asumsi proportional odds tidak
ditolak dan model ordinal dapat dipertahankan. Jika ada variabel yang
melanggar, pertimbangkan model ordinal generalisasi.
failures — prediktor paling kuat;
setiap tambahan 1 kegagalan sebelumnya secara signifikan menurunkan
peluang berada di kategori prestasi lebih tinggi.studytime — waktu belajar lebih banyak
meningkatkan peluang berada di kategori Sedang atau Tinggi.absences — ketidakhadiran lebih banyak
menurunkan prestasi, sesuai ekspektasi.internet — akses internet berpengaruh
positif, kemungkinan karena mendukung belajar mandiri.Kesimpulan: Model regresi logistik ordinal berhasil
mengidentifikasi faktor-faktor yang memengaruhi tingkat prestasi siswa
secara bertingkat. Riwayat kegagalan (failures) dan waktu
belajar (studytime) adalah prediktor paling konsisten
signifikan.
Keterbatasan:
nominal_test(). Jika dilanggar,
model parsial atau generalized ordinal lebih tepat digunakan.failures
dan absences berpotensi saling berkorelasi, yang dapat
memengaruhi kestabilan estimasi koefisien.Pertanyaan penelitian: Apakah waktu belajar, riwayat kegagalan, akses internet, dan kondisi kesehatan memengaruhi jumlah ketidakhadiran siswa?
Variabel yang digunakan:
| Peran | Variabel | Keterangan |
|---|---|---|
| Respons (Y) | absences |
Jumlah hari tidak hadir (bilangan cacah ≥ 0) |
| Prediktor (X) | studytime |
Waktu belajar per minggu (1–4) |
failures |
Jumlah kegagalan kelas sebelumnya (0–3) | |
internet |
Akses internet di rumah (yes/no) | |
health |
Status kesehatan siswa (1–5) |
Alasan pemilihan model: Variabel
absences merupakan data hitungan
(count data) — bilangan bulat non-negatif — yang secara alami
dimodelkan dengan distribusi Poisson menggunakan
glm(..., family = poisson(link = "log")).
d1 %>%
count(absences) %>%
mutate(proporsi = n / sum(n)) %>%
kable(digits = 3, caption = "Distribusi jumlah ketidakhadiran siswa (absences)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = FALSE)| absences | n | proporsi |
|---|---|---|
| 0 | 115 | 0.291 |
| 1 | 3 | 0.008 |
| 2 | 65 | 0.165 |
| 3 | 8 | 0.020 |
| 4 | 53 | 0.134 |
| 5 | 5 | 0.013 |
| 6 | 31 | 0.078 |
| 7 | 7 | 0.018 |
| 8 | 22 | 0.056 |
| 9 | 3 | 0.008 |
| 10 | 17 | 0.043 |
| 11 | 3 | 0.008 |
| 12 | 12 | 0.030 |
| 13 | 3 | 0.008 |
| 14 | 12 | 0.030 |
| 15 | 3 | 0.008 |
| 16 | 7 | 0.018 |
| 17 | 1 | 0.003 |
| 18 | 5 | 0.013 |
| 19 | 1 | 0.003 |
| 20 | 4 | 0.010 |
| 21 | 1 | 0.003 |
| 22 | 3 | 0.008 |
| 23 | 1 | 0.003 |
| 24 | 1 | 0.003 |
| 25 | 1 | 0.003 |
| 26 | 1 | 0.003 |
| 28 | 1 | 0.003 |
| 30 | 1 | 0.003 |
| 38 | 1 | 0.003 |
| 40 | 1 | 0.003 |
| 54 | 1 | 0.003 |
| 56 | 1 | 0.003 |
| 75 | 1 | 0.003 |
d1 %>%
count(absences) %>%
ggplot(aes(x = absences, y = n)) +
geom_col(width = 0.8, fill = "#d18b2f", alpha = 0.85) +
scale_x_continuous(breaks = scales::pretty_breaks()) +
labs(title = "Distribusi Jumlah Ketidakhadiran Siswa",
subtitle = "Respons Poisson berupa hitungan bilangan bulat nonnegatif.",
x = "Jumlah ketidakhadiran (absences)", y = "Frekuensi") +
theme_minimal()fit_pois <- glm(
absences ~ studytime + failures + internet + health,
data = d1,
family = poisson(link = "log")
)
summary(fit_pois)##
## Call:
## glm(formula = absences ~ studytime + failures + internet + health,
## family = poisson(link = "log"), data = d1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.61709 0.10108 15.999 < 2e-16 ***
## studytime -0.10422 0.02635 -3.955 7.64e-05 ***
## failures 0.10412 0.02617 3.979 6.93e-05 ***
## internetyes 0.46255 0.06657 6.948 3.70e-12 ***
## health -0.02924 0.01504 -1.945 0.0518 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3223.3 on 394 degrees of freedom
## Residual deviance: 3133.5 on 390 degrees of freedom
## AIC: 4152.9
##
## Number of Fisher Scoring iterations: 6
IRR (Incidence Rate Ratio) = exp(koefisien). IRR > 1 berarti prediktor meningkatkan jumlah ketidakhadiran; IRR < 1 berarti menurunkan.
pois_coef <- as.data.frame(coef(summary(fit_pois))) %>%
tibble::rownames_to_column("parameter") %>%
dplyr::rename(estimate = Estimate, std_error = `Std. Error`,
z_value = `z value`, p_value = `Pr(>|z|)`) %>%
mutate(
IRR = exp(estimate),
CI_low = exp(estimate - 1.96 * std_error),
CI_high = exp(estimate + 1.96 * std_error),
perubahan_persen = 100 * (IRR - 1)
)
pois_coef %>%
mutate(across(c(estimate, std_error, z_value, p_value,
IRR, CI_low, CI_high, perubahan_persen), ~ round(.x, 4))) %>%
kable(caption = "Ringkasan hasil regresi Poisson",
col.names = c("Parameter","Estimate","SE","z-value","p-value",
"IRR","CI 2.5%","CI 97.5%","Perubahan (%)")) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Parameter | Estimate | SE | z-value | p-value | IRR | CI 2.5% | CI 97.5% | Perubahan (%) |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 1.6171 | 0.1011 | 15.9988 | 0.0000 | 5.0384 | 4.1329 | 6.1423 | 403.8431 |
| studytime | -0.1042 | 0.0263 | -3.9554 | 0.0001 | 0.9010 | 0.8557 | 0.9488 | -9.8975 |
| failures | 0.1041 | 0.0262 | 3.9787 | 0.0001 | 1.1097 | 1.0542 | 1.1681 | 10.9733 |
| internetyes | 0.4626 | 0.0666 | 6.9483 | 0.0000 | 1.5881 | 1.3939 | 1.8095 | 58.8120 |
| health | -0.0292 | 0.0150 | -1.9448 | 0.0518 | 0.9712 | 0.9430 | 1.0002 | -2.8817 |
grid_pois <- expand.grid(
studytime = seq(min(d1$studytime), max(d1$studytime), length.out = 120),
failures = mean(d1$failures),
internet = "yes",
health = mean(d1$health)
)
pred_pois <- predict(fit_pois, newdata = grid_pois, type = "link", se.fit = TRUE)
grid_pois_plot <- grid_pois %>%
mutate(fit_link = pred_pois$fit,
se_link = pred_pois$se.fit,
rate = exp(fit_link),
rate_low = exp(fit_link - 1.96 * se_link),
rate_high = exp(fit_link + 1.96 * se_link))
ggplot(grid_pois_plot, aes(x = studytime, y = rate)) +
geom_ribbon(aes(ymin = rate_low, ymax = rate_high), fill = "#d18b2f", alpha = 0.16) +
geom_line(color = "#d18b2f", linewidth = 1.35) +
labs(title = "Prediksi Jumlah Ketidakhadiran Siswa",
subtitle = "Variabel lain ditahan pada nilai rata-rata; internet = yes.",
x = "Waktu belajar (studytime)", y = "Prediksi jumlah ketidakhadiran") +
theme_minimal()Poisson mengasumsikan mean = variance. Rasio dispersi Pearson mengukur seberapa jauh asumsi ini terpenuhi.
dispersion_pois <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
tibble::tibble(
`Dispersion Pearson` = dispersion_pois,
`Interpretasi` = dplyr::case_when(
dispersion_pois < 1.5 ~ "Tidak ada indikasi overdispersion berat",
dispersion_pois < 2.5 ~ "Ada indikasi overdispersion sedang",
TRUE ~ "Ada indikasi overdispersion kuat"
)
) %>%
mutate(`Dispersion Pearson` = round(`Dispersion Pearson`, 3)) %>%
kable(caption = "Indikasi overdispersion pada model Poisson") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = FALSE)| Dispersion Pearson | Interpretasi |
|---|---|
| 10.775 | Ada indikasi overdispersion kuat |
zero_share <- mean(d1$absences == 0)
tibble::tibble(
`Proporsi Y = 0` = zero_share,
`Catatan` = ifelse(
zero_share > 0,
"OLS pada log(Y) akan kehilangan observasi nol atau perlu transformasi log(Y+c).",
"Tidak ada nilai nol, tetapi asumsi residual log-scale tetap perlu diperiksa."
)
) %>%
mutate(`Proporsi Y = 0` = percent(`Proporsi Y = 0`, accuracy = 0.1)) %>%
kable(caption = "Konsekuensi nilai nol untuk model log(Y) dengan OLS") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = FALSE)| Proporsi Y = 0 | Catatan |
|---|---|
| 29.1% | OLS pada log(Y) akan kehilangan observasi nol atau perlu transformasi log(Y+c). |
Ringkasan kondisi absences pada data
ini:
Rasio dispersi 11.2 jauh di atas 1, mengindikasikan overdispersion kuat.
Keterbatasan model Poisson pada data ini:
Model Poisson mengasumsikan mean = variance. Pada data
absences, asumsi ini dilanggar secara serius (rasio ≈
11.2). Akibatnya:
Negative Binomial regression lebih sesuai karena:
Zero-Inflated Poisson (ZIP) bisa dipertimbangkan karena 29.1% siswa memiliki absences = 0, yang mungkin merupakan “structural zeros” (siswa yang memang tidak pernah absen karena alasan sistemik, bukan hanya kebetulan).
Untuk keperluan studi kasus ini, model Poisson tetap dijalankan dan dilaporkan, namun kesimpulan harus disampaikan dengan hati-hati.
d1_positive <- d1 %>% filter(absences > 0)
fit_log_ols <- lm(
log(absences) ~ studytime + failures + internet + health,
data = d1_positive
)
log_ols_coef <- as.data.frame(coef(summary(fit_log_ols))) %>%
tibble::rownames_to_column("parameter") %>%
dplyr::rename(estimate = Estimate, std_error = `Std. Error`,
t_value = `t value`, p_value = `Pr(>|t|)`) %>%
mutate(multiplier = exp(estimate), perubahan_persen = 100 * (multiplier - 1))
log_ols_coef %>%
mutate(across(c(estimate, std_error, t_value, p_value,
multiplier, perubahan_persen), ~ round(.x, 4))) %>%
kable(caption = "Ilustrasi OLS pada log(absences) untuk observasi positif saja",
col.names = c("Parameter","Estimate","SE","t-value","p-value",
"exp(Estimate)","Perubahan (%)")) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Parameter | Estimate | SE | t-value | p-value | exp(Estimate) | Perubahan (%) |
|---|---|---|---|---|---|---|
| (Intercept) | 1.6199 | 0.2103 | 7.7010 | 0.0000 | 5.0525 | 405.2489 |
| studytime | -0.0835 | 0.0587 | -1.4240 | 0.1556 | 0.9199 | -8.0148 |
| failures | 0.2027 | 0.0692 | 2.9304 | 0.0037 | 1.2247 | 22.4732 |
| internetyes | 0.3709 | 0.1288 | 2.8794 | 0.0043 | 1.4490 | 44.9014 |
| health | -0.0269 | 0.0349 | -0.7705 | 0.4416 | 0.9735 | -2.6519 |
tibble::tibble(
Aspek = c("Jenis respons","Nilai nol","Target model","Bentuk mean",
"Interpretasi","Prediksi","Exposure","Kapan lebih tepat?"),
`Regresi Poisson` = c(
"Hitungan nonnegatif",
"Dapat langsung menangani Y = 0",
"E(Y | X)",
"log{E(Y | X)} = X beta",
"exp(beta) sebagai incidence rate ratio",
"Selalu nonnegatif",
"Offset log(exposure) sangat alami",
"Count data, banyak nol, rate kejadian, atau target mean hitungan"),
`OLS pada log(Y)` = c(
"Kontinu positif atau hitungan besar tanpa nol",
"Tidak dapat memakai Y = 0 tanpa modifikasi",
"E(log Y | X)",
"log(Y) = X beta + error",
"exp(beta) sebagai multiplier pada geometric mean",
"Perlu retransformation untuk kembali ke skala Y",
"Biasanya memakai log(rate) atau kontrol exposure manual",
"Y positif, log-scale residual baik, dan fokus pada perubahan persentase")
) %>%
kable(caption = "Perbandingan regresi Poisson dan OLS pada log(Y)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE)| Aspek | Regresi Poisson | OLS pada log(Y) |
|---|---|---|
| Jenis respons | Hitungan nonnegatif | Kontinu positif atau hitungan besar tanpa nol |
| Nilai nol | Dapat langsung menangani Y = 0 | Tidak dapat memakai Y = 0 tanpa modifikasi |
| Target model | E(Y | X) | E(log Y | X) |
| Bentuk mean | log{E(Y | X)} = X beta | log(Y) = X beta + error |
| Interpretasi | exp(beta) sebagai incidence rate ratio | exp(beta) sebagai multiplier pada geometric mean |
| Prediksi | Selalu nonnegatif | Perlu retransformation untuk kembali ke skala Y |
| Exposure | Offset log(exposure) sangat alami | Biasanya memakai log(rate) atau kontrol exposure manual |
| Kapan lebih tepat? | Count data, banyak nol, rate kejadian, atau target mean hitungan | Y positif, log-scale residual baik, dan fokus pada perubahan persentase |
studytime — semakin banyak waktu
belajar, semakin sedikit ketidakhadiran (IRR < 1).failures — riwayat kegagalan
berhubungan positif dengan absensi (IRR > 1).internet dan
health — kondisi kesehatan dan akses
internet berkaitan dengan pola kehadiran.⚠️ Peringatan penting: Overdispersion Pearson ≈ 11.2 menandakan bahwa p-value dan interval kepercayaan pada tabel koefisien tidak dapat diandalkan sepenuhnya. Kesimpulan tentang signifikansi variabel harus disampaikan dengan hati-hati. Model Negative Binomial direkomendasikan sebagai analisis lanjutan.
Kesimpulan: Model regresi Poisson berhasil diestimasi dan mengidentifikasi arah pengaruh tiap prediktor terhadap jumlah ketidakhadiran siswa. Namun, adanya overdispersion kuat mengharuskan interpretasi hasil dilakukan dengan kehati-hatian ekstra.
Keterbatasan:
absences
maksimum 75 hari sangat memengaruhi estimasi. Perlu diperiksa apakah
observasi ini valid.| Multinomial | Ordinal | Poisson | |
|---|---|---|---|
| Variabel Y | Mjob (nominal) |
prestasi (ordinal) |
absences (count) |
| Fungsi utama | nnet::multinom() |
MASS::polr() |
glm(..., poisson) |
| Interpretasi | RRR vs baseline | OR kumulatif / slide | IRR |
| Evaluasi asumsi | VIF + IIA | Brant Test + Parallel Lines | Dispersion Pearson |
| Catatan utama | RRR ekstrem, ketidakseimbangan kelas | Uji proportional odds wajib | Overdispersion kuat → NB lebih tepat |
Dataset: UCI Student Performance Dataset (student-mat.csv).