1 Pendahuluan

Dokumen ini menyajikan tiga studi kasus analisis regresi yang diterapkan pada satu dataset yang sama, yaitu data prestasi siswa matematika (student-mat.csv). Dataset ini berisi informasi demografis, sosial, dan akademik dari 395 siswa sekolah menengah di Portugal.

Ketiga model dipilih berdasarkan jenis variabel respons yang dianalisis:

Studi Kasus	Model	Variabel Y	Jenis Y
1	Regresi Logistik Multinomial	Pekerjaan ibu (`Mjob`)	Nominal (5 kategori)
2	Regresi Logistik Ordinal	Tingkat prestasi (`prestasi`)	Ordinal (3 kategori)
3	Regresi Poisson	Jumlah ketidakhadiran (`absences`)	Count (bilangan cacah)

2 Persiapan Data

2.1 Import dan Struktur Data

d1 <- read.csv("student-mat.csv", sep = ";", header = TRUE)
head(d1)

school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian	traveltime	studytime	failures	schoolsup	famsup	paid	activities	nursery	higher	internet	romantic	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
GP	F	18	U	GT3	A	4	4	at_home	teacher	course	mother	2	2	0	yes	no	no	no	yes	yes	no	no	4	3	4	1	1	3	6	5	6	6
GP	F	17	U	GT3	T	1	1	at_home	other	course	father	1	2	0	no	yes	no	no	no	yes	yes	no	5	3	3	1	1	3	4	5	5	6
GP	F	15	U	LE3	T	1	1	at_home	other	other	mother	1	2	3	yes	no	yes	no	yes	yes	yes	no	4	3	2	2	3	3	10	7	8	10
GP	F	15	U	GT3	T	4	2	health	services	home	mother	1	3	0	no	yes	yes	yes	yes	yes	yes	yes	3	2	2	1	1	5	2	15	14	15
GP	F	16	U	GT3	T	3	3	other	other	home	father	1	2	0	no	yes	yes	no	yes	yes	no	no	4	3	2	1	2	5	4	6	10	10
GP	M	16	U	LE3	T	4	3	services	other	reputation	mother	1	2	0	no	yes	yes	yes	yes	yes	yes	no	5	4	2	1	2	5	10	15	15	15

str(d1)

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

dim(d1)

## [1] 395  33

2.2 Konversi Tipe Variabel

Variabel kategorikal dikonversi menjadi factor agar dapat diproses dengan benar oleh fungsi-fungsi pemodelan.

d1$school     <- as.factor(d1$school)
d1$sex        <- as.factor(d1$sex)
d1$address    <- as.factor(d1$address)
d1$famsize    <- as.factor(d1$famsize)
d1$Pstatus    <- as.factor(d1$Pstatus)
d1$Mjob       <- as.factor(d1$Mjob)
d1$Fjob       <- as.factor(d1$Fjob)
d1$reason     <- as.factor(d1$reason)
d1$guardian   <- as.factor(d1$guardian)
d1$schoolsup  <- as.factor(d1$schoolsup)
d1$famsup     <- as.factor(d1$famsup)
d1$paid       <- as.factor(d1$paid)
d1$activities <- as.factor(d1$activities)
d1$nursery    <- as.factor(d1$nursery)
d1$higher     <- as.factor(d1$higher)
d1$internet   <- as.factor(d1$internet)
d1$romantic   <- as.factor(d1$romantic)

str(d1)

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

2.3 Eksplorasi Awal

summary(d1)

##  school   sex          age       address famsize   Pstatus      Medu      
##  GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
##  MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
##                   Median :17.0                             Median :3.000  
##                   Mean   :16.7                             Mean   :2.749  
##                   3rd Qu.:18.0                             3rd Qu.:4.000  
##                   Max.   :22.0                             Max.   :4.000  
##       Fedu             Mjob           Fjob            reason      guardian  
##  Min.   :0.000   at_home : 59   at_home : 20   course    :145   father: 90  
##  1st Qu.:2.000   health  : 34   health  : 18   home      :109   mother:273  
##  Median :2.000   other   :141   other   :217   other     : 36   other : 32  
##  Mean   :2.522   services:103   services:111   reputation:105               
##  3rd Qu.:3.000   teacher : 58   teacher : 29                                
##  Max.   :4.000                                                              
##    traveltime      studytime        failures      schoolsup famsup     paid    
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   no :344   no :153   no :214  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:242   yes:181  
##  Median :1.000   Median :2.000   Median :0.0000                                
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                                
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                                
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                                
##  activities nursery   higher    internet  romantic      famrel     
##  no :194    no : 81   no : 20   no : 66   no :263   Min.   :1.000  
##  yes:201    yes:314   yes:375   yes:329   yes:132   1st Qu.:4.000  
##                                                     Median :4.000  
##                                                     Mean   :3.944  
##                                                     3rd Qu.:5.000  
##                                                     Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00

colSums(is.na(d1))

##     school        sex        age    address    famsize    Pstatus       Medu 
##          0          0          0          0          0          0          0 
##       Fedu       Mjob       Fjob     reason   guardian traveltime  studytime 
##          0          0          0          0          0          0          0 
##   failures  schoolsup     famsup       paid activities    nursery     higher 
##          0          0          0          0          0          0          0 
##   internet   romantic     famrel   freetime      goout       Dalc       Walc 
##          0          0          0          0          0          0          0 
##     health   absences         G1         G2         G3 
##          0          0          0          0          0

Tidak ditemukan nilai hilang (missing value) pada dataset ini. Total observasi: 395 siswa, 33 variabel.

3 Studi Kasus 1: Regresi Logistik Multinomial

3.1 Latar Belakang

Pertanyaan penelitian: Apakah pendidikan ayah dan ibu, pekerjaan ayah, jenis alamat, dan asal sekolah memengaruhi probabilitas kategori pekerjaan ibu siswa?

Variabel yang digunakan:

Peran	Variabel	Keterangan
Respons (Y)	`Mjob`	Pekerjaan ibu: at_home, health, other, services, teacher
Prediktor (X)	`Medu`	Tingkat pendidikan ibu (0–4)
	`Fedu`	Tingkat pendidikan ayah (0–4)
	`Fjob`	Pekerjaan ayah
	`address`	Jenis alamat (U = Urban, R = Rural)
	`school`	Asal sekolah (GP / MS)

Alasan pemilihan model: Variabel Mjob bersifat nominal — tidak ada urutan alami antar kategorinya — sehingga tepat dimodelkan dengan regresi logistik multinomial. Model menghasilkan log-odds tiap kategori relatif terhadap baseline (at_home).

3.2 Distribusi Variabel Respons

d1 %>%
  count(Mjob) %>%
  mutate(proporsi = n / sum(n)) %>%
  kable(digits = 3, caption = "Distribusi pekerjaan ibu siswa (Mjob)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = FALSE)

Distribusi pekerjaan ibu siswa (Mjob)
Mjob	n	proporsi
at_home	59	0.149
health	34	0.086
other	141	0.357
services	103	0.261
teacher	58	0.147

d1 %>%
  count(Mjob) %>%
  mutate(proporsi = n / sum(n)) %>%
  ggplot(aes(x = Mjob, y = proporsi, fill = Mjob)) +
  geom_col(width = 0.65, alpha = 0.94) +
  geom_text(aes(label = percent(proporsi, accuracy = 0.1)),
            vjust = -0.4, fontface = "bold") +
  scale_y_continuous(labels = percent_format(), limits = c(0, 0.50)) +
  scale_fill_manual(values = c("#3d6cb5","#5568b8","#7b8fc4","#a0aed4","#c5cfe7")) +
  labs(title    = "Distribusi Pekerjaan Ibu (Mjob)",
       subtitle = "Respons multinomial: kategori tidak memiliki urutan alami.",
       x = NULL, y = "Proporsi") +
  theme_minimal() +
  theme(legend.position = "none")

Kategori terbanyak adalah other (35.7%), diikuti services (26.1%). Kategori health paling sedikit (8.6%). Ketidakseimbangan ini perlu diperhatikan dalam interpretasi.

3.3 Estimasi Model

fit_multi1 <- nnet::multinom(
  Mjob ~ Medu + Fedu + Fjob + address + school,
  data  = d1,
  trace = FALSE
)

summary(fit_multi1)

## Call:
## nnet::multinom(formula = Mjob ~ Medu + Fedu + Fjob + address + 
##     school, data = d1, trace = FALSE)
## 
## Coefficients:
##          (Intercept)     Medu       Fedu Fjobhealth  Fjobother Fjobservices
## health    -23.625990 3.292148 -0.7507459 17.0765553 14.4795881    15.475192
## other      -2.259777 1.076036 -0.4156804  0.7646605  1.4982127     1.043269
## services   -3.497796 1.687781 -0.5296715  1.3216088  0.3314372     1.602217
## teacher   -16.744749 5.251466 -0.7996462  3.4310301  1.2423027     2.823853
##          Fjobteacher  addressU    schoolMS
## health    12.7754395 1.3007337  0.02253365
## other      0.9040003 0.7277724  0.44136542
## services   0.4615745 0.9633206 -0.19141750
## teacher    1.6020656 0.2982862  0.22807732
## 
## Std. Errors:
##          (Intercept)      Medu      Fedu Fjobhealth Fjobother Fjobservices
## health     1.2084714 0.4514986 0.3182512   0.940695 0.5388472    0.5788467
## other      0.8587155 0.2435597 0.2096895   1.237330 0.6775827    0.7267809
## services   0.9407783 0.2726429 0.2321159   1.192889 0.7164090    0.7498780
## teacher    2.7393358 0.7088261 0.3232948   1.603415 1.0966165    1.1392554
##          Fjobteacher  addressU  schoolMS
## health      1.178502 0.7527853 0.9558460
## other       1.099339 0.3780216 0.5007841
## services    1.129457 0.4318906 0.5938807
## teacher     1.378014 0.6275073 0.8189497
## 
## Residual Deviance: 883.4601 
## AIC: 955.4601

3.4 Tabel Koefisien, RRR, dan Confidence Interval

RRR (Relative Risk Ratio) = exp(koefisien), mengukur berapa kali lebih mungkin suatu kategori dibanding baseline (at_home) untuk setiap kenaikan 1 satuan prediktor.

multi_sum1  <- summary(fit_multi1)
coef_multi1 <- as.data.frame(multi_sum1$coefficients)
se_multi1   <- as.data.frame(multi_sum1$standard.errors)

coef_long1 <- coef_multi1 %>%
  tibble::rownames_to_column("kategori") %>%
  tidyr::pivot_longer(cols = -kategori, names_to = "variabel", values_to = "estimate")

se_long1 <- se_multi1 %>%
  tibble::rownames_to_column("kategori") %>%
  tidyr::pivot_longer(cols = -kategori, names_to = "variabel", values_to = "std_error")

result_multi1 <- coef_long1 %>%
  left_join(se_long1, by = c("kategori","variabel")) %>%
  mutate(
    z_value = estimate / std_error,
    p_value = 2 * (1 - pnorm(abs(z_value))),
    RRR     = exp(estimate),
    CI_low  = exp(estimate - 1.96 * std_error),
    CI_high = exp(estimate + 1.96 * std_error)
  )

result_multi1 %>%
  mutate(across(c(estimate, std_error, z_value, p_value, RRR, CI_low, CI_high),
                ~ round(.x, 4))) %>%
  kable(caption   = "Ringkasan hasil regresi logistik multinomial (Baseline: at_home)",
        col.names = c("Kategori","Variabel","Estimate","SE",
                      "z-value","p-value","RRR","CI 2.5%","CI 97.5%")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = TRUE)

Ringkasan hasil regresi logistik multinomial (Baseline: at_home)
Kategori	Variabel	Estimate	SE	z-value	p-value	RRR	CI 2.5%	CI 97.5%
health	(Intercept)	-23.6260	1.2085	-19.5503	0.0000	0.000000e+00	0.0000	0.000000e+00
health	Medu	3.2921	0.4515	7.2916	0.0000	2.690060e+01	11.1030	6.517560e+01
health	Fedu	-0.7507	0.3183	-2.3590	0.0183	4.720000e-01	0.2530	8.808000e-01
health	Fjobhealth	17.0766	0.9407	18.1531	0.0000	2.607677e+07	4125892.8743	1.648123e+08
health	Fjobother	14.4796	0.5388	26.8714	0.0000	1.942698e+06	675661.5187	5.585746e+06
health	Fjobservices	15.4752	0.5788	26.7345	0.0000	5.257638e+06	1690699.3888	1.634989e+07
health	Fjobteacher	12.7754	1.1785	10.8404	0.0000	3.534296e+05	35086.6167	3.560117e+06
health	addressU	1.3007	0.7528	1.7279	0.0840	3.672000e+00	0.8397	1.605780e+01
health	schoolMS	0.0225	0.9558	0.0236	0.9812	1.022800e+00	0.1571	6.659100e+00
other	(Intercept)	-2.2598	0.8587	-2.6316	0.0085	1.044000e-01	0.0194	5.618000e-01
other	Medu	1.0760	0.2436	4.4180	0.0000	2.933000e+00	1.8197	4.727600e+00
other	Fedu	-0.4157	0.2097	-1.9824	0.0474	6.599000e-01	0.4375	9.953000e-01
other	Fjobhealth	0.7647	1.2373	0.6180	0.5366	2.148300e+00	0.1900	2.428420e+01
other	Fjobother	1.4982	0.6776	2.2111	0.0270	4.473700e+00	1.1855	1.688250e+01
other	Fjobservices	1.0433	0.7268	1.4355	0.1512	2.838500e+00	0.6830	1.179600e+01
other	Fjobteacher	0.9040	1.0993	0.8223	0.4109	2.469500e+00	0.2863	2.130000e+01
other	addressU	0.7278	0.3780	1.9252	0.0542	2.070500e+00	0.9869	4.343600e+00
other	schoolMS	0.4414	0.5008	0.8813	0.3781	1.554800e+00	0.5826	4.149100e+00
services	(Intercept)	-3.4978	0.9408	-3.7180	0.0002	3.030000e-02	0.0048	1.913000e-01
services	Medu	1.6878	0.2726	6.1904	0.0000	5.407500e+00	3.1690	9.227200e+00
services	Fedu	-0.5297	0.2321	-2.2819	0.0225	5.888000e-01	0.3736	9.280000e-01
services	Fjobhealth	1.3216	1.1929	1.1079	0.2679	3.749400e+00	0.3619	3.884850e+01
services	Fjobother	0.3314	0.7164	0.4626	0.6436	1.393000e+00	0.3421	5.672300e+00
services	Fjobservices	1.6022	0.7499	2.1366	0.0326	4.964000e+00	1.1416	2.158460e+01
services	Fjobteacher	0.4616	1.1295	0.4087	0.6828	1.586600e+00	0.1734	1.451690e+01
services	addressU	0.9633	0.4319	2.2305	0.0257	2.620400e+00	1.1239	6.109400e+00
services	schoolMS	-0.1914	0.5939	-0.3223	0.7472	8.258000e-01	0.2578	2.644800e+00
teacher	(Intercept)	-16.7447	2.7393	-6.1127	0.0000	0.000000e+00	0.0000	0.000000e+00
teacher	Medu	5.2515	0.7088	7.4087	0.0000	1.908458e+02	47.5683	7.656806e+02
teacher	Fedu	-0.7996	0.3233	-2.4734	0.0134	4.495000e-01	0.2385	8.471000e-01
teacher	Fjobhealth	3.4310	1.6034	2.1398	0.0324	3.090850e+01	1.3342	7.160310e+02
teacher	Fjobother	1.2423	1.0966	1.1329	0.2573	3.463600e+00	0.4037	2.971560e+01
teacher	Fjobservices	2.8239	1.1393	2.4787	0.0132	1.684160e+01	1.8056	1.570861e+02
teacher	Fjobteacher	1.6021	1.3780	1.1626	0.2450	4.963300e+00	0.3333	7.391920e+01
teacher	addressU	0.2983	0.6275	0.4754	0.6345	1.347500e+00	0.3939	4.609900e+00
teacher	schoolMS	0.2281	0.8189	0.2785	0.7806	1.256200e+00	0.2523	6.254000e+00

3.5 Prediksi dan Evaluasi Model

pred_prob_multi1 <- predict(fit_multi1, type = "probs")
head(pred_prob_multi1)

##       at_home      health      other  services      teacher
## 1 0.023461254 0.043429695 0.17569802 0.3033497 4.540613e-01
## 2 0.309801087 0.001539819 0.57969019 0.1089623 6.628279e-06
## 3 0.309801087 0.001539819 0.57969019 0.1089623 6.628279e-06
## 4 0.001706192 0.210882586 0.03372738 0.1990956 5.545882e-01
## 5 0.079784341 0.063934974 0.55925243 0.2844669 1.256137e-02
## 6 0.010817472 0.233189202 0.22239874 0.2085613 3.250333e-01

d1_pred1 <- d1 %>%
  bind_cols(as.data.frame(pred_prob_multi1)) %>%
  mutate(prediksi = predict(fit_multi1, type = "class"))

head(d1_pred1)

school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian	traveltime	studytime	failures	schoolsup	famsup	paid	activities	nursery	higher	internet	romantic	famrel	freetime	goout	Dalc	Walc	health…29	absences	G1	G2	G3	at_home	health…35	other	services	teacher	prediksi
GP	F	18	U	GT3	A	4	4	at_home	teacher	course	mother	2	2	0	yes	no	no	no	yes	yes	no	no	4	3	4	1	1	3	6	5	6	6	0.0234613	0.0434297	0.1756980	0.3033497	0.4540613	teacher
GP	F	17	U	GT3	T	1	1	at_home	other	course	father	1	2	0	no	yes	no	no	no	yes	yes	no	5	3	3	1	1	3	4	5	5	6	0.3098011	0.0015398	0.5796902	0.1089623	0.0000066	other
GP	F	15	U	LE3	T	1	1	at_home	other	other	mother	1	2	3	yes	no	yes	no	yes	yes	yes	no	4	3	2	2	3	3	10	7	8	10	0.3098011	0.0015398	0.5796902	0.1089623	0.0000066	other
GP	F	15	U	GT3	T	4	2	health	services	home	mother	1	3	0	no	yes	yes	yes	yes	yes	yes	yes	3	2	2	1	1	5	2	15	14	15	0.0017062	0.2108826	0.0337274	0.1990956	0.5545882	teacher
GP	F	16	U	GT3	T	3	3	other	other	home	father	1	2	0	no	yes	yes	no	yes	yes	no	no	4	3	2	1	2	5	4	6	10	10	0.0797843	0.0639350	0.5592524	0.2844669	0.0125614	other
GP	M	16	U	LE3	T	4	3	services	other	reputation	mother	1	2	0	no	yes	yes	yes	yes	yes	yes	no	5	4	2	1	2	5	10	15	15	15	0.0108175	0.2331892	0.2223987	0.2085613	0.3250333	teacher

conf_multi1 <- table(Aktual = d1_pred1$Mjob, Prediksi = d1_pred1$prediksi)
conf_multi1

##           Prediksi
## Aktual     at_home health other services teacher
##   at_home       22      0    28        8       1
##   health         0      6     9        3      16
##   other         15      1    90       16      19
##   services       4      1    38       38      22
##   teacher        0      4     4        2      48

accuracy_multi1 <- sum(diag(conf_multi1)) / sum(conf_multi1)
accuracy_multi1

## [1] 0.5164557

3.6 Visualisasi Prediksi Probabilitas

grid_multi1 <- expand.grid(
  Medu    = seq(min(d1$Medu), max(d1$Medu), length.out = 100),
  Fedu    = median(d1$Fedu),
  Fjob    = "other",
  address = "U",
  school  = "GP"
)

grid_prob_multi1 <- predict(fit_multi1, newdata = grid_multi1, type = "probs")

grid_multi1_plot <- grid_multi1 %>%
  bind_cols(as.data.frame(grid_prob_multi1)) %>%
  tidyr::pivot_longer(cols = c("at_home","health","other","services","teacher"),
                      names_to = "Mjob", values_to = "probabilitas")

ggplot(grid_multi1_plot, aes(x = Medu, y = probabilitas, color = Mjob)) +
  geom_line(linewidth = 1.35) +
  scale_y_continuous(labels = percent_format()) +
  scale_color_manual(values = c("#2f7f73","#5568b8","#d18b2f","#c0392b","#8e44ad")) +
  labs(title    = "Prediksi Probabilitas Pekerjaan Ibu (Mjob)",
       subtitle = "Variabel lain ditahan pada nilai modus; Fjob = other, address = U, school = GP.",
       x = "Pendidikan ibu (Medu)", y = "Probabilitas prediksi", color = "Pekerjaan Ibu") +
  theme_minimal()

3.7 Evaluasi Asumsi Model

3.7.1 Multikolinearitas

Multikolinearitas diperiksa menggunakan VIF (Variance Inflation Factor). Karena nnet::multinom() tidak memiliki fungsi VIF langsung, digunakan model regresi linear bantu pada tiap prediktor numerik.

# Model bantu menggunakan lm() untuk menghitung VIF prediktor numerik
vif_model <- lm(Medu ~ Fedu + address + school, data = d1)
car::vif(vif_model)

##     Fedu  address   school 
## 1.009148 1.087887 1.089157

Panduan interpretasi VIF:

VIF < 5 → tidak ada masalah multikolinearitas
VIF 5–10 → multikolinearitas sedang, perlu diwaspadai
VIF > 10 → multikolinearitas kuat, perlu penanganan

Medu dan Fedu berpotensi berkorelasi karena sama-sama mengukur tingkat pendidikan orang tua. Jika VIF tinggi, pertimbangkan untuk hanya menggunakan salah satu variabel.

3.7.2 Asumsi IIA

IIA (Independence of Irrelevant Alternatives) adalah asumsi utama model multinomial logit: peluang relatif antara dua kategori tidak berubah meskipun kategori lain ditambahkan atau dihilangkan.

Implikasi pada studi kasus ini:

Model mengasumsikan bahwa peluang relatif ibu bekerja sebagai health vs at_home tidak terpengaruh oleh ada atau tidaknya kategori teacher. Asumsi ini bisa bermasalah jika kategori-kategori pekerjaan saling mensubstitusi — misalnya services dan other yang mungkin tumpang tindih secara konseptual.

Cara deteksi: Uji Hausman-McFadden (tersedia di paket mlogit). Untuk keperluan studi kasus ini, IIA diasumsikan terpenuhi namun perlu diingat sebagai keterbatasan.

3.8 Interpretasi Hasil

Medu (pendidikan ibu) — semakin tinggi pendidikan ibu, semakin besar kemungkinan ia bekerja sebagai teacher atau health dibanding at_home.
Fjob (pekerjaan ayah) — pekerjaan ayah berkorelasi dengan pekerjaan ibu di beberapa kategori.
Akurasi — kategori dengan frekuensi kecil seperti health cenderung sulit diprediksi. RRR yang sangat ekstrem (sangat besar atau sangat kecil) menandakan kemungkinan adanya perfect separation atau ketidakseimbangan kelas yang kuat.

⚠️ Catatan: Model multinomial dapat menghasilkan RRR yang sangat ekstrem jika ada kategori dengan frekuensi sangat kecil (seperti health = 8.6%). Interpretasi koefisien untuk kategori tersebut harus dilakukan dengan hati-hati.

3.9 Kesimpulan dan Keterbatasan

Kesimpulan: Model regresi logistik multinomial berhasil diestimasi dengan lima kategori respons. Variabel pendidikan ibu (Medu) dan pekerjaan ayah (Fjob) tampak menjadi prediktor paling informatif dalam membedakan antar kategori pekerjaan ibu.

Keterbatasan:

Ketidakseimbangan kelas — kategori health hanya 8.6% dari data, sehingga prediksi untuk kategori ini cenderung kurang akurat dan koefisiennya tidak stabil.
Asumsi IIA — model multinomial mengasumsikan pilihan antar kategori bersifat independen, yang belum tentu terpenuhi jika kategori-kategori pekerjaan saling berdekatan secara konseptual.
Multikolinearitas — Medu dan Fedu berpotensi berkorelasi tinggi, sehingga interpretasi koefisien masing-masing perlu dilakukan dengan hati-hati.
RRR ekstrem — beberapa nilai RRR yang sangat besar atau kecil perlu diwaspadai sebagai indikasi ketidakstabilan estimasi akibat ukuran kategori yang kecil.

4 Studi Kasus 2: Regresi Logistik Ordinal

4.1 Latar Belakang

Pertanyaan penelitian: Apakah waktu belajar, riwayat kegagalan, jumlah ketidakhadiran, dan akses internet memengaruhi probabilitas tingkat prestasi akademik siswa?

Variabel yang digunakan:

Peran	Variabel	Keterangan
Respons (Y)	`prestasi`	Dibuat dari `G3`: Rendah (0–9), Sedang (10–14), Tinggi (15–20)
Prediktor (X)	`studytime`	Waktu belajar per minggu (1–4)
	`failures`	Jumlah kegagalan kelas sebelumnya (0–3)
	`absences`	Jumlah ketidakhadiran
	`internet`	Akses internet di rumah (yes/no)

Alasan pemilihan model: Variabel prestasi memiliki urutan alami (Rendah < Sedang < Tinggi), sehingga tepat dimodelkan dengan regresi logistik ordinal menggunakan MASS::polr(). Model ini mengasumsikan proportional odds — efek prediktor sama di semua batas kumulatif.

4.2 Pembuatan Variabel Respons Ordinal

d1 <- d1 %>%
  mutate(
    prestasi = cut(G3,
                   breaks         = c(-Inf, 9, 14, Inf),
                   labels         = c("Rendah","Sedang","Tinggi"),
                   ordered_result = TRUE)
  )

dplyr::glimpse(d1)

## Rows: 395
## Columns: 34
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services, other, …
## $ Fjob       <fct> teacher, other, other, services, other, other, other, teach…
## $ reason     <fct> course, course, other, home, home, reputation, home, home, …
## $ guardian   <fct> mother, father, mother, mother, father, mother, mother, mot…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…
## $ prestasi   <ord> Rendah, Rendah, Sedang, Tinggi, Sedang, Tinggi, Sedang, Ren…

4.3 Distribusi Variabel Respons

d1 %>%
  count(prestasi) %>%
  mutate(proporsi = n / sum(n)) %>%
  kable(digits = 3, caption = "Distribusi tingkat prestasi akademik siswa") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = FALSE)

Distribusi tingkat prestasi akademik siswa
prestasi	n	proporsi
Rendah	130	0.329
Sedang	192	0.486
Tinggi	73	0.185

d1 %>%
  count(prestasi) %>%
  mutate(proporsi = n / sum(n)) %>%
  ggplot(aes(x = prestasi, y = proporsi, fill = prestasi)) +
  geom_col(width = 0.65, alpha = 0.94) +
  geom_text(aes(label = percent(proporsi, accuracy = 0.1)),
            vjust = -0.4, fontface = "bold", color = "#243142") +
  scale_y_continuous(labels = percent_format(), limits = c(0, 0.55)) +
  scale_fill_manual(values = c("#2f7f73","#5568b8","#d18b2f")) +
  labs(title    = "Distribusi Tingkat Prestasi Akademik",
       subtitle = "Respons ordinal: kategori memiliki urutan alami.",
       x = NULL, y = "Proporsi") +
  theme_minimal() +
  theme(legend.position = "none")

4.4 Estimasi Model

fit_ord <- MASS::polr(
  prestasi ~ studytime + failures + absences + internet,
  data   = d1,
  method = "logistic",
  Hess   = TRUE
)

summary(fit_ord)

## Call:
## MASS::polr(formula = prestasi ~ studytime + failures + absences + 
##     internet, data = d1, Hess = TRUE, method = "logistic")
## 
## Coefficients:
##                Value Std. Error t value
## studytime    0.07819    0.11916  0.6561
## failures    -0.98287    0.16796 -5.8520
## absences    -0.02845    0.01283 -2.2173
## internetyes  0.51706    0.26453  1.9546
## 
## Intercepts:
##               Value   Std. Error t value
## Rendah|Sedang -0.6725  0.3484    -1.9303
## Sedang|Tinggi  1.7450  0.3608     4.8365
## 
## Residual Deviance: 753.7981 
## AIC: 765.7981

4.5 Tabel Koefisien, OR, dan Confidence Interval

Perhatikan dua konvensi tanda pada output polr():

Estimate polr — tanda seperti yang dikeluarkan R
Estimate slide — tanda dibalik untuk koefisien (bukan cutpoint), agar interpretasi OR lebih intuitif: nilai positif berarti meningkatkan peluang menuju kategori lebih tinggi

ord_coef <- coef(summary(fit_ord))

result_ord <- as.data.frame(ord_coef) %>%
  tibble::rownames_to_column("parameter") %>%
  dplyr::rename(estimate = Value, std_error = `Std. Error`, t_value = `t value`) %>%
  mutate(
    p_value        = 2 * (1 - pnorm(abs(t_value))),
    jenis          = ifelse(grepl("\\|", parameter), "Cutpoint", "Koefisien"),
    estimate_slide = ifelse(jenis == "Koefisien", -estimate, estimate),
    OR_polr        = ifelse(jenis == "Koefisien", exp(estimate),                    NA_real_),
    OR_slide       = ifelse(jenis == "Koefisien", exp(estimate_slide),              NA_real_),
    CI_low_polr    = ifelse(jenis == "Koefisien", exp(estimate - 1.96 * std_error), NA_real_),
    CI_high_polr   = ifelse(jenis == "Koefisien", exp(estimate + 1.96 * std_error), NA_real_)
  )

result_ord %>%
  mutate(across(c(estimate, estimate_slide, std_error, t_value, p_value,
                  OR_polr, OR_slide, CI_low_polr, CI_high_polr),
                ~ round(.x, 4))) %>%
  kable(caption   = "Ringkasan hasil regresi logistik ordinal",
        col.names = c("Parameter","Estimate polr","SE","z/t-value","p-value",
                      "Jenis","Estimate slide","OR polr","OR slide",
                      "CI polr 2.5%","CI polr 97.5%")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = TRUE)

Ringkasan hasil regresi logistik ordinal
Parameter	Estimate polr	SE	z/t-value	p-value	Jenis	Estimate slide	OR polr	OR slide	CI polr 2.5%	CI polr 97.5%
studytime	0.0782	0.1192	0.6561	0.5117	Koefisien	-0.0782	1.0813	0.9248	0.8561	1.3658
failures	-0.9829	0.1680	-5.8520	0.0000	Koefisien	0.9829	0.3742	2.6721	0.2693	0.5201
absences	-0.0284	0.0128	-2.2173	0.0266	Koefisien	0.0284	0.9720	1.0289	0.9478	0.9967
internetyes	0.5171	0.2645	1.9546	0.0506	Koefisien	-0.5171	1.6771	0.5963	0.9986	2.8166
Rendah\|Sedang	-0.6725	0.3484	-1.9303	0.0536	Cutpoint	-0.6725	NA	NA	NA	NA
Sedang\|Tinggi	1.7450	0.3608	4.8365	0.0000	Cutpoint	1.7450	NA	NA	NA	NA

4.6 Prediksi dan Evaluasi Model

pred_prob_ord <- predict(fit_ord, type = "probs")
head(pred_prob_ord)

##      Rendah    Sedang     Tinggi
## 1 0.3411401 0.5119855 0.14687441
## 2 0.2258012 0.5401113 0.23408758
## 3 0.8684273 0.1182474 0.01332537
## 4 0.2030637 0.5377672 0.25916918
## 5 0.3284704 0.5173821 0.15414751
## 6 0.2570244 0.5380933 0.20488230

d1_pred_ord <- d1 %>%
  bind_cols(as.data.frame(pred_prob_ord)) %>%
  mutate(prediksi = predict(fit_ord, type = "class"))

head(d1_pred_ord)

school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian	traveltime	studytime	failures	schoolsup	famsup	paid	activities	nursery	higher	internet	romantic	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3	prestasi	Rendah	Sedang	Tinggi	prediksi
GP	F	18	U	GT3	A	4	4	at_home	teacher	course	mother	2	2	0	yes	no	no	no	yes	yes	no	no	4	3	4	1	1	3	6	5	6	6	Rendah	0.3411401	0.5119855	0.1468744	Sedang
GP	F	17	U	GT3	T	1	1	at_home	other	course	father	1	2	0	no	yes	no	no	no	yes	yes	no	5	3	3	1	1	3	4	5	5	6	Rendah	0.2258012	0.5401113	0.2340876	Sedang
GP	F	15	U	LE3	T	1	1	at_home	other	other	mother	1	2	3	yes	no	yes	no	yes	yes	yes	no	4	3	2	2	3	3	10	7	8	10	Sedang	0.8684273	0.1182474	0.0133254	Rendah
GP	F	15	U	GT3	T	4	2	health	services	home	mother	1	3	0	no	yes	yes	yes	yes	yes	yes	yes	3	2	2	1	1	5	2	15	14	15	Tinggi	0.2030637	0.5377672	0.2591692	Sedang
GP	F	16	U	GT3	T	3	3	other	other	home	father	1	2	0	no	yes	yes	no	yes	yes	no	no	4	3	2	1	2	5	4	6	10	10	Sedang	0.3284704	0.5173821	0.1541475	Sedang
GP	M	16	U	LE3	T	4	3	services	other	reputation	mother	1	2	0	no	yes	yes	yes	yes	yes	yes	no	5	4	2	1	2	5	10	15	15	15	Tinggi	0.2570244	0.5380933	0.2048823	Sedang

conf_ord <- table(Aktual = d1_pred_ord$prestasi, Prediksi = d1_pred_ord$prediksi)
conf_ord

##         Prediksi
## Aktual   Rendah Sedang Tinggi
##   Rendah     46     84      0
##   Sedang     25    167      0
##   Tinggi      2     71      0

accuracy_ord <- sum(diag(conf_ord)) / sum(conf_ord)
accuracy_ord

## [1] 0.5392405

4.7 Visualisasi Prediksi Probabilitas

grid_ord <- expand.grid(
  studytime = seq(min(d1$studytime), max(d1$studytime), length.out = 120),
  failures  = mean(d1$failures),
  absences  = mean(d1$absences),
  internet  = "yes"
)

grid_prob_ord <- predict(fit_ord, newdata = grid_ord, type = "probs")

grid_ord_plot <- grid_ord %>%
  bind_cols(as.data.frame(grid_prob_ord)) %>%
  tidyr::pivot_longer(cols = c("Rendah","Sedang","Tinggi"),
                      names_to = "prestasi", values_to = "probabilitas")

ggplot(grid_ord_plot, aes(x = studytime, y = probabilitas, color = prestasi)) +
  geom_line(linewidth = 1.25) +
  scale_y_continuous(labels = percent_format()) +
  scale_color_manual(values = c("#2f7f73","#5568b8","#d18b2f")) +
  labs(title    = "Prediksi Probabilitas Tingkat Prestasi Akademik",
       subtitle = "Variabel lain ditahan pada nilai rata-rata; internet = yes.",
       x = "Waktu belajar (studytime)", y = "Probabilitas prediksi", color = "Prestasi") +
  theme_minimal()

4.8 Visualisasi Parallel Lines (Cumulative Logit)

Plot ini digunakan untuk memverifikasi asumsi proportional odds secara visual: ketiga garis kumulatif seharusnya sejajar satu sama lain.

eps <- 1e-6

grid_ord_cumlogit <- grid_ord %>%
  bind_cols(as.data.frame(grid_prob_ord)) %>%
  mutate(
    `P(Y <= Rendah)` = Rendah,
    `P(Y <= Sedang)` = Rendah + Sedang,
    `P(Y <= Tinggi)` = Rendah + Sedang + Tinggi
  ) %>%
  tidyr::pivot_longer(cols      = c(`P(Y <= Rendah)`, `P(Y <= Sedang)`, `P(Y <= Tinggi)`),
                      names_to  = "batas_kumulatif",
                      values_to = "prob_kumulatif") %>%
  mutate(prob_kumulatif   = pmin(pmax(prob_kumulatif, eps), 1 - eps),
         cumulative_logit = qlogis(prob_kumulatif))

ggplot(grid_ord_cumlogit,
       aes(x = studytime, y = cumulative_logit, color = batas_kumulatif)) +
  geom_line(linewidth = 1.25) +
  scale_color_manual(values = c("P(Y <= Rendah)" = "#2f7f73",
                                "P(Y <= Sedang)"  = "#d18b2f",
                                "P(Y <= Tinggi)"  = "#5568b8")) +
  labs(title    = "Parallel Lines pada Model Ordinal",
       subtitle = "Garis yang sejajar mengindikasikan asumsi proportional odds terpenuhi.",
       x = "Waktu belajar (studytime)", y = "Logit peluang kumulatif",
       color = "Batas kumulatif") +
  theme_minimal()

4.9 Evaluasi Asumsi Model

4.9.1 Brant Test

Brant Test menguji apakah asumsi proportional odds terpenuhi secara statistik. Jika p-value < 0.05 untuk suatu variabel, asumsi dilanggar untuk variabel tersebut.

# install.packages("brant")  # jalankan jika belum terpasang
library(brant)
brant::brant(fit_ord)

Interpretasi Brant Test:

H₀: Asumsi proportional odds terpenuhi (koefisien sama di semua batas kumulatif)
H₁: Asumsi proportional odds dilanggar
Jika p-value omnibus > 0.05 → asumsi terpenuhi, model ordinal valid
Jika p-value < 0.05 pada variabel tertentu → pertimbangkan partial proportional odds model

4.9.2 Parallel Lines Test (Formal)

Uji formal menggunakan ordinal::nominal_test() — membandingkan model dengan dan tanpa asumsi proportional odds untuk tiap prediktor.

fit_clm <- ordinal::clm(
  prestasi ~ studytime + failures + absences + internet,
  data = d1,
  link = "logit"
)

summary(fit_clm)

## formula: prestasi ~ studytime + failures + absences + internet
## data:    d1
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  395  -376.90 765.80 5(0)  5.05e-11 2.7e+03
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## studytime    0.07819    0.11916   0.656   0.5117    
## failures    -0.98288    0.16796  -5.852 4.86e-09 ***
## absences    -0.02845    0.01283  -2.217   0.0266 *  
## internetyes  0.51706    0.26453   1.955   0.0506 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##               Estimate Std. Error z value
## Rendah|Sedang  -0.6725     0.3484  -1.930
## Sedang|Tinggi   1.7450     0.3608   4.836

ordinal::nominal_test(fit_clm)

	Df	logLik	AIC	LRT	Pr(>Chi)
	NA	-376.8990	765.7981	NA	NA
studytime	1	-376.5763	767.1527	0.6454078	0.4217593
failures	1	-376.1940	766.3880	1.4100916	0.2350412
absences	1	-375.5798	765.1596	2.6385036	0.1043017
internet	1	-375.4776	764.9553	2.8427806	0.0917853

Interpretasi: Jika p-value nominal_test untuk semua variabel > 0.05, asumsi proportional odds tidak ditolak dan model ordinal dapat dipertahankan. Jika ada variabel yang melanggar, pertimbangkan model ordinal generalisasi.

4.10 Interpretasi Hasil

failures — prediktor paling kuat; setiap tambahan 1 kegagalan sebelumnya secara signifikan menurunkan peluang berada di kategori prestasi lebih tinggi.
studytime — waktu belajar lebih banyak meningkatkan peluang berada di kategori Sedang atau Tinggi.
absences — ketidakhadiran lebih banyak menurunkan prestasi, sesuai ekspektasi.
internet — akses internet berpengaruh positif, kemungkinan karena mendukung belajar mandiri.
Cutpoints menunjukkan batas laten antarkategori, bukan koefisien prediktor.

4.11 Kesimpulan dan Keterbatasan

Kesimpulan: Model regresi logistik ordinal berhasil mengidentifikasi faktor-faktor yang memengaruhi tingkat prestasi siswa secara bertingkat. Riwayat kegagalan (failures) dan waktu belajar (studytime) adalah prediktor paling konsisten signifikan.

Keterbatasan:

Asumsi proportional odds — perlu diuji secara formal via Brant Test dan nominal_test(). Jika dilanggar, model parsial atau generalized ordinal lebih tepat digunakan.
Kategorisasi G3 bersifat arbitrer — batas (0–9, 10–14, 15–20) memengaruhi distribusi dan koefisien. Batas yang berbeda akan menghasilkan hasil yang berbeda.
Korelasi antar prediktor — failures dan absences berpotensi saling berkorelasi, yang dapat memengaruhi kestabilan estimasi koefisien.

5 Studi Kasus 3: Regresi Poisson

5.1 Latar Belakang

Pertanyaan penelitian: Apakah waktu belajar, riwayat kegagalan, akses internet, dan kondisi kesehatan memengaruhi jumlah ketidakhadiran siswa?

Variabel yang digunakan:

Peran	Variabel	Keterangan
Respons (Y)	`absences`	Jumlah hari tidak hadir (bilangan cacah ≥ 0)
Prediktor (X)	`studytime`	Waktu belajar per minggu (1–4)
	`failures`	Jumlah kegagalan kelas sebelumnya (0–3)
	`internet`	Akses internet di rumah (yes/no)
	`health`	Status kesehatan siswa (1–5)

Alasan pemilihan model: Variabel absences merupakan data hitungan (count data) — bilangan bulat non-negatif — yang secara alami dimodelkan dengan distribusi Poisson menggunakan glm(..., family = poisson(link = "log")).

5.2 Distribusi Variabel Respons

d1 %>%
  count(absences) %>%
  mutate(proporsi = n / sum(n)) %>%
  kable(digits = 3, caption = "Distribusi jumlah ketidakhadiran siswa (absences)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = FALSE)

Distribusi jumlah ketidakhadiran siswa (absences)
absences	n	proporsi
0	115	0.291
1	3	0.008
2	65	0.165
3	8	0.020
4	53	0.134
5	5	0.013
6	31	0.078
7	7	0.018
8	22	0.056
9	3	0.008
10	17	0.043
11	3	0.008
12	12	0.030
13	3	0.008
14	12	0.030
15	3	0.008
16	7	0.018
17	1	0.003
18	5	0.013
19	1	0.003
20	4	0.010
21	1	0.003
22	3	0.008
23	1	0.003
24	1	0.003
25	1	0.003
26	1	0.003
28	1	0.003
30	1	0.003
38	1	0.003
40	1	0.003
54	1	0.003
56	1	0.003
75	1	0.003

d1 %>%
  count(absences) %>%
  ggplot(aes(x = absences, y = n)) +
  geom_col(width = 0.8, fill = "#d18b2f", alpha = 0.85) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title    = "Distribusi Jumlah Ketidakhadiran Siswa",
       subtitle = "Respons Poisson berupa hitungan bilangan bulat nonnegatif.",
       x = "Jumlah ketidakhadiran (absences)", y = "Frekuensi") +
  theme_minimal()

5.3 Estimasi Model

fit_pois <- glm(
  absences ~ studytime + failures + internet + health,
  data   = d1,
  family = poisson(link = "log")
)

summary(fit_pois)

## 
## Call:
## glm(formula = absences ~ studytime + failures + internet + health, 
##     family = poisson(link = "log"), data = d1)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.61709    0.10108  15.999  < 2e-16 ***
## studytime   -0.10422    0.02635  -3.955 7.64e-05 ***
## failures     0.10412    0.02617   3.979 6.93e-05 ***
## internetyes  0.46255    0.06657   6.948 3.70e-12 ***
## health      -0.02924    0.01504  -1.945   0.0518 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3223.3  on 394  degrees of freedom
## Residual deviance: 3133.5  on 390  degrees of freedom
## AIC: 4152.9
## 
## Number of Fisher Scoring iterations: 6

5.4 Tabel Koefisien, IRR, dan Confidence Interval

IRR (Incidence Rate Ratio) = exp(koefisien). IRR > 1 berarti prediktor meningkatkan jumlah ketidakhadiran; IRR < 1 berarti menurunkan.

pois_coef <- as.data.frame(coef(summary(fit_pois))) %>%
  tibble::rownames_to_column("parameter") %>%
  dplyr::rename(estimate = Estimate, std_error = `Std. Error`,
                z_value = `z value`, p_value = `Pr(>|z|)`) %>%
  mutate(
    IRR              = exp(estimate),
    CI_low           = exp(estimate - 1.96 * std_error),
    CI_high          = exp(estimate + 1.96 * std_error),
    perubahan_persen = 100 * (IRR - 1)
  )

pois_coef %>%
  mutate(across(c(estimate, std_error, z_value, p_value,
                  IRR, CI_low, CI_high, perubahan_persen), ~ round(.x, 4))) %>%
  kable(caption   = "Ringkasan hasil regresi Poisson",
        col.names = c("Parameter","Estimate","SE","z-value","p-value",
                      "IRR","CI 2.5%","CI 97.5%","Perubahan (%)")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = TRUE)

Ringkasan hasil regresi Poisson
Parameter	Estimate	SE	z-value	p-value	IRR	CI 2.5%	CI 97.5%	Perubahan (%)
(Intercept)	1.6171	0.1011	15.9988	0.0000	5.0384	4.1329	6.1423	403.8431
studytime	-0.1042	0.0263	-3.9554	0.0001	0.9010	0.8557	0.9488	-9.8975
failures	0.1041	0.0262	3.9787	0.0001	1.1097	1.0542	1.1681	10.9733
internetyes	0.4626	0.0666	6.9483	0.0000	1.5881	1.3939	1.8095	58.8120
health	-0.0292	0.0150	-1.9448	0.0518	0.9712	0.9430	1.0002	-2.8817

5.5 Visualisasi Prediksi Rate

grid_pois <- expand.grid(
  studytime = seq(min(d1$studytime), max(d1$studytime), length.out = 120),
  failures  = mean(d1$failures),
  internet  = "yes",
  health    = mean(d1$health)
)

pred_pois <- predict(fit_pois, newdata = grid_pois, type = "link", se.fit = TRUE)

grid_pois_plot <- grid_pois %>%
  mutate(fit_link  = pred_pois$fit,
         se_link   = pred_pois$se.fit,
         rate      = exp(fit_link),
         rate_low  = exp(fit_link - 1.96 * se_link),
         rate_high = exp(fit_link + 1.96 * se_link))

ggplot(grid_pois_plot, aes(x = studytime, y = rate)) +
  geom_ribbon(aes(ymin = rate_low, ymax = rate_high), fill = "#d18b2f", alpha = 0.16) +
  geom_line(color = "#d18b2f", linewidth = 1.35) +
  labs(title    = "Prediksi Jumlah Ketidakhadiran Siswa",
       subtitle = "Variabel lain ditahan pada nilai rata-rata; internet = yes.",
       x = "Waktu belajar (studytime)", y = "Prediksi jumlah ketidakhadiran") +
  theme_minimal()

5.6 Evaluasi Asumsi Model

5.6.1 Dispersion Pearson

Poisson mengasumsikan mean = variance. Rasio dispersi Pearson mengukur seberapa jauh asumsi ini terpenuhi.

dispersion_pois <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

tibble::tibble(
  `Dispersion Pearson` = dispersion_pois,
  `Interpretasi` = dplyr::case_when(
    dispersion_pois < 1.5 ~ "Tidak ada indikasi overdispersion berat",
    dispersion_pois < 2.5 ~ "Ada indikasi overdispersion sedang",
    TRUE                  ~ "Ada indikasi overdispersion kuat"
  )
) %>%
  mutate(`Dispersion Pearson` = round(`Dispersion Pearson`, 3)) %>%
  kable(caption = "Indikasi overdispersion pada model Poisson") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = FALSE)

Indikasi overdispersion pada model Poisson
Dispersion Pearson	Interpretasi
10.775	Ada indikasi overdispersion kuat

5.6.2 Cek Overdispersion & Nilai Nol

zero_share <- mean(d1$absences == 0)

tibble::tibble(
  `Proporsi Y = 0` = zero_share,
  `Catatan` = ifelse(
    zero_share > 0,
    "OLS pada log(Y) akan kehilangan observasi nol atau perlu transformasi log(Y+c).",
    "Tidak ada nilai nol, tetapi asumsi residual log-scale tetap perlu diperiksa."
  )
) %>%
  mutate(`Proporsi Y = 0` = percent(`Proporsi Y = 0`, accuracy = 0.1)) %>%
  kable(caption = "Konsekuensi nilai nol untuk model log(Y) dengan OLS") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = FALSE)

Konsekuensi nilai nol untuk model log(Y) dengan OLS
Proporsi Y = 0	Catatan
29.1%	OLS pada log(Y) akan kehilangan observasi nol atau perlu transformasi log(Y+c).

Ringkasan kondisi absences pada data ini:

Mean ≈ 5.71 | Variance ≈ 64.05 | Rasio Var/Mean ≈ 11.2
Proporsi nilai nol: 29.1% (115 dari 395 siswa)
Nilai maksimum: 75 hari (outlier ekstrem)

Rasio dispersi 11.2 jauh di atas 1, mengindikasikan overdispersion kuat.

5.6.3 Mengapa Negative Binomial Lebih Sesuai

Keterbatasan model Poisson pada data ini:

Model Poisson mengasumsikan mean = variance. Pada data absences, asumsi ini dilanggar secara serius (rasio ≈ 11.2). Akibatnya:

Standard error terlalu kecil → p-value terlalu optimistis → kesimpulan signifikansi tidak dapat diandalkan
Interval kepercayaan terlalu sempit → estimasi IRR tampak lebih presisi dari sebenarnya

Negative Binomial regression lebih sesuai karena:

Menambahkan parameter dispersi (θ) yang secara eksplisit memodelkan kelebihan variance
Variance NB: Var(Y) = μ + μ²/θ — saat θ → ∞, NB menyusut ke Poisson
Menghasilkan standard error yang lebih valid dan kesimpulan yang lebih hati-hati

Zero-Inflated Poisson (ZIP) bisa dipertimbangkan karena 29.1% siswa memiliki absences = 0, yang mungkin merupakan “structural zeros” (siswa yang memang tidak pernah absen karena alasan sistemik, bukan hanya kebetulan).

Untuk keperluan studi kasus ini, model Poisson tetap dijalankan dan dilaporkan, namun kesimpulan harus disampaikan dengan hati-hati.

5.7 Perbandingan: Poisson vs OLS pada log(Y)

d1_positive <- d1 %>% filter(absences > 0)

fit_log_ols <- lm(
  log(absences) ~ studytime + failures + internet + health,
  data = d1_positive
)

log_ols_coef <- as.data.frame(coef(summary(fit_log_ols))) %>%
  tibble::rownames_to_column("parameter") %>%
  dplyr::rename(estimate = Estimate, std_error = `Std. Error`,
                t_value = `t value`, p_value = `Pr(>|t|)`) %>%
  mutate(multiplier = exp(estimate), perubahan_persen = 100 * (multiplier - 1))

log_ols_coef %>%
  mutate(across(c(estimate, std_error, t_value, p_value,
                  multiplier, perubahan_persen), ~ round(.x, 4))) %>%
  kable(caption   = "Ilustrasi OLS pada log(absences) untuk observasi positif saja",
        col.names = c("Parameter","Estimate","SE","t-value","p-value",
                      "exp(Estimate)","Perubahan (%)")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = TRUE)

Ilustrasi OLS pada log(absences) untuk observasi positif saja
Parameter	Estimate	SE	t-value	p-value	exp(Estimate)	Perubahan (%)
(Intercept)	1.6199	0.2103	7.7010	0.0000	5.0525	405.2489
studytime	-0.0835	0.0587	-1.4240	0.1556	0.9199	-8.0148
failures	0.2027	0.0692	2.9304	0.0037	1.2247	22.4732
internetyes	0.3709	0.1288	2.8794	0.0043	1.4490	44.9014
health	-0.0269	0.0349	-0.7705	0.4416	0.9735	-2.6519

tibble::tibble(
  Aspek = c("Jenis respons","Nilai nol","Target model","Bentuk mean",
            "Interpretasi","Prediksi","Exposure","Kapan lebih tepat?"),
  `Regresi Poisson` = c(
    "Hitungan nonnegatif",
    "Dapat langsung menangani Y = 0",
    "E(Y | X)",
    "log{E(Y | X)} = X beta",
    "exp(beta) sebagai incidence rate ratio",
    "Selalu nonnegatif",
    "Offset log(exposure) sangat alami",
    "Count data, banyak nol, rate kejadian, atau target mean hitungan"),
  `OLS pada log(Y)` = c(
    "Kontinu positif atau hitungan besar tanpa nol",
    "Tidak dapat memakai Y = 0 tanpa modifikasi",
    "E(log Y | X)",
    "log(Y) = X beta + error",
    "exp(beta) sebagai multiplier pada geometric mean",
    "Perlu retransformation untuk kembali ke skala Y",
    "Biasanya memakai log(rate) atau kontrol exposure manual",
    "Y positif, log-scale residual baik, dan fokus pada perubahan persentase")
) %>%
  kable(caption = "Perbandingan regresi Poisson dan OLS pada log(Y)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = TRUE)

Perbandingan regresi Poisson dan OLS pada log(Y)
Aspek	Regresi Poisson	OLS pada log(Y)
Jenis respons	Hitungan nonnegatif	Kontinu positif atau hitungan besar tanpa nol
Nilai nol	Dapat langsung menangani Y = 0	Tidak dapat memakai Y = 0 tanpa modifikasi
Target model	E(Y \| X)	E(log Y \| X)
Bentuk mean	log{E(Y \| X)} = X beta	log(Y) = X beta + error
Interpretasi	exp(beta) sebagai incidence rate ratio	exp(beta) sebagai multiplier pada geometric mean
Prediksi	Selalu nonnegatif	Perlu retransformation untuk kembali ke skala Y
Exposure	Offset log(exposure) sangat alami	Biasanya memakai log(rate) atau kontrol exposure manual
Kapan lebih tepat?	Count data, banyak nol, rate kejadian, atau target mean hitungan	Y positif, log-scale residual baik, dan fokus pada perubahan persentase

5.8 Interpretasi Hasil

studytime — semakin banyak waktu belajar, semakin sedikit ketidakhadiran (IRR < 1).
failures — riwayat kegagalan berhubungan positif dengan absensi (IRR > 1).
internet dan health — kondisi kesehatan dan akses internet berkaitan dengan pola kehadiran.

⚠️ Peringatan penting: Overdispersion Pearson ≈ 11.2 menandakan bahwa p-value dan interval kepercayaan pada tabel koefisien tidak dapat diandalkan sepenuhnya. Kesimpulan tentang signifikansi variabel harus disampaikan dengan hati-hati. Model Negative Binomial direkomendasikan sebagai analisis lanjutan.

5.9 Kesimpulan dan Keterbatasan

Kesimpulan: Model regresi Poisson berhasil diestimasi dan mengidentifikasi arah pengaruh tiap prediktor terhadap jumlah ketidakhadiran siswa. Namun, adanya overdispersion kuat mengharuskan interpretasi hasil dilakukan dengan kehati-hatian ekstra.

Keterbatasan:

Overdispersion kuat (rasio dispersi Pearson ≈ 11.2) — asumsi Poisson (mean = variance) dilanggar secara serius. Model Negative Binomial lebih tepat digunakan karena menambahkan parameter dispersi eksplisit.
Excess zeros (29.1%) — jumlah nilai nol yang banyak mengindikasikan kemungkinan perlunya Zero-Inflated Poisson atau Zero-Inflated Negative Binomial.
Outlier ekstrem — nilai absences maksimum 75 hari sangat memengaruhi estimasi. Perlu diperiksa apakah observasi ini valid.
Tidak ada variabel exposure — semua siswa diasumsikan diamati dalam periode yang sama, yang belum tentu akurat.

6 Penutup

6.1 Ringkasan Tiga Studi Kasus

	Multinomial	Ordinal	Poisson
Variabel Y	`Mjob` (nominal)	`prestasi` (ordinal)	`absences` (count)
Fungsi utama	`nnet::multinom()`	`MASS::polr()`	`glm(..., poisson)`
Interpretasi	RRR vs baseline	OR kumulatif / slide	IRR
Evaluasi asumsi	VIF + IIA	Brant Test + Parallel Lines	Dispersion Pearson
Catatan utama	RRR ekstrem, ketidakseimbangan kelas	Uji proportional odds wajib	Overdispersion kuat → NB lebih tepat

6.2 Pemilihan Model Berdasarkan Jenis Respons

Jika respons nominal (tidak berurutan, >2 kategori) → Multinomial
Jika respons ordinal (berurutan, >2 kategori) → Ordinal (polr)
Jika respons count data (bilangan cacah ≥ 0) → Poisson, atau Negative Binomial jika terjadi overdispersion

Dataset: UCI Student Performance Dataset (student-mat.csv).

Analisis Regresi Logistik pada Data Prestasi Siswa

Studi Kasus Multinomial, Ordinal, dan Poisson

02 June 2026

1 Pendahuluan

2 Persiapan Data

2.1 Import dan Struktur Data

2.2 Konversi Tipe Variabel

2.3 Eksplorasi Awal

3 Studi Kasus 1: Regresi Logistik Multinomial

3.1 Latar Belakang

3.2 Distribusi Variabel Respons

3.3 Estimasi Model

3.4 Tabel Koefisien, RRR, dan Confidence Interval

3.5 Prediksi dan Evaluasi Model

3.6 Visualisasi Prediksi Probabilitas

3.7 Evaluasi Asumsi Model

3.7.1 Multikolinearitas

3.7.2 Asumsi IIA

3.8 Interpretasi Hasil

3.9 Kesimpulan dan Keterbatasan

4 Studi Kasus 2: Regresi Logistik Ordinal

4.1 Latar Belakang

4.2 Pembuatan Variabel Respons Ordinal

4.3 Distribusi Variabel Respons

4.4 Estimasi Model

4.5 Tabel Koefisien, OR, dan Confidence Interval

4.6 Prediksi dan Evaluasi Model

4.7 Visualisasi Prediksi Probabilitas

4.8 Visualisasi Parallel Lines (Cumulative Logit)

4.9 Evaluasi Asumsi Model

4.9.1 Brant Test

4.9.2 Parallel Lines Test (Formal)

4.10 Interpretasi Hasil

4.11 Kesimpulan dan Keterbatasan

5 Studi Kasus 3: Regresi Poisson

5.1 Latar Belakang

5.2 Distribusi Variabel Respons

5.3 Estimasi Model

5.4 Tabel Koefisien, IRR, dan Confidence Interval

5.5 Visualisasi Prediksi Rate

5.6 Evaluasi Asumsi Model

5.6.1 Dispersion Pearson

5.6.2 Cek Overdispersion & Nilai Nol

5.6.3 Mengapa Negative Binomial Lebih Sesuai

5.7 Perbandingan: Poisson vs OLS pada log(Y)

5.8 Interpretasi Hasil

5.9 Kesimpulan dan Keterbatasan

6 Penutup

6.1 Ringkasan Tiga Studi Kasus

6.2 Pemilihan Model Berdasarkan Jenis Respons