Pengaturan Path Data

Ganti path di bawah ini sesuai lokasi file CSV di komputermu. Setelah diisi, tidak perlu ubah apapun di seluruh dokumen ini.

# ── GANTI PATH DI BAWAH INI ──────────────────────────────────────────────────
# Contoh Windows : "C:/Users/Nama/Downloads/healthcare-dataset-stroke-data.csv"
# Contoh Mac     : "/Users/Nama/Downloads/healthcare-dataset-stroke-data.csv"

# Nama file CSV (jangan diubah kecuali nama filenya berbeda)
file_stroke  <- "healthcare-dataset-stroke-data.csv"
file_obesity <- "ObesityDataSet_raw_and_data_sinthetic.csv"
file_motor   <- "freMTPL2freq.csv"

# Auto-cari file: folder Rmd > working dir > subfolder umum
.find_csv <- function(fname) {
  candidates <- c(
    tryCatch(file.path(dirname(knitr::current_input(dir=TRUE)), fname), error=function(e) ""),
    fname,
    file.path("data",       fname),
    file.path("Data",       fname),
    file.path("dataset",    fname),
    file.path("Dataset",    fname),
    file.path("Tugas ADK",  fname)
  )
  found <- Filter(file.exists, candidates)
  if (length(found) == 0)
    stop(paste0("File tidak ditemukan: ", fname,
                "
Letakkan file CSV di folder yang SAMA dengan file .Rmd ini."))
  normalizePath(found[1])
}

path_stroke  <- .find_csv(file_stroke)
path_obesity <- .find_csv(file_obesity)
path_motor   <- .find_csv(file_motor)

cat("path_stroke :", path_stroke,  "
")
## path_stroke : C:\Users\user\OneDrive\Dokumen\Tugas ADK\healthcare-dataset-stroke-data.csv
cat("path_obesity:", path_obesity, "
")
## path_obesity: C:\Users\user\OneDrive\Dokumen\Tugas ADK\ObesityDataSet_raw_and_data_sinthetic.csv
cat("path_motor  :", path_motor,   "
")
## path_motor  : C:\Users\user\OneDrive\Dokumen\Tugas ADK\freMTPL2freq.csv
# ─────────────────────────────────────────────────────────────────────────────

1 Pendahuluan

Dalam statistika, tidak semua variabel respons berbentuk angka kontinu. Kadang kita ingin memodelkan:

  • Apakah seseorang stroke atau tidak? → 2 kategori → Regresi Biner
  • Seseorang masuk kategori berat badan apa? → banyak kategori nominal → Regresi Multinomial
  • Seseorang masuk kategori obesitas ringan, sedang, atau berat? → kategori berurutan → Regresi Ordinal
  • Berapa banyak klaim asuransi yang diajukan? → data hitung → Regresi Poisson


2 Bagian I — Regresi Logistik Biner

2.1 Apa itu Regresi Logistik Biner?

Regresi logistik biner digunakan ketika variabel respons hanya punya dua kemungkinan: ya/tidak, 1/0, sukses/gagal. Model ini memprediksi peluang terjadinya suatu kejadian lewat rumus:

\[\log\left(\frac{P(Y=1)}{P(Y=0)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Sisi kiri disebut log-odds atau logit. Interpretasi koefisien dilakukan lewat Odds Ratio (OR):

\[OR = e^{\hat\beta}\]

  • OR > 1 → prediktor meningkatkan peluang kejadian
  • OR < 1 → prediktor menurunkan peluang kejadian

2.2 Data yang Digunakan

Dataset: Stroke Prediction Dataset — Kaggle

Pertanyaan: Faktor apa yang memengaruhi kemungkinan seseorang mengalami stroke?

Variabel Keterangan
stroke Y: 1 = pernah stroke, 0 = tidak
age Usia (tahun)
gender Jenis kelamin
hypertension Riwayat hipertensi (1=ya)
heart_disease Penyakit jantung (1=ya)
ever_married Pernah menikah
work_type Jenis pekerjaan
avg_glucose_level Rata-rata kadar glukosa
bmi Indeks massa tubuh
smoking_status Status merokok

2.3 Persiapan Data

stroke_raw <- read.csv(path_stroke, stringsAsFactors = FALSE)

stroke <- stroke_raw |>
  dplyr::filter(gender != "Other") |>
  dplyr::mutate(
    bmi            = as.numeric(bmi),
    bmi            = ifelse(is.na(bmi), median(bmi, na.rm=TRUE), bmi),
    stroke         = as.integer(stroke),
    gender         = factor(gender),
    ever_married   = factor(ever_married),
    work_type      = factor(work_type),
    smoking_status = factor(smoking_status)
  )

cat("Jumlah baris:", nrow(stroke), "| Jumlah kolom:", ncol(stroke), "\n")
## Jumlah baris: 5109 | Jumlah kolom: 12
head(stroke[, c("age","gender","hypertension","stroke")], 5)

2.4 Distribusi Respons

stroke |>
  dplyr::count(stroke) |>
  dplyr::mutate(label  = ifelse(stroke==1,"Stroke","Tidak Stroke"),
                persen = round(n/sum(n)*100,1)) |>
  ggplot(aes(x=label, y=n, fill=label)) +
  geom_col(width=0.5) +
  geom_text(aes(label=paste0(n,"\n(",persen,"%)")),
            vjust=-0.4, size=4, color="#3d2b1f") +
  scale_fill_manual(values=c("Stroke"="#7b4f2e","Tidak Stroke"="#c9956c")) +
  labs(title="Distribusi Variabel Respons: Stroke", x=NULL, y="Jumlah Observasi") +
  theme_minimal(base_size=12) +
  theme(legend.position="none",
        plot.title=element_text(color="#5c3d2e", face="bold"),
        panel.grid.major.x=element_blank())

2.5 Estimasi Model

fit_biner <- glm(
  stroke ~ age + gender + hypertension + heart_disease +
           ever_married + work_type + avg_glucose_level +
           bmi + smoking_status,
  data   = stroke,
  family = binomial(link = "logit")
)

summary(fit_biner)
## 
## Call:
## glm(formula = stroke ~ age + gender + hypertension + heart_disease + 
##     ever_married + work_type + avg_glucose_level + bmi + smoking_status, 
##     family = binomial(link = "logit"), data = stroke)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -6.741149   0.782971  -8.610  < 2e-16 ***
## age                          0.074921   0.005832  12.847  < 2e-16 ***
## genderMale                   0.011478   0.141841   0.081 0.935502    
## hypertension                 0.401312   0.164910   2.434 0.014953 *  
## heart_disease                0.275832   0.191032   1.444 0.148766    
## ever_marriedYes             -0.189140   0.225219  -0.840 0.401018    
## work_typeGovt_job           -0.947043   0.835909  -1.133 0.257235    
## work_typeNever_worked      -10.312641 309.279046  -0.033 0.973400    
## work_typePrivate            -0.806679   0.819924  -0.984 0.325191    
## work_typeSelf-employed      -1.184617   0.840661  -1.409 0.158791    
## avg_glucose_level            0.004039   0.001198   3.371 0.000749 ***
## bmi                          0.001197   0.011321   0.106 0.915830    
## smoking_statusnever smoked  -0.211545   0.175705  -1.204 0.228598    
## smoking_statussmokes         0.115308   0.215282   0.536 0.592225    
## smoking_statusUnknown       -0.074250   0.208309  -0.356 0.721508    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1990.3  on 5108  degrees of freedom
## Residual deviance: 1581.5  on 5094  degrees of freedom
## AIC: 1611.5
## 
## Number of Fisher Scoring iterations: 14

2.6 Tabel Odds Ratio

tabel_or <- broom::tidy(fit_biner, exponentiate=TRUE, conf.int=TRUE) |>
  dplyr::filter(term != "(Intercept)") |>
  dplyr::transmute(
    Variabel       = term,
    OR             = round(estimate, 3),
    `CI 95% Bawah` = round(conf.low, 3),
    `CI 95% Atas`  = round(conf.high, 3),
    `p-value`      = round(p.value, 4),
    Signifikan     = ifelse(p.value < 0.05, "✔", "")
  )

kable(tabel_or, caption="Odds Ratio — Model Logistik Biner (Stroke)") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=FALSE) |>
  row_spec(which(tabel_or$Signifikan=="✔"), bold=TRUE, background="#f5ede4")
Odds Ratio — Model Logistik Biner (Stroke)
Variabel OR CI 95% Bawah CI 95% Atas p-value Signifikan
age 1.078 1.066 1.090 0.0000
genderMale 1.012 0.765 1.334 0.9355
hypertension 1.494 1.075 2.054 0.0150
heart_disease 1.318 0.899 1.902 0.1488
ever_marriedYes 0.828 0.540 1.310 0.4010
work_typeGovt_job 0.388 0.089 2.737 0.2572
work_typeNever_worked 0.000 0.000 0.000 0.9734
work_typePrivate 0.446 0.106 3.083 0.3252
work_typeSelf-employed 0.306 0.069 2.172 0.1588
avg_glucose_level 1.004 1.002 1.006 0.0007
bmi 1.001 0.979 1.023 0.9158
smoking_statusnever smoked 0.809 0.574 1.145 0.2286
smoking_statussmokes 1.122 0.732 1.705 0.5922
smoking_statusUnknown 0.928 0.614 1.393 0.7215

2.7 Interpretasi Hasil

Template interpretasi: “Dengan mengontrol variabel lain, kenaikan 1 satuan [variabel] akan [meningkatkan/menurunkan] odds stroke sebesar [(OR−1)×100]%.”

Contoh: Jika OR age = 1.073 → setiap bertambah 1 tahun usia, odds stroke meningkat 7.3%.

Jika OR hypertension = 1.8 → penderita hipertensi memiliki odds stroke 1.8× lebih tinggi dibanding yang tidak.

2.8 Prediksi Probabilitas

profil_biner <- data.frame(
  age               = c(45, 65, 75),
  gender            = factor("Male",        levels=levels(stroke$gender)),
  hypertension      = c(0, 1, 1),
  heart_disease     = c(0, 0, 1),
  ever_married      = factor("Yes",         levels=levels(stroke$ever_married)),
  work_type         = factor("Private",     levels=levels(stroke$work_type)),
  avg_glucose_level = c(90, 130, 180),
  bmi               = c(25, 28, 31),
  smoking_status    = factor("never smoked",levels=levels(stroke$smoking_status))
)

profil_biner$Prob_Stroke <- round(
  predict(fit_biner, newdata=profil_biner, type="response"), 4)

profil_biner |>
  dplyr::select(age, hypertension, heart_disease, avg_glucose_level, Prob_Stroke) |>
  kable(caption="Prediksi Probabilitas Stroke per Profil Pasien") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)
Prediksi Probabilitas Stroke per Profil Pasien
age hypertension heart_disease avg_glucose_level Prob_Stroke
45 0 0 90 0.0152
65 1 0 130 0.1084
75 1 1 180 0.2938

2.9 Pemeriksaan Asumsi

# VIF — nilai > 10 menandakan masalah multikolinearitas
car::vif(fit_biner)
##                       GVIF Df GVIF^(1/(2*Df))
## age               1.453751  1        1.205716
## gender            1.045755  1        1.022622
## hypertension      1.067613  1        1.033253
## heart_disease     1.090645  1        1.044340
## ever_married      1.107310  1        1.052288
## work_type         1.419549  4        1.044765
## avg_glucose_level 1.109793  1        1.053467
## bmi               1.110130  1        1.053627
## smoking_status    1.107225  3        1.017121

Cara baca: Semua nilai VIF < 5 → tidak ada masalah multikolinearitas.


3 Bagian II — Regresi Logistik Multinomial

3.1 Apa itu Regresi Logistik Multinomial?

Regresi multinomial digunakan ketika respons punya lebih dari dua kategori yang tidak berurutan (nominal). Model membandingkan setiap kategori terhadap satu kategori referensi.

\[\log\left(\frac{P(Y=k)}{P(Y=K)}\right) = \beta_{0k} + \beta_{1k}X_1 + \cdots\]

Interpretasi lewat Relative Risk Ratio (RRR) = \(e^{\hat\beta_k}\):

  • RRR > 1 → lebih mungkin masuk kategori \(k\) vs referensi
  • RRR < 1 → lebih kecil kemungkinan masuk kategori \(k\) vs referensi

3.2 Data yang Digunakan

Dataset: Obesity Levels — Kaggle

Pertanyaan: Faktor gaya hidup apa yang memengaruhi jenis/tipe obesitas seseorang?

Variabel Y: NObeyesdad — 7 kelas tipe berat badan (diperlakukan nominal di sini)

Kategori referensi: Normal_Weight

3.3 Persiapan Data

obesity_raw <- read.csv(path_obesity, stringsAsFactors = FALSE)

obesity <- obesity_raw |>
  dplyr::mutate(
    NObeyesdad = factor(NObeyesdad),
    Gender     = factor(Gender),
    family_history_with_overweight = factor(family_history_with_overweight),
    FAVC  = factor(FAVC),
    CAEC  = factor(CAEC),
    SMOKE = factor(SMOKE),
    SCC   = factor(SCC),
    CALC  = factor(CALC),
    MTRANS= factor(MTRANS)
  )

# Tetapkan referensi
obesity$NObeyesdad <- relevel(obesity$NObeyesdad, ref="Normal_Weight")

cat("Jumlah baris:", nrow(obesity), "\n")
## Jumlah baris: 2111
cat("Kategori respons:", levels(obesity$NObeyesdad), "\n")
## Kategori respons: Normal_Weight Insufficient_Weight Obesity_Type_I Obesity_Type_II Obesity_Type_III Overweight_Level_I Overweight_Level_II

3.4 Distribusi Respons

obesity |>
  dplyr::count(NObeyesdad) |>
  dplyr::mutate(persen=round(n/sum(n)*100,1)) |>
  ggplot(aes(x=reorder(NObeyesdad,n), y=n, fill=NObeyesdad)) +
  geom_col() +
  geom_text(aes(label=paste0(n," (",persen,"%)")),
            hjust=-0.1, size=3.2, color="#3d2b1f") +
  coord_flip() +
  scale_fill_manual(values=warna_utama) +
  labs(title="Distribusi Kategori Tipe Berat Badan", x=NULL, y="Jumlah") +
  theme_minimal(base_size=11) +
  theme(legend.position="none",
        plot.title=element_text(color="#5c3d2e", face="bold")) +
  expand_limits(y=max(table(obesity$NObeyesdad))*1.3)

3.5 Estimasi Model

fit_multi <- nnet::multinom(
  NObeyesdad ~ Gender + Age + Height + Weight +
               family_history_with_overweight + FAVC +
               FCVC + NCP + SMOKE + CH2O + FAF + MTRANS,
  data  = obesity,
  trace = FALSE
)

summary(fit_multi)
## Call:
## nnet::multinom(formula = NObeyesdad ~ Gender + Age + Height + 
##     Weight + family_history_with_overweight + FAVC + FCVC + NCP + 
##     SMOKE + CH2O + FAF + MTRANS, data = obesity, trace = FALSE)
## 
## Coefficients:
##                     (Intercept) GenderMale         Age    Height    Weight
## Insufficient_Weight   -95.70236  -1.273058 -0.08425994  128.4207 -2.460110
## Obesity_Type_I        263.32940  -5.893068  0.31526124 -367.5787  4.414481
## Obesity_Type_II       225.89468  11.783111  2.22059231 -545.3063  7.463585
## Obesity_Type_III     -116.01933 -48.566005 -1.08453395 -338.5186  6.546760
## Overweight_Level_I     89.79076  -1.895187  0.00188651 -108.1353  1.342047
## Overweight_Level_II   168.52713  -1.944015  0.16869372 -224.0748  2.778598
##                     family_history_with_overweightyes     FAVCyes       FCVC
## Insufficient_Weight                        1.82957271   1.1426820  1.9502407
## Obesity_Type_I                             3.20423375   0.5682154  0.3731138
## Obesity_Type_II                            9.93588716 -25.5192673 -2.5003571
## Obesity_Type_III                         -26.25658951  27.3391802 46.3914153
## Overweight_Level_I                        -0.06536155   1.2207695 -0.5141224
## Overweight_Level_II                        2.80185202  -0.6592257 -0.1535333
##                             NCP    SMOKEyes        CH2O         FAF
## Insufficient_Weight  0.81533976  -2.2213139   2.2471063   0.1775288
## Obesity_Type_I      -0.38321551   0.6227479  -0.1006173  -1.0628420
## Obesity_Type_II      0.25549994 -10.3451950 -15.7188832  -7.0170587
## Obesity_Type_III    15.94889911  -9.2785165  -5.9690961 -15.9000831
## Overweight_Level_I  -0.08990469  -2.3201564  -0.3302217  -0.2348865
## Overweight_Level_II -0.56368318  -2.3913814  -0.2806610  -0.5472741
##                       MTRANSBike MTRANSMotorbike MTRANSPublic_Transportation
## Insufficient_Weight  -98.1185940     -96.4155325                  -2.0055661
## Obesity_Type_I       -69.3373813       0.9058887                   5.1688312
## Obesity_Type_II       15.3261408     -61.5621812                  21.2027563
## Obesity_Type_III      46.9222141       3.9533022                   6.3612255
## Overweight_Level_I     0.7992455      -3.4189088                   0.1185843
## Overweight_Level_II -130.3906301      -2.5703507                   2.5815644
##                     MTRANSWalking
## Insufficient_Weight     3.1061742
## Obesity_Type_I         -2.4993637
## Obesity_Type_II         6.0624484
## Obesity_Type_III       28.5261547
## Overweight_Level_I     -0.1806994
## Overweight_Level_II    -3.9610614
## 
## Std. Errors:
##                     (Intercept) GenderMale        Age   Height     Weight
## Insufficient_Weight   3.3882852  1.1034959 0.11027381 2.682253 0.17556006
## Obesity_Type_I        7.7412417  1.8675877 0.12042646 3.601233 0.14694835
## Obesity_Type_II       1.9913697  6.2292105 0.22056342 1.204080 0.12199060
## Obesity_Type_III      0.8108081  2.4191704 0.76749562 1.312614 0.46946770
## Overweight_Level_I    5.6882194  0.7775832 0.03979964 6.143633 0.09014147
## Overweight_Level_II   4.6768925  1.0504432 0.07427981 5.926989 0.10981510
##                     family_history_with_overweightyes   FAVCyes      FCVC
## Insufficient_Weight                         1.0293479 1.0355895 0.8213216
## Obesity_Type_I                              1.8910832 1.5077941 1.1718407
## Obesity_Type_II                             3.1313750 2.3428184 1.8736325
## Obesity_Type_III                            2.7628832 4.2410473 2.5697159
## Overweight_Level_I                          0.5512685 0.6778299 0.4634032
## Overweight_Level_II                         1.0009781 0.9414146 0.7561710
##                           NCP  SMOKEyes      CH2O       FAF   MTRANSBike
## Insufficient_Weight 0.5872139 5.8243068 0.9135471 0.4836683          NaN
## Obesity_Type_I      0.7472319 3.4237880 0.9304676 0.6949466 2.060265e-13
## Obesity_Type_II     1.2368859 7.0188747 1.9728235 1.2912772          NaN
## Obesity_Type_III    2.3865470 0.3428527 7.4987085 3.0383142 2.633521e-15
## Overweight_Level_I  0.3164128 1.9694747 0.4231180 0.2893308 2.498476e+00
## Overweight_Level_II 0.4395546 2.4277830 0.6238588 0.4158433 7.358191e-59
##                     MTRANSMotorbike MTRANSPublic_Transportation MTRANSWalking
## Insufficient_Weight             NaN                   1.0991830  5.207919e+00
## Obesity_Type_I         1.075410e+01                   2.2407594  3.111064e+00
## Obesity_Type_II                 NaN                   2.3434905  5.615480e-08
## Obesity_Type_III       2.480674e-15                   4.1722832  1.400897e+00
## Overweight_Level_I     4.509724e+00                   0.6960332  1.011338e+00
## Overweight_Level_II    6.179924e+00                   1.2034489  1.935702e+00
## 
## Residual Deviance: 415.3648 
## AIC: 607.3648

3.6 p-value Koefisien

z_stat <- coef(fit_multi) / summary(fit_multi)$standard.errors
p_val  <- round(2*(1-pnorm(abs(z_stat))), 4)

p_val[, 1:6] |>
  kable(caption="p-value Koefisien Multinomial (6 variabel pertama)") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed"), font_size=11)
p-value Koefisien Multinomial (6 variabel pertama)
(Intercept) GenderMale Age Height Weight family_history_with_overweightyes
Insufficient_Weight 0 0.2486 0.4448 0 0 0.0755
Obesity_Type_I 0 0.0016 0.0088 0 0 0.0902
Obesity_Type_II 0 0.0585 0.0000 0 0 0.0015
Obesity_Type_III 0 0.0000 0.1576 0 0 0.0000
Overweight_Level_I 0 0.0148 0.9622 0 0 0.9056
Overweight_Level_II 0 0.0642 0.0231 0 0 0.0051

3.7 Tabel Relative Risk Ratio

round(exp(coef(fit_multi))[, 1:5], 3) |>
  kable(caption="RRR — Referensi: Normal_Weight (5 variabel pertama)") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed"), font_size=11)
RRR — Referensi: Normal_Weight (5 variabel pertama)
(Intercept) GenderMale Age Height Weight
Insufficient_Weight 0.000000e+00 0.280 0.919 5.920834e+55 0.085
Obesity_Type_I 2.304126e+114 0.003 1.371 0.000000e+00 82.639
Obesity_Type_II 1.272951e+98 131020.715 9.213 0.000000e+00 1743.387
Obesity_Type_III 0.000000e+00 0.000 0.338 0.000000e+00 696.982
Overweight_Level_I 9.899954e+38 0.150 1.002 0.000000e+00 3.827
Overweight_Level_II 1.550254e+73 0.143 1.184 0.000000e+00 16.096

3.8 Interpretasi Hasil

Template interpretasi: “Dibanding Normal_Weight, seseorang dengan [kondisi X] memiliki risiko relatif [RRR]× lebih [tinggi/rendah] untuk masuk kategori [Y], dengan variabel lain dikontrol.”

3.9 Prediksi Probabilitas

profil_multi <- data.frame(
  Gender = factor(c("Male","Female"), levels=levels(obesity$Gender)),
  Age=c(25,35), Height=c(1.75,1.60), Weight=c(90,75),
  family_history_with_overweight=factor("yes",
    levels=levels(obesity$family_history_with_overweight)),
  FAVC  = factor("yes",   levels=levels(obesity$FAVC)),
  FCVC=2, NCP=3,
  SMOKE = factor("no",    levels=levels(obesity$SMOKE)),
  CH2O=2, FAF=1,
  MTRANS= factor("Public_Transportation", levels=levels(obesity$MTRANS))
)

pred_multi <- round(predict(fit_multi, newdata=profil_multi, type="probs"), 4)
rownames(pred_multi) <- c("A: Laki-laki 25th 90kg","B: Perempuan 35th 75kg")

pred_multi |>
  kable(caption="Prediksi Probabilitas Tipe Obesitas") |>
  kable_styling(bootstrap_options=c("striped","hover"), font_size=11)
Prediksi Probabilitas Tipe Obesitas
Normal_Weight Insufficient_Weight Obesity_Type_I Obesity_Type_II Obesity_Type_III Overweight_Level_I Overweight_Level_II
A: Laki-laki 25th 90kg 0 0 0.0235 0 0 0e+00 0.9765
B: Perempuan 35th 75kg 0 0 0.2097 0 0 1e-04 0.7902

3.10 Visualisasi Prediksi

as.data.frame(pred_multi) |>
  tibble::rownames_to_column("Profil") |>
  tidyr::pivot_longer(-Profil, names_to="Kategori", values_to="Probabilitas") |>
  dplyr::mutate(Kategori=factor(Kategori, levels=c(
    "Insufficient_Weight","Normal_Weight","Overweight_Level_I",
    "Overweight_Level_II","Obesity_Type_I","Obesity_Type_II","Obesity_Type_III"
  ))) |>
  ggplot(aes(x=Kategori, y=Probabilitas, fill=Profil)) +
  geom_col(position="dodge") +
  scale_fill_manual(values=c("#a0704a","#c9956c")) +
  labs(title="Probabilitas Prediksi Tipe Obesitas per Profil", x=NULL) +
  theme_minimal(base_size=10) +
  theme(axis.text.x=element_text(angle=30, hjust=1),
        plot.title=element_text(color="#5c3d2e", face="bold"),
        legend.position="top")


4 Bagian III — Regresi Logistik Ordinal

4.1 Apa itu Regresi Logistik Ordinal?

Regresi ordinal digunakan ketika kategori respons punya urutan yang bermakna — misalnya kurus < normal < gemuk < obesitas. Model yang paling umum adalah Proportional Odds Model:

\[\log\left(\frac{P(Y \leq k)}{P(Y > k)}\right) = \alpha_k - (\beta_1 X_1 + \cdots + \beta_p X_p)\]

Interpretasi lewat OR = \(e^{\hat\beta}\):

  • OR > 1 → prediktor meningkatkan odds berada di kategori lebih tinggi
  • OR < 1 → prediktor menurunkan odds berada di kategori lebih tinggi

4.2 Data yang Digunakan

Dataset: Obesity Levels — Kaggle (dataset yang sama, perlakuan berbeda)

Pertanyaan: Faktor gaya hidup apa yang memengaruhi tingkat keparahan obesitas seseorang?

Variabel Y: NObeyesdad — diperlakukan ordinal dengan urutan:

\[\text{Insufficient} < \text{Normal} < \text{Overweight I} < \text{Overweight II} < \text{Obesity I} < \text{Obesity II} < \text{Obesity III}\]

4.3 Persiapan Data

# Gunakan data obesity yang sama, tapi Y-nya dijadikan factor ordered
obesity_ord <- obesity_raw |>
  dplyr::mutate(
    # Definisikan urutan kategori secara eksplisit
    NObeyesdad = factor(NObeyesdad,
      levels  = c("Insufficient_Weight","Normal_Weight",
                  "Overweight_Level_I","Overweight_Level_II",
                  "Obesity_Type_I","Obesity_Type_II","Obesity_Type_III"),
      ordered = TRUE),   # <-- ordered=TRUE agar R tahu ini ordinal!
    Gender = factor(Gender),
    family_history_with_overweight = factor(family_history_with_overweight),
    FAVC  = factor(FAVC),
    SMOKE = factor(SMOKE),
    CAEC  = factor(CAEC),
    CALC  = factor(CALC),
    SCC   = factor(SCC),
    MTRANS= factor(MTRANS)
  )

cat("Urutan kategori:\n")
## Urutan kategori:
print(levels(obesity_ord$NObeyesdad))
## [1] "Insufficient_Weight" "Normal_Weight"       "Overweight_Level_I" 
## [4] "Overweight_Level_II" "Obesity_Type_I"      "Obesity_Type_II"    
## [7] "Obesity_Type_III"

4.4 Distribusi Respons

obesity_ord |>
  dplyr::count(NObeyesdad) |>
  dplyr::mutate(persen=round(n/sum(n)*100,1)) |>
  ggplot(aes(x=NObeyesdad, y=n, fill=NObeyesdad)) +
  geom_col(width=0.55) +
  geom_text(aes(label=paste0(n,"\n(",persen,"%)")),
            vjust=-0.4, size=3.5, color="#3d2b1f") +
  scale_fill_manual(values=warna_utama) +
  labs(title="Distribusi Tingkat Berat Badan (Ordinal)",
       x=NULL, y="Jumlah Observasi") +
  theme_minimal(base_size=11) +
  theme(legend.position="none",
        axis.text.x=element_text(angle=25, hjust=1),
        plot.title=element_text(color="#5c3d2e", face="bold"))

4.5 Estimasi Model

# Scale variabel numerik agar optimisasi lebih stabil
obesity_ord_sc <- obesity_ord |>
  dplyr::mutate(dplyr::across(c(Age, Height, Weight, FCVC, NCP, CH2O, FAF),
                               ~ as.numeric(scale(.))))

# Menggunakan ordinal::clm (lebih stabil secara numerik dibanding MASS::polr)
# Keduanya menghasilkan Proportional Odds Model yang sama
fit_ordinal <- ordinal::clm(
  NObeyesdad ~ Gender + Age + Height + Weight +
               family_history_with_overweight + FAVC +
               FCVC + NCP + SMOKE + CH2O + FAF + MTRANS,
  data   = obesity_ord_sc,
  link   = "logit"
)

summary(fit_ordinal)
## formula: 
## NObeyesdad ~ Gender + Age + Height + Weight + family_history_with_overweight + FAVC + FCVC + NCP + SMOKE + CH2O + FAF + MTRANS
## data:    obesity_ord_sc
## 
##  link  threshold nobs logLik  AIC     niter max.grad cond.H 
##  logit flexible  2111 -524.00 1090.00 10(0) 7.96e-11 2.1e+03
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## GenderMale                        -1.77475    0.26740  -6.637 3.20e-11 ***
## Age                                0.26650    0.10930   2.438 0.014760 *  
## Height                            -7.78292    0.33639 -23.137  < 2e-16 ***
## Weight                            25.89478    1.09454  23.658  < 2e-16 ***
## family_history_with_overweightyes  1.51972    0.26653   5.702 1.18e-08 ***
## FAVCyes                            0.07068    0.25052   0.282 0.777846    
## FCVC                               0.19128    0.09415   2.032 0.042197 *  
## NCP                                0.28837    0.08413   3.428 0.000608 ***
## SMOKEyes                          -1.61665    0.45355  -3.564 0.000365 ***
## CH2O                              -0.27110    0.08682  -3.123 0.001793 ** 
## FAF                               -0.17583    0.09430  -1.865 0.062250 .  
## MTRANSBike                         3.43432    1.17463   2.924 0.003458 ** 
## MTRANSMotorbike                    2.75443    1.07296   2.567 0.010255 *  
## MTRANSPublic_Transportation        1.75473    0.26488   6.625 3.48e-11 ***
## MTRANSWalking                      0.74913    0.56215   1.333 0.182657    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##                                        Estimate Std. Error z value
## Insufficient_Weight|Normal_Weight      -28.5042     1.2997 -21.932
## Normal_Weight|Overweight_Level_I       -13.2652     0.7051 -18.812
## Overweight_Level_I|Overweight_Level_II  -6.6876     0.5462 -12.244
## Overweight_Level_II|Obesity_Type_I       1.9040     0.4872   3.908
## Obesity_Type_I|Obesity_Type_II          16.7557     0.8824  18.989
## Obesity_Type_II|Obesity_Type_III        27.9193     1.2427  22.467

4.6 Tabel Koefisien dan Odds Ratio

# ordinal::clm menghasilkan kolom "z value" dan "Pr(>|z|)"
coef_tbl <- coef(summary(fit_ordinal))

# Ambil hanya baris koefisien prediktor (bukan threshold/intercept)
beta_rows <- !grepl("\\|", rownames(coef_tbl))
coef_beta <- coef_tbl[beta_rows, , drop=FALSE]

data.frame(
  Variabel   = rownames(coef_beta),
  Koefisien  = round(coef_beta[,"Estimate"], 4),
  SE         = round(coef_beta[,"Std. Error"], 4),
  OR         = round(exp(coef_beta[,"Estimate"]), 4),
  `p-value`  = round(coef_beta[,"Pr(>|z|)"], 4),
  Signifikan = ifelse(coef_beta[,"Pr(>|z|)"] < 0.05, "✔", ""),
  check.names= FALSE
) |>
  kable(caption="Koefisien dan Odds Ratio — Model Ordinal") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed"))
Koefisien dan Odds Ratio — Model Ordinal
Variabel Koefisien SE OR p-value Signifikan
GenderMale GenderMale -1.7748 0.2674 1.695000e-01 0.0000
Age Age 0.2665 0.1093 1.305400e+00 0.0148
Height Height -7.7829 0.3364 4.000000e-04 0.0000
Weight Weight 25.8948 1.0945 1.761817e+11 0.0000
family_history_with_overweightyes family_history_with_overweightyes 1.5197 0.2665 4.570900e+00 0.0000
FAVCyes FAVCyes 0.0707 0.2505 1.073200e+00 0.7778
FCVC FCVC 0.1913 0.0942 1.210800e+00 0.0422
NCP NCP 0.2884 0.0841 1.334200e+00 0.0006
SMOKEyes SMOKEyes -1.6167 0.4535 1.986000e-01 0.0004
CH2O CH2O -0.2711 0.0868 7.625000e-01 0.0018
FAF FAF -0.1758 0.0943 8.388000e-01 0.0622
MTRANSBike MTRANSBike 3.4343 1.1746 3.101040e+01 0.0035
MTRANSMotorbike MTRANSMotorbike 2.7544 1.0730 1.571200e+01 0.0103
MTRANSPublic_Transportation MTRANSPublic_Transportation 1.7547 0.2649 5.781900e+00 0.0000
MTRANSWalking MTRANSWalking 0.7491 0.5621 2.115200e+00 0.1827

4.7 Interpretasi Hasil

Template interpretasi: “Setiap kenaikan 1 unit [variabel], odds berada di kategori berat badan lebih tinggi [meningkat/menurun] sebesar [(OR−1)×100]%, dengan variabel lain dikontrol.”

Contoh: Jika OR Weight = 1.15 → setiap kenaikan 1 kg berat badan, odds masuk kategori lebih tinggi (lebih gemuk) meningkat 15%.

4.8 Prediksi Probabilitas

# Helper: scale nilai baru menggunakan mean & sd dari data training
.sc <- function(x, col) (x - mean(obesity_ord[[col]], na.rm=TRUE)) / sd(obesity_ord[[col]], na.rm=TRUE)

profil_ord <- data.frame(
  Gender = factor(c("Male","Female","Male"),
                  levels=levels(obesity_ord$Gender)),
  Age    = .sc(c(22,35,50),    "Age"),
  Height = .sc(c(1.75,1.60,1.70), "Height"),
  Weight = .sc(c(65,80,100),   "Weight"),
  family_history_with_overweight=factor("yes",
    levels=levels(obesity_ord$family_history_with_overweight)),
  FAVC  = factor("yes",  levels=levels(obesity_ord$FAVC)),
  FCVC  = .sc(2, "FCVC"),
  NCP   = .sc(3, "NCP"),
  SMOKE = factor("no",   levels=levels(obesity_ord$SMOKE)),
  CH2O  = .sc(2, "CH2O"),
  FAF   = .sc(1, "FAF"),
  MTRANS= factor("Public_Transportation",
                 levels=levels(obesity_ord$MTRANS))
)

pred_ord <- round(predict(fit_ordinal, newdata=profil_ord, type="prob")$fit, 4)
rownames(pred_ord) <- c("A: 22th 65kg","B: 35th 80kg","C: 50th 100kg")

pred_ord |>
  kable(caption="Prediksi Probabilitas Tingkat Berat Badan per Profil") |>
  kable_styling(bootstrap_options=c("striped","hover"), font_size=11)
Prediksi Probabilitas Tingkat Berat Badan per Profil
Insufficient_Weight Normal_Weight Overweight_Level_I Overweight_Level_II Obesity_Type_I Obesity_Type_II Obesity_Type_III
A: 22th 65kg 0.0102 0.9898 0 0.0000 0.0000 0.0000 0
B: 35th 80kg 0.0000 0.0000 0 0.0212 0.9788 0.0000 0
C: 50th 100kg 0.0000 0.0000 0 0.0000 0.6753 0.3246 0

4.9 Visualisasi Prediksi

as.data.frame(pred_ord) |>
  tibble::rownames_to_column("Profil") |>
  tidyr::pivot_longer(-Profil, names_to="Kategori", values_to="Probabilitas") |>
  dplyr::mutate(Kategori=factor(Kategori, levels=c(
    "Insufficient_Weight","Normal_Weight","Overweight_Level_I",
    "Overweight_Level_II","Obesity_Type_I","Obesity_Type_II","Obesity_Type_III"
  ))) |>
  ggplot(aes(x=Kategori, y=Probabilitas, fill=Kategori)) +
  geom_col() +
  facet_wrap(~Profil) +
  scale_fill_manual(values=warna_utama) +
  labs(title="Prediksi Probabilitas Tingkat Berat Badan per Profil", x=NULL) +
  theme_minimal(base_size=9) +
  theme(axis.text.x=element_text(angle=35, hjust=1),
        legend.position="none",
        plot.title=element_text(color="#5c3d2e", face="bold"),
        strip.text=element_text(color="#5c3d2e", face="bold"))

4.10 Uji Asumsi Proportional Odds

# fit_ordinal sudah menggunakan ordinal::clm, langsung pakai untuk nominal_test
ordinal::nominal_test(fit_ordinal)

Cara baca: p-value > 0.05 → asumsi proportional odds terpenuhi → model ordinal valid.


5 Bagian IV — Regresi Poisson

5.1 Apa itu Regresi Poisson?

Regresi Poisson digunakan untuk data hitung (count data): jumlah kejadian dalam suatu periode. Model menggunakan log link:

\[\log(\mu_i) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Jika tiap unit punya durasi observasi berbeda, tambahkan offset:

\[\log(\mu_i) = \log(t_i) + \beta_0 + \beta_1 X_1 + \cdots\]

Interpretasi lewat Incidence Rate Ratio (IRR) = \(e^{\hat\beta}\):

  • IRR = 1.15 → prediktor naik 1 unit, rata-rata count naik 15%
  • IRR = 0.90 → prediktor naik 1 unit, rata-rata count turun 10%

5.2 Data yang Digunakan

Dataset: French Motor Claims — freMTPL2freq — Kaggle

Pertanyaan: Faktor apa yang memengaruhi frekuensi klaim asuransi kendaraan bermotor?

Variabel Keterangan
ClaimNb Y: jumlah klaim per polis (count)
Exposure Durasi polis dalam tahun (dipakai sebagai offset)
Area Zona wilayah (A–F)
VehPower Tenaga kendaraan
VehAge Usia kendaraan
DrivAge Usia pengemudi
BonusMalus Skor riwayat klaim (100=netral, >100=berisiko)
VehBrand Merek kendaraan
Density Kepadatan penduduk

5.3 Persiapan Data

motor_raw <- read.csv(path_motor, stringsAsFactors = FALSE)

motor <- motor_raw |>
  dplyr::mutate(
    Area     = factor(Area),
    VehBrand = factor(VehBrand),
    VehGas   = factor(VehGas),
    Region   = factor(Region),
    Exposure = pmin(Exposure, 1),
    ClaimNb  = as.integer(ClaimNb)
  ) |>
  dplyr::filter(ClaimNb <= 4)

cat("Baris:", nrow(motor),
    "| Rata-rata klaim:", round(mean(motor$ClaimNb),4),
    "| Variansi:", round(var(motor$ClaimNb),4), "\n")
## Baris: 678004 | Rata-rata klaim: 0.0531 | Variansi: 0.0564

5.4 Distribusi Respons

motor |>
  dplyr::count(ClaimNb) |>
  dplyr::mutate(persen=round(n/sum(n)*100,2)) |>
  ggplot(aes(x=factor(ClaimNb), y=n, fill=factor(ClaimNb))) +
  geom_col(width=0.55) +
  geom_text(aes(label=paste0(n,"\n(",persen,"%)")),
            vjust=-0.3, size=3.5, color="#3d2b1f") +
  scale_fill_manual(values=c("0"="#e8c9a0","1"="#c9956c",
                             "2"="#a0704a","3"="#7b4f2e","4"="#5c3d2e")) +
  labs(title="Distribusi Jumlah Klaim Asuransi per Polis",
       x="Jumlah Klaim", y="Jumlah Polis") +
  theme_minimal(base_size=12) +
  theme(legend.position="none",
        plot.title=element_text(color="#5c3d2e", face="bold"))

5.5 Estimasi Model

fit_poisson <- glm(
  ClaimNb ~ Area + VehPower + VehAge + DrivAge +
            BonusMalus + VehBrand + VehGas + Density +
            offset(log(Exposure)),
  data   = motor,
  family = poisson(link = "log")
)

summary(fit_poisson)
## 
## Call:
## glm(formula = ClaimNb ~ Area + VehPower + VehAge + DrivAge + 
##     BonusMalus + VehBrand + VehGas + Density + offset(log(Exposure)), 
##     family = poisson(link = "log"), data = motor)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.945e+00  4.093e-02 -96.392  < 2e-16 ***
## AreaB          4.474e-02  2.149e-02   2.082   0.0374 *  
## AreaC          7.494e-02  1.740e-02   4.307 1.65e-05 ***
## AreaD          1.460e-01  1.841e-02   7.930 2.20e-15 ***
## AreaE          1.903e-01  2.435e-02   7.816 5.46e-15 ***
## AreaF          1.362e-01  8.810e-02   1.546   0.1220    
## VehPower       1.565e-02  2.753e-03   5.687 1.30e-08 ***
## VehAge        -3.875e-02  1.165e-03 -33.278  < 2e-16 ***
## DrivAge        6.549e-03  3.936e-04  16.640  < 2e-16 ***
## BonusMalus     2.254e-02  3.070e-04  73.420  < 2e-16 ***
## VehBrandB10   -6.149e-03  3.681e-02  -0.167   0.8673    
## VehBrandB11    7.308e-02  3.960e-02   1.845   0.0650 .  
## VehBrandB12    1.266e-01  1.671e-02   7.576 3.56e-14 ***
## VehBrandB13    3.014e-02  4.092e-02   0.737   0.4614    
## VehBrandB14   -1.609e-01  7.842e-02  -2.052   0.0401 *  
## VehBrandB2    -1.105e-02  1.530e-02  -0.722   0.4701    
## VehBrandB3    -4.151e-03  2.187e-02  -0.190   0.8494    
## VehBrandB4    -3.927e-02  2.973e-02  -1.321   0.1865    
## VehBrandB5     6.250e-02  2.475e-02   2.526   0.0116 *  
## VehBrandB6    -4.562e-02  2.836e-02  -1.609   0.1077    
## VehGasRegular  5.295e-02  1.094e-02   4.842 1.28e-06 ***
## Density        1.626e-06  3.651e-06   0.445   0.6560    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 223632  on 678003  degrees of freedom
## Residual deviance: 216567  on 677982  degrees of freedom
## AIC: 285898
## 
## Number of Fisher Scoring iterations: 6

5.6 Tabel IRR

tabel_irr <- broom::tidy(fit_poisson, exponentiate=TRUE, conf.int=TRUE) |>
  dplyr::filter(term != "(Intercept)") |>
  dplyr::transmute(
    Variabel       = term,
    IRR            = round(estimate, 4),
    `CI 95% Bawah` = round(conf.low, 4),
    `CI 95% Atas`  = round(conf.high, 4),
    `p-value`      = round(p.value, 4),
    Signifikan     = ifelse(p.value < 0.05, "✔", "")
  ) |>
  head(15)

kable(tabel_irr, caption="Incidence Rate Ratio — Model Poisson (15 teratas)") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed")) |>
  row_spec(which(tabel_irr$Signifikan=="✔"), bold=TRUE, background="#f5ede4")
Incidence Rate Ratio — Model Poisson (15 teratas)
Variabel IRR CI 95% Bawah CI 95% Atas p-value Signifikan
AreaB 1.0458 1.0026 1.0907 0.0374
AreaC 1.0778 1.0417 1.1153 0.0000
AreaD 1.1572 1.1162 1.1998 0.0000
AreaE 1.2096 1.1532 1.2687 0.0000
AreaF 1.1459 0.9631 1.3603 0.1220
VehPower 1.0158 1.0103 1.0213 0.0000
VehAge 0.9620 0.9598 0.9642 0.0000
DrivAge 1.0066 1.0058 1.0073 0.0000
BonusMalus 1.0228 1.0222 1.0234 0.0000
VehBrandB10 0.9939 0.9241 1.0675 0.8673
VehBrandB11 1.0758 0.9947 1.1617 0.0650
VehBrandB12 1.1350 1.0984 1.1727 0.0000
VehBrandB13 1.0306 0.9503 1.1157 0.4614
VehBrandB14 0.8513 0.7272 0.9891 0.0401
VehBrandB2 0.9890 0.9598 1.0191 0.4701

5.7 Interpretasi Hasil

Template interpretasi: “Setiap kenaikan 1 unit [variabel], rata-rata jumlah klaim [naik/turun] sebesar [(IRR−1)×100]%, dengan variabel lain dikontrol.”

5.8 Visualisasi Rate Prediksi

motor |>
  dplyr::mutate(rate_pred = fitted(fit_poisson) / Exposure) |>
  dplyr::sample_n(3000) |>
  ggplot(aes(x=DrivAge, y=rate_pred)) +
  geom_point(alpha=0.15, color="#a0704a", size=0.7) +
  geom_smooth(method="loess", color="#5c3d2e",
              fill="#e8c9a0", se=TRUE) +
  labs(title="Rate Klaim Prediksi vs Usia Pengemudi",
       x="Usia Pengemudi (tahun)", y="Rate Klaim (klaim/tahun)") +
  theme_minimal(base_size=12) +
  theme(plot.title=element_text(color="#5c3d2e", face="bold"))

5.9 Pemeriksaan Asumsi: Overdispersion

cat("Rasio Deviance/df:", round(deviance(fit_poisson)/df.residual(fit_poisson), 3), "\n")
## Rasio Deviance/df: 0.319
AER::dispersiontest(fit_poisson)
## 
##  Overdispersion test
## 
## data:  fit_poisson
## z = 8.0719, p-value = 3.459e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   1.179784

Cara baca: Jika p-value < 0.05 → terjadi overdispersion → gunakan Negative Binomial (MASS::glm.nb()).

# Jalankan ini jika terjadi overdispersion:
fit_nb <- MASS::glm.nb(
  ClaimNb ~ Area + VehPower + VehAge + DrivAge +
            BonusMalus + VehBrand + VehGas + Density +
            offset(log(Exposure)),
  data = motor
)
summary(fit_nb)

6 Ringkasan Perbandingan Model

Perbandingan Keempat Model Regresi
Aspek Biner Multinomial Ordinal Poisson
Tipe Respons 2 kategori ≥3 nominal ≥3 berurutan Count (0,1,2,…)
Distribusi Binomial Multinomial Multinomial kumulatif Poisson
Link Logit Log-odds vs ref Cumulative logit Log
Fungsi R glm(…, family=binomial) nnet::multinom() MASS::polr() glm(…, family=poisson)
Koefisien dibaca sebagai Odds Ratio (OR) Relative Risk Ratio (RRR) Odds Ratio (OR) Incidence Rate Ratio (IRR)
Asumsi kritis Linearitas log-odds IIA Proportional Odds Equidispersion
Dataset Stroke Prediction Obesity Levels Obesity Levels (ordered) French Motor Claims

7 Referensi

  • Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Wiley.
  • Dobson, A. J., & Barnett, A. G. (2018). An Introduction to Generalized Linear Models. CRC Press.
  • Fedesoriano (2021). Stroke Prediction Dataset. Kaggle.
  • Mehrparvar, F. (2022). Obesity Levels Dataset. Kaggle.
  • Loser, F. (2019). French Motor Claims Datasets freMTPL2freq. Kaggle.