Ganti path di bawah ini sesuai lokasi file CSV di komputermu. Setelah diisi, tidak perlu ubah apapun di seluruh dokumen ini.
# ── GANTI PATH DI BAWAH INI ──────────────────────────────────────────────────
# Contoh Windows : "C:/Users/Nama/Downloads/healthcare-dataset-stroke-data.csv"
# Contoh Mac : "/Users/Nama/Downloads/healthcare-dataset-stroke-data.csv"
# Nama file CSV (jangan diubah kecuali nama filenya berbeda)
file_stroke <- "healthcare-dataset-stroke-data.csv"
file_obesity <- "ObesityDataSet_raw_and_data_sinthetic.csv"
file_motor <- "freMTPL2freq.csv"
# Auto-cari file: folder Rmd > working dir > subfolder umum
.find_csv <- function(fname) {
candidates <- c(
tryCatch(file.path(dirname(knitr::current_input(dir=TRUE)), fname), error=function(e) ""),
fname,
file.path("data", fname),
file.path("Data", fname),
file.path("dataset", fname),
file.path("Dataset", fname),
file.path("Tugas ADK", fname)
)
found <- Filter(file.exists, candidates)
if (length(found) == 0)
stop(paste0("File tidak ditemukan: ", fname,
"
Letakkan file CSV di folder yang SAMA dengan file .Rmd ini."))
normalizePath(found[1])
}
path_stroke <- .find_csv(file_stroke)
path_obesity <- .find_csv(file_obesity)
path_motor <- .find_csv(file_motor)
cat("path_stroke :", path_stroke, "
")## path_stroke : C:\Users\user\OneDrive\Dokumen\Tugas ADK\healthcare-dataset-stroke-data.csv
## path_obesity: C:\Users\user\OneDrive\Dokumen\Tugas ADK\ObesityDataSet_raw_and_data_sinthetic.csv
## path_motor : C:\Users\user\OneDrive\Dokumen\Tugas ADK\freMTPL2freq.csv
Dalam statistika, tidak semua variabel respons berbentuk angka kontinu. Kadang kita ingin memodelkan:
Regresi logistik biner digunakan ketika variabel respons hanya punya dua kemungkinan: ya/tidak, 1/0, sukses/gagal. Model ini memprediksi peluang terjadinya suatu kejadian lewat rumus:
\[\log\left(\frac{P(Y=1)}{P(Y=0)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
Sisi kiri disebut log-odds atau logit. Interpretasi koefisien dilakukan lewat Odds Ratio (OR):
\[OR = e^{\hat\beta}\]
Dataset: Stroke Prediction Dataset — Kaggle
Pertanyaan: Faktor apa yang memengaruhi kemungkinan seseorang mengalami stroke?
| Variabel | Keterangan |
|---|---|
stroke |
Y: 1 = pernah stroke, 0 = tidak |
age |
Usia (tahun) |
gender |
Jenis kelamin |
hypertension |
Riwayat hipertensi (1=ya) |
heart_disease |
Penyakit jantung (1=ya) |
ever_married |
Pernah menikah |
work_type |
Jenis pekerjaan |
avg_glucose_level |
Rata-rata kadar glukosa |
bmi |
Indeks massa tubuh |
smoking_status |
Status merokok |
stroke_raw <- read.csv(path_stroke, stringsAsFactors = FALSE)
stroke <- stroke_raw |>
dplyr::filter(gender != "Other") |>
dplyr::mutate(
bmi = as.numeric(bmi),
bmi = ifelse(is.na(bmi), median(bmi, na.rm=TRUE), bmi),
stroke = as.integer(stroke),
gender = factor(gender),
ever_married = factor(ever_married),
work_type = factor(work_type),
smoking_status = factor(smoking_status)
)
cat("Jumlah baris:", nrow(stroke), "| Jumlah kolom:", ncol(stroke), "\n")## Jumlah baris: 5109 | Jumlah kolom: 12
stroke |>
dplyr::count(stroke) |>
dplyr::mutate(label = ifelse(stroke==1,"Stroke","Tidak Stroke"),
persen = round(n/sum(n)*100,1)) |>
ggplot(aes(x=label, y=n, fill=label)) +
geom_col(width=0.5) +
geom_text(aes(label=paste0(n,"\n(",persen,"%)")),
vjust=-0.4, size=4, color="#3d2b1f") +
scale_fill_manual(values=c("Stroke"="#7b4f2e","Tidak Stroke"="#c9956c")) +
labs(title="Distribusi Variabel Respons: Stroke", x=NULL, y="Jumlah Observasi") +
theme_minimal(base_size=12) +
theme(legend.position="none",
plot.title=element_text(color="#5c3d2e", face="bold"),
panel.grid.major.x=element_blank())fit_biner <- glm(
stroke ~ age + gender + hypertension + heart_disease +
ever_married + work_type + avg_glucose_level +
bmi + smoking_status,
data = stroke,
family = binomial(link = "logit")
)
summary(fit_biner)##
## Call:
## glm(formula = stroke ~ age + gender + hypertension + heart_disease +
## ever_married + work_type + avg_glucose_level + bmi + smoking_status,
## family = binomial(link = "logit"), data = stroke)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.741149 0.782971 -8.610 < 2e-16 ***
## age 0.074921 0.005832 12.847 < 2e-16 ***
## genderMale 0.011478 0.141841 0.081 0.935502
## hypertension 0.401312 0.164910 2.434 0.014953 *
## heart_disease 0.275832 0.191032 1.444 0.148766
## ever_marriedYes -0.189140 0.225219 -0.840 0.401018
## work_typeGovt_job -0.947043 0.835909 -1.133 0.257235
## work_typeNever_worked -10.312641 309.279046 -0.033 0.973400
## work_typePrivate -0.806679 0.819924 -0.984 0.325191
## work_typeSelf-employed -1.184617 0.840661 -1.409 0.158791
## avg_glucose_level 0.004039 0.001198 3.371 0.000749 ***
## bmi 0.001197 0.011321 0.106 0.915830
## smoking_statusnever smoked -0.211545 0.175705 -1.204 0.228598
## smoking_statussmokes 0.115308 0.215282 0.536 0.592225
## smoking_statusUnknown -0.074250 0.208309 -0.356 0.721508
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1990.3 on 5108 degrees of freedom
## Residual deviance: 1581.5 on 5094 degrees of freedom
## AIC: 1611.5
##
## Number of Fisher Scoring iterations: 14
tabel_or <- broom::tidy(fit_biner, exponentiate=TRUE, conf.int=TRUE) |>
dplyr::filter(term != "(Intercept)") |>
dplyr::transmute(
Variabel = term,
OR = round(estimate, 3),
`CI 95% Bawah` = round(conf.low, 3),
`CI 95% Atas` = round(conf.high, 3),
`p-value` = round(p.value, 4),
Signifikan = ifelse(p.value < 0.05, "✔", "")
)
kable(tabel_or, caption="Odds Ratio — Model Logistik Biner (Stroke)") |>
kable_styling(bootstrap_options=c("striped","hover","condensed"),
full_width=FALSE) |>
row_spec(which(tabel_or$Signifikan=="✔"), bold=TRUE, background="#f5ede4")| Variabel | OR | CI 95% Bawah | CI 95% Atas | p-value | Signifikan |
|---|---|---|---|---|---|
| age | 1.078 | 1.066 | 1.090 | 0.0000 | ✔ |
| genderMale | 1.012 | 0.765 | 1.334 | 0.9355 | |
| hypertension | 1.494 | 1.075 | 2.054 | 0.0150 | ✔ |
| heart_disease | 1.318 | 0.899 | 1.902 | 0.1488 | |
| ever_marriedYes | 0.828 | 0.540 | 1.310 | 0.4010 | |
| work_typeGovt_job | 0.388 | 0.089 | 2.737 | 0.2572 | |
| work_typeNever_worked | 0.000 | 0.000 | 0.000 | 0.9734 | |
| work_typePrivate | 0.446 | 0.106 | 3.083 | 0.3252 | |
| work_typeSelf-employed | 0.306 | 0.069 | 2.172 | 0.1588 | |
| avg_glucose_level | 1.004 | 1.002 | 1.006 | 0.0007 | ✔ |
| bmi | 1.001 | 0.979 | 1.023 | 0.9158 | |
| smoking_statusnever smoked | 0.809 | 0.574 | 1.145 | 0.2286 | |
| smoking_statussmokes | 1.122 | 0.732 | 1.705 | 0.5922 | |
| smoking_statusUnknown | 0.928 | 0.614 | 1.393 | 0.7215 |
Template interpretasi: “Dengan mengontrol variabel lain, kenaikan 1 satuan [variabel] akan [meningkatkan/menurunkan] odds stroke sebesar [(OR−1)×100]%.”
Contoh: Jika OR age = 1.073 → setiap bertambah 1 tahun
usia, odds stroke meningkat 7.3%.
Jika OR hypertension = 1.8 → penderita hipertensi
memiliki odds stroke 1.8× lebih tinggi dibanding yang
tidak.
profil_biner <- data.frame(
age = c(45, 65, 75),
gender = factor("Male", levels=levels(stroke$gender)),
hypertension = c(0, 1, 1),
heart_disease = c(0, 0, 1),
ever_married = factor("Yes", levels=levels(stroke$ever_married)),
work_type = factor("Private", levels=levels(stroke$work_type)),
avg_glucose_level = c(90, 130, 180),
bmi = c(25, 28, 31),
smoking_status = factor("never smoked",levels=levels(stroke$smoking_status))
)
profil_biner$Prob_Stroke <- round(
predict(fit_biner, newdata=profil_biner, type="response"), 4)
profil_biner |>
dplyr::select(age, hypertension, heart_disease, avg_glucose_level, Prob_Stroke) |>
kable(caption="Prediksi Probabilitas Stroke per Profil Pasien") |>
kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)| age | hypertension | heart_disease | avg_glucose_level | Prob_Stroke |
|---|---|---|---|---|
| 45 | 0 | 0 | 90 | 0.0152 |
| 65 | 1 | 0 | 130 | 0.1084 |
| 75 | 1 | 1 | 180 | 0.2938 |
## GVIF Df GVIF^(1/(2*Df))
## age 1.453751 1 1.205716
## gender 1.045755 1 1.022622
## hypertension 1.067613 1 1.033253
## heart_disease 1.090645 1 1.044340
## ever_married 1.107310 1 1.052288
## work_type 1.419549 4 1.044765
## avg_glucose_level 1.109793 1 1.053467
## bmi 1.110130 1 1.053627
## smoking_status 1.107225 3 1.017121
Cara baca: Semua nilai VIF < 5 → tidak ada masalah multikolinearitas.
Regresi multinomial digunakan ketika respons punya lebih dari dua kategori yang tidak berurutan (nominal). Model membandingkan setiap kategori terhadap satu kategori referensi.
\[\log\left(\frac{P(Y=k)}{P(Y=K)}\right) = \beta_{0k} + \beta_{1k}X_1 + \cdots\]
Interpretasi lewat Relative Risk Ratio (RRR) = \(e^{\hat\beta_k}\):
Dataset: Obesity Levels — Kaggle
Pertanyaan: Faktor gaya hidup apa yang memengaruhi jenis/tipe obesitas seseorang?
Variabel Y: NObeyesdad — 7 kelas tipe
berat badan (diperlakukan nominal di sini)
Kategori referensi: Normal_Weight
obesity_raw <- read.csv(path_obesity, stringsAsFactors = FALSE)
obesity <- obesity_raw |>
dplyr::mutate(
NObeyesdad = factor(NObeyesdad),
Gender = factor(Gender),
family_history_with_overweight = factor(family_history_with_overweight),
FAVC = factor(FAVC),
CAEC = factor(CAEC),
SMOKE = factor(SMOKE),
SCC = factor(SCC),
CALC = factor(CALC),
MTRANS= factor(MTRANS)
)
# Tetapkan referensi
obesity$NObeyesdad <- relevel(obesity$NObeyesdad, ref="Normal_Weight")
cat("Jumlah baris:", nrow(obesity), "\n")## Jumlah baris: 2111
## Kategori respons: Normal_Weight Insufficient_Weight Obesity_Type_I Obesity_Type_II Obesity_Type_III Overweight_Level_I Overweight_Level_II
obesity |>
dplyr::count(NObeyesdad) |>
dplyr::mutate(persen=round(n/sum(n)*100,1)) |>
ggplot(aes(x=reorder(NObeyesdad,n), y=n, fill=NObeyesdad)) +
geom_col() +
geom_text(aes(label=paste0(n," (",persen,"%)")),
hjust=-0.1, size=3.2, color="#3d2b1f") +
coord_flip() +
scale_fill_manual(values=warna_utama) +
labs(title="Distribusi Kategori Tipe Berat Badan", x=NULL, y="Jumlah") +
theme_minimal(base_size=11) +
theme(legend.position="none",
plot.title=element_text(color="#5c3d2e", face="bold")) +
expand_limits(y=max(table(obesity$NObeyesdad))*1.3)fit_multi <- nnet::multinom(
NObeyesdad ~ Gender + Age + Height + Weight +
family_history_with_overweight + FAVC +
FCVC + NCP + SMOKE + CH2O + FAF + MTRANS,
data = obesity,
trace = FALSE
)
summary(fit_multi)## Call:
## nnet::multinom(formula = NObeyesdad ~ Gender + Age + Height +
## Weight + family_history_with_overweight + FAVC + FCVC + NCP +
## SMOKE + CH2O + FAF + MTRANS, data = obesity, trace = FALSE)
##
## Coefficients:
## (Intercept) GenderMale Age Height Weight
## Insufficient_Weight -95.70236 -1.273058 -0.08425994 128.4207 -2.460110
## Obesity_Type_I 263.32940 -5.893068 0.31526124 -367.5787 4.414481
## Obesity_Type_II 225.89468 11.783111 2.22059231 -545.3063 7.463585
## Obesity_Type_III -116.01933 -48.566005 -1.08453395 -338.5186 6.546760
## Overweight_Level_I 89.79076 -1.895187 0.00188651 -108.1353 1.342047
## Overweight_Level_II 168.52713 -1.944015 0.16869372 -224.0748 2.778598
## family_history_with_overweightyes FAVCyes FCVC
## Insufficient_Weight 1.82957271 1.1426820 1.9502407
## Obesity_Type_I 3.20423375 0.5682154 0.3731138
## Obesity_Type_II 9.93588716 -25.5192673 -2.5003571
## Obesity_Type_III -26.25658951 27.3391802 46.3914153
## Overweight_Level_I -0.06536155 1.2207695 -0.5141224
## Overweight_Level_II 2.80185202 -0.6592257 -0.1535333
## NCP SMOKEyes CH2O FAF
## Insufficient_Weight 0.81533976 -2.2213139 2.2471063 0.1775288
## Obesity_Type_I -0.38321551 0.6227479 -0.1006173 -1.0628420
## Obesity_Type_II 0.25549994 -10.3451950 -15.7188832 -7.0170587
## Obesity_Type_III 15.94889911 -9.2785165 -5.9690961 -15.9000831
## Overweight_Level_I -0.08990469 -2.3201564 -0.3302217 -0.2348865
## Overweight_Level_II -0.56368318 -2.3913814 -0.2806610 -0.5472741
## MTRANSBike MTRANSMotorbike MTRANSPublic_Transportation
## Insufficient_Weight -98.1185940 -96.4155325 -2.0055661
## Obesity_Type_I -69.3373813 0.9058887 5.1688312
## Obesity_Type_II 15.3261408 -61.5621812 21.2027563
## Obesity_Type_III 46.9222141 3.9533022 6.3612255
## Overweight_Level_I 0.7992455 -3.4189088 0.1185843
## Overweight_Level_II -130.3906301 -2.5703507 2.5815644
## MTRANSWalking
## Insufficient_Weight 3.1061742
## Obesity_Type_I -2.4993637
## Obesity_Type_II 6.0624484
## Obesity_Type_III 28.5261547
## Overweight_Level_I -0.1806994
## Overweight_Level_II -3.9610614
##
## Std. Errors:
## (Intercept) GenderMale Age Height Weight
## Insufficient_Weight 3.3882852 1.1034959 0.11027381 2.682253 0.17556006
## Obesity_Type_I 7.7412417 1.8675877 0.12042646 3.601233 0.14694835
## Obesity_Type_II 1.9913697 6.2292105 0.22056342 1.204080 0.12199060
## Obesity_Type_III 0.8108081 2.4191704 0.76749562 1.312614 0.46946770
## Overweight_Level_I 5.6882194 0.7775832 0.03979964 6.143633 0.09014147
## Overweight_Level_II 4.6768925 1.0504432 0.07427981 5.926989 0.10981510
## family_history_with_overweightyes FAVCyes FCVC
## Insufficient_Weight 1.0293479 1.0355895 0.8213216
## Obesity_Type_I 1.8910832 1.5077941 1.1718407
## Obesity_Type_II 3.1313750 2.3428184 1.8736325
## Obesity_Type_III 2.7628832 4.2410473 2.5697159
## Overweight_Level_I 0.5512685 0.6778299 0.4634032
## Overweight_Level_II 1.0009781 0.9414146 0.7561710
## NCP SMOKEyes CH2O FAF MTRANSBike
## Insufficient_Weight 0.5872139 5.8243068 0.9135471 0.4836683 NaN
## Obesity_Type_I 0.7472319 3.4237880 0.9304676 0.6949466 2.060265e-13
## Obesity_Type_II 1.2368859 7.0188747 1.9728235 1.2912772 NaN
## Obesity_Type_III 2.3865470 0.3428527 7.4987085 3.0383142 2.633521e-15
## Overweight_Level_I 0.3164128 1.9694747 0.4231180 0.2893308 2.498476e+00
## Overweight_Level_II 0.4395546 2.4277830 0.6238588 0.4158433 7.358191e-59
## MTRANSMotorbike MTRANSPublic_Transportation MTRANSWalking
## Insufficient_Weight NaN 1.0991830 5.207919e+00
## Obesity_Type_I 1.075410e+01 2.2407594 3.111064e+00
## Obesity_Type_II NaN 2.3434905 5.615480e-08
## Obesity_Type_III 2.480674e-15 4.1722832 1.400897e+00
## Overweight_Level_I 4.509724e+00 0.6960332 1.011338e+00
## Overweight_Level_II 6.179924e+00 1.2034489 1.935702e+00
##
## Residual Deviance: 415.3648
## AIC: 607.3648
z_stat <- coef(fit_multi) / summary(fit_multi)$standard.errors
p_val <- round(2*(1-pnorm(abs(z_stat))), 4)
p_val[, 1:6] |>
kable(caption="p-value Koefisien Multinomial (6 variabel pertama)") |>
kable_styling(bootstrap_options=c("striped","hover","condensed"), font_size=11)| (Intercept) | GenderMale | Age | Height | Weight | family_history_with_overweightyes | |
|---|---|---|---|---|---|---|
| Insufficient_Weight | 0 | 0.2486 | 0.4448 | 0 | 0 | 0.0755 |
| Obesity_Type_I | 0 | 0.0016 | 0.0088 | 0 | 0 | 0.0902 |
| Obesity_Type_II | 0 | 0.0585 | 0.0000 | 0 | 0 | 0.0015 |
| Obesity_Type_III | 0 | 0.0000 | 0.1576 | 0 | 0 | 0.0000 |
| Overweight_Level_I | 0 | 0.0148 | 0.9622 | 0 | 0 | 0.9056 |
| Overweight_Level_II | 0 | 0.0642 | 0.0231 | 0 | 0 | 0.0051 |
round(exp(coef(fit_multi))[, 1:5], 3) |>
kable(caption="RRR — Referensi: Normal_Weight (5 variabel pertama)") |>
kable_styling(bootstrap_options=c("striped","hover","condensed"), font_size=11)| (Intercept) | GenderMale | Age | Height | Weight | |
|---|---|---|---|---|---|
| Insufficient_Weight | 0.000000e+00 | 0.280 | 0.919 | 5.920834e+55 | 0.085 |
| Obesity_Type_I | 2.304126e+114 | 0.003 | 1.371 | 0.000000e+00 | 82.639 |
| Obesity_Type_II | 1.272951e+98 | 131020.715 | 9.213 | 0.000000e+00 | 1743.387 |
| Obesity_Type_III | 0.000000e+00 | 0.000 | 0.338 | 0.000000e+00 | 696.982 |
| Overweight_Level_I | 9.899954e+38 | 0.150 | 1.002 | 0.000000e+00 | 3.827 |
| Overweight_Level_II | 1.550254e+73 | 0.143 | 1.184 | 0.000000e+00 | 16.096 |
Template interpretasi: “Dibanding Normal_Weight, seseorang dengan [kondisi X] memiliki risiko relatif [RRR]× lebih [tinggi/rendah] untuk masuk kategori [Y], dengan variabel lain dikontrol.”
profil_multi <- data.frame(
Gender = factor(c("Male","Female"), levels=levels(obesity$Gender)),
Age=c(25,35), Height=c(1.75,1.60), Weight=c(90,75),
family_history_with_overweight=factor("yes",
levels=levels(obesity$family_history_with_overweight)),
FAVC = factor("yes", levels=levels(obesity$FAVC)),
FCVC=2, NCP=3,
SMOKE = factor("no", levels=levels(obesity$SMOKE)),
CH2O=2, FAF=1,
MTRANS= factor("Public_Transportation", levels=levels(obesity$MTRANS))
)
pred_multi <- round(predict(fit_multi, newdata=profil_multi, type="probs"), 4)
rownames(pred_multi) <- c("A: Laki-laki 25th 90kg","B: Perempuan 35th 75kg")
pred_multi |>
kable(caption="Prediksi Probabilitas Tipe Obesitas") |>
kable_styling(bootstrap_options=c("striped","hover"), font_size=11)| Normal_Weight | Insufficient_Weight | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | Overweight_Level_I | Overweight_Level_II | |
|---|---|---|---|---|---|---|---|
| A: Laki-laki 25th 90kg | 0 | 0 | 0.0235 | 0 | 0 | 0e+00 | 0.9765 |
| B: Perempuan 35th 75kg | 0 | 0 | 0.2097 | 0 | 0 | 1e-04 | 0.7902 |
as.data.frame(pred_multi) |>
tibble::rownames_to_column("Profil") |>
tidyr::pivot_longer(-Profil, names_to="Kategori", values_to="Probabilitas") |>
dplyr::mutate(Kategori=factor(Kategori, levels=c(
"Insufficient_Weight","Normal_Weight","Overweight_Level_I",
"Overweight_Level_II","Obesity_Type_I","Obesity_Type_II","Obesity_Type_III"
))) |>
ggplot(aes(x=Kategori, y=Probabilitas, fill=Profil)) +
geom_col(position="dodge") +
scale_fill_manual(values=c("#a0704a","#c9956c")) +
labs(title="Probabilitas Prediksi Tipe Obesitas per Profil", x=NULL) +
theme_minimal(base_size=10) +
theme(axis.text.x=element_text(angle=30, hjust=1),
plot.title=element_text(color="#5c3d2e", face="bold"),
legend.position="top")Regresi ordinal digunakan ketika kategori respons punya urutan yang bermakna — misalnya kurus < normal < gemuk < obesitas. Model yang paling umum adalah Proportional Odds Model:
\[\log\left(\frac{P(Y \leq k)}{P(Y > k)}\right) = \alpha_k - (\beta_1 X_1 + \cdots + \beta_p X_p)\]
Interpretasi lewat OR = \(e^{\hat\beta}\):
Dataset: Obesity Levels — Kaggle (dataset yang sama, perlakuan berbeda)
Pertanyaan: Faktor gaya hidup apa yang memengaruhi tingkat keparahan obesitas seseorang?
Variabel Y: NObeyesdad — diperlakukan
ordinal dengan urutan:
\[\text{Insufficient} < \text{Normal} < \text{Overweight I} < \text{Overweight II} < \text{Obesity I} < \text{Obesity II} < \text{Obesity III}\]
# Gunakan data obesity yang sama, tapi Y-nya dijadikan factor ordered
obesity_ord <- obesity_raw |>
dplyr::mutate(
# Definisikan urutan kategori secara eksplisit
NObeyesdad = factor(NObeyesdad,
levels = c("Insufficient_Weight","Normal_Weight",
"Overweight_Level_I","Overweight_Level_II",
"Obesity_Type_I","Obesity_Type_II","Obesity_Type_III"),
ordered = TRUE), # <-- ordered=TRUE agar R tahu ini ordinal!
Gender = factor(Gender),
family_history_with_overweight = factor(family_history_with_overweight),
FAVC = factor(FAVC),
SMOKE = factor(SMOKE),
CAEC = factor(CAEC),
CALC = factor(CALC),
SCC = factor(SCC),
MTRANS= factor(MTRANS)
)
cat("Urutan kategori:\n")## Urutan kategori:
## [1] "Insufficient_Weight" "Normal_Weight" "Overweight_Level_I"
## [4] "Overweight_Level_II" "Obesity_Type_I" "Obesity_Type_II"
## [7] "Obesity_Type_III"
obesity_ord |>
dplyr::count(NObeyesdad) |>
dplyr::mutate(persen=round(n/sum(n)*100,1)) |>
ggplot(aes(x=NObeyesdad, y=n, fill=NObeyesdad)) +
geom_col(width=0.55) +
geom_text(aes(label=paste0(n,"\n(",persen,"%)")),
vjust=-0.4, size=3.5, color="#3d2b1f") +
scale_fill_manual(values=warna_utama) +
labs(title="Distribusi Tingkat Berat Badan (Ordinal)",
x=NULL, y="Jumlah Observasi") +
theme_minimal(base_size=11) +
theme(legend.position="none",
axis.text.x=element_text(angle=25, hjust=1),
plot.title=element_text(color="#5c3d2e", face="bold"))# Scale variabel numerik agar optimisasi lebih stabil
obesity_ord_sc <- obesity_ord |>
dplyr::mutate(dplyr::across(c(Age, Height, Weight, FCVC, NCP, CH2O, FAF),
~ as.numeric(scale(.))))
# Menggunakan ordinal::clm (lebih stabil secara numerik dibanding MASS::polr)
# Keduanya menghasilkan Proportional Odds Model yang sama
fit_ordinal <- ordinal::clm(
NObeyesdad ~ Gender + Age + Height + Weight +
family_history_with_overweight + FAVC +
FCVC + NCP + SMOKE + CH2O + FAF + MTRANS,
data = obesity_ord_sc,
link = "logit"
)
summary(fit_ordinal)## formula:
## NObeyesdad ~ Gender + Age + Height + Weight + family_history_with_overweight + FAVC + FCVC + NCP + SMOKE + CH2O + FAF + MTRANS
## data: obesity_ord_sc
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 2111 -524.00 1090.00 10(0) 7.96e-11 2.1e+03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## GenderMale -1.77475 0.26740 -6.637 3.20e-11 ***
## Age 0.26650 0.10930 2.438 0.014760 *
## Height -7.78292 0.33639 -23.137 < 2e-16 ***
## Weight 25.89478 1.09454 23.658 < 2e-16 ***
## family_history_with_overweightyes 1.51972 0.26653 5.702 1.18e-08 ***
## FAVCyes 0.07068 0.25052 0.282 0.777846
## FCVC 0.19128 0.09415 2.032 0.042197 *
## NCP 0.28837 0.08413 3.428 0.000608 ***
## SMOKEyes -1.61665 0.45355 -3.564 0.000365 ***
## CH2O -0.27110 0.08682 -3.123 0.001793 **
## FAF -0.17583 0.09430 -1.865 0.062250 .
## MTRANSBike 3.43432 1.17463 2.924 0.003458 **
## MTRANSMotorbike 2.75443 1.07296 2.567 0.010255 *
## MTRANSPublic_Transportation 1.75473 0.26488 6.625 3.48e-11 ***
## MTRANSWalking 0.74913 0.56215 1.333 0.182657
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Insufficient_Weight|Normal_Weight -28.5042 1.2997 -21.932
## Normal_Weight|Overweight_Level_I -13.2652 0.7051 -18.812
## Overweight_Level_I|Overweight_Level_II -6.6876 0.5462 -12.244
## Overweight_Level_II|Obesity_Type_I 1.9040 0.4872 3.908
## Obesity_Type_I|Obesity_Type_II 16.7557 0.8824 18.989
## Obesity_Type_II|Obesity_Type_III 27.9193 1.2427 22.467
# ordinal::clm menghasilkan kolom "z value" dan "Pr(>|z|)"
coef_tbl <- coef(summary(fit_ordinal))
# Ambil hanya baris koefisien prediktor (bukan threshold/intercept)
beta_rows <- !grepl("\\|", rownames(coef_tbl))
coef_beta <- coef_tbl[beta_rows, , drop=FALSE]
data.frame(
Variabel = rownames(coef_beta),
Koefisien = round(coef_beta[,"Estimate"], 4),
SE = round(coef_beta[,"Std. Error"], 4),
OR = round(exp(coef_beta[,"Estimate"]), 4),
`p-value` = round(coef_beta[,"Pr(>|z|)"], 4),
Signifikan = ifelse(coef_beta[,"Pr(>|z|)"] < 0.05, "✔", ""),
check.names= FALSE
) |>
kable(caption="Koefisien dan Odds Ratio — Model Ordinal") |>
kable_styling(bootstrap_options=c("striped","hover","condensed"))| Variabel | Koefisien | SE | OR | p-value | Signifikan | |
|---|---|---|---|---|---|---|
| GenderMale | GenderMale | -1.7748 | 0.2674 | 1.695000e-01 | 0.0000 | ✔ |
| Age | Age | 0.2665 | 0.1093 | 1.305400e+00 | 0.0148 | ✔ |
| Height | Height | -7.7829 | 0.3364 | 4.000000e-04 | 0.0000 | ✔ |
| Weight | Weight | 25.8948 | 1.0945 | 1.761817e+11 | 0.0000 | ✔ |
| family_history_with_overweightyes | family_history_with_overweightyes | 1.5197 | 0.2665 | 4.570900e+00 | 0.0000 | ✔ |
| FAVCyes | FAVCyes | 0.0707 | 0.2505 | 1.073200e+00 | 0.7778 | |
| FCVC | FCVC | 0.1913 | 0.0942 | 1.210800e+00 | 0.0422 | ✔ |
| NCP | NCP | 0.2884 | 0.0841 | 1.334200e+00 | 0.0006 | ✔ |
| SMOKEyes | SMOKEyes | -1.6167 | 0.4535 | 1.986000e-01 | 0.0004 | ✔ |
| CH2O | CH2O | -0.2711 | 0.0868 | 7.625000e-01 | 0.0018 | ✔ |
| FAF | FAF | -0.1758 | 0.0943 | 8.388000e-01 | 0.0622 | |
| MTRANSBike | MTRANSBike | 3.4343 | 1.1746 | 3.101040e+01 | 0.0035 | ✔ |
| MTRANSMotorbike | MTRANSMotorbike | 2.7544 | 1.0730 | 1.571200e+01 | 0.0103 | ✔ |
| MTRANSPublic_Transportation | MTRANSPublic_Transportation | 1.7547 | 0.2649 | 5.781900e+00 | 0.0000 | ✔ |
| MTRANSWalking | MTRANSWalking | 0.7491 | 0.5621 | 2.115200e+00 | 0.1827 |
Template interpretasi: “Setiap kenaikan 1 unit [variabel], odds berada di kategori berat badan lebih tinggi [meningkat/menurun] sebesar [(OR−1)×100]%, dengan variabel lain dikontrol.”
Contoh: Jika OR Weight = 1.15 → setiap kenaikan 1 kg
berat badan, odds masuk kategori lebih tinggi (lebih gemuk)
meningkat 15%.
# Helper: scale nilai baru menggunakan mean & sd dari data training
.sc <- function(x, col) (x - mean(obesity_ord[[col]], na.rm=TRUE)) / sd(obesity_ord[[col]], na.rm=TRUE)
profil_ord <- data.frame(
Gender = factor(c("Male","Female","Male"),
levels=levels(obesity_ord$Gender)),
Age = .sc(c(22,35,50), "Age"),
Height = .sc(c(1.75,1.60,1.70), "Height"),
Weight = .sc(c(65,80,100), "Weight"),
family_history_with_overweight=factor("yes",
levels=levels(obesity_ord$family_history_with_overweight)),
FAVC = factor("yes", levels=levels(obesity_ord$FAVC)),
FCVC = .sc(2, "FCVC"),
NCP = .sc(3, "NCP"),
SMOKE = factor("no", levels=levels(obesity_ord$SMOKE)),
CH2O = .sc(2, "CH2O"),
FAF = .sc(1, "FAF"),
MTRANS= factor("Public_Transportation",
levels=levels(obesity_ord$MTRANS))
)
pred_ord <- round(predict(fit_ordinal, newdata=profil_ord, type="prob")$fit, 4)
rownames(pred_ord) <- c("A: 22th 65kg","B: 35th 80kg","C: 50th 100kg")
pred_ord |>
kable(caption="Prediksi Probabilitas Tingkat Berat Badan per Profil") |>
kable_styling(bootstrap_options=c("striped","hover"), font_size=11)| Insufficient_Weight | Normal_Weight | Overweight_Level_I | Overweight_Level_II | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | |
|---|---|---|---|---|---|---|---|
| A: 22th 65kg | 0.0102 | 0.9898 | 0 | 0.0000 | 0.0000 | 0.0000 | 0 |
| B: 35th 80kg | 0.0000 | 0.0000 | 0 | 0.0212 | 0.9788 | 0.0000 | 0 |
| C: 50th 100kg | 0.0000 | 0.0000 | 0 | 0.0000 | 0.6753 | 0.3246 | 0 |
as.data.frame(pred_ord) |>
tibble::rownames_to_column("Profil") |>
tidyr::pivot_longer(-Profil, names_to="Kategori", values_to="Probabilitas") |>
dplyr::mutate(Kategori=factor(Kategori, levels=c(
"Insufficient_Weight","Normal_Weight","Overweight_Level_I",
"Overweight_Level_II","Obesity_Type_I","Obesity_Type_II","Obesity_Type_III"
))) |>
ggplot(aes(x=Kategori, y=Probabilitas, fill=Kategori)) +
geom_col() +
facet_wrap(~Profil) +
scale_fill_manual(values=warna_utama) +
labs(title="Prediksi Probabilitas Tingkat Berat Badan per Profil", x=NULL) +
theme_minimal(base_size=9) +
theme(axis.text.x=element_text(angle=35, hjust=1),
legend.position="none",
plot.title=element_text(color="#5c3d2e", face="bold"),
strip.text=element_text(color="#5c3d2e", face="bold"))Regresi Poisson digunakan untuk data hitung (count data): jumlah kejadian dalam suatu periode. Model menggunakan log link:
\[\log(\mu_i) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
Jika tiap unit punya durasi observasi berbeda, tambahkan offset:
\[\log(\mu_i) = \log(t_i) + \beta_0 + \beta_1 X_1 + \cdots\]
Interpretasi lewat Incidence Rate Ratio (IRR) = \(e^{\hat\beta}\):
Dataset: French Motor Claims — freMTPL2freq — Kaggle
Pertanyaan: Faktor apa yang memengaruhi frekuensi klaim asuransi kendaraan bermotor?
| Variabel | Keterangan |
|---|---|
ClaimNb |
Y: jumlah klaim per polis (count) |
Exposure |
Durasi polis dalam tahun (dipakai sebagai offset) |
Area |
Zona wilayah (A–F) |
VehPower |
Tenaga kendaraan |
VehAge |
Usia kendaraan |
DrivAge |
Usia pengemudi |
BonusMalus |
Skor riwayat klaim (100=netral, >100=berisiko) |
VehBrand |
Merek kendaraan |
Density |
Kepadatan penduduk |
motor_raw <- read.csv(path_motor, stringsAsFactors = FALSE)
motor <- motor_raw |>
dplyr::mutate(
Area = factor(Area),
VehBrand = factor(VehBrand),
VehGas = factor(VehGas),
Region = factor(Region),
Exposure = pmin(Exposure, 1),
ClaimNb = as.integer(ClaimNb)
) |>
dplyr::filter(ClaimNb <= 4)
cat("Baris:", nrow(motor),
"| Rata-rata klaim:", round(mean(motor$ClaimNb),4),
"| Variansi:", round(var(motor$ClaimNb),4), "\n")## Baris: 678004 | Rata-rata klaim: 0.0531 | Variansi: 0.0564
motor |>
dplyr::count(ClaimNb) |>
dplyr::mutate(persen=round(n/sum(n)*100,2)) |>
ggplot(aes(x=factor(ClaimNb), y=n, fill=factor(ClaimNb))) +
geom_col(width=0.55) +
geom_text(aes(label=paste0(n,"\n(",persen,"%)")),
vjust=-0.3, size=3.5, color="#3d2b1f") +
scale_fill_manual(values=c("0"="#e8c9a0","1"="#c9956c",
"2"="#a0704a","3"="#7b4f2e","4"="#5c3d2e")) +
labs(title="Distribusi Jumlah Klaim Asuransi per Polis",
x="Jumlah Klaim", y="Jumlah Polis") +
theme_minimal(base_size=12) +
theme(legend.position="none",
plot.title=element_text(color="#5c3d2e", face="bold"))fit_poisson <- glm(
ClaimNb ~ Area + VehPower + VehAge + DrivAge +
BonusMalus + VehBrand + VehGas + Density +
offset(log(Exposure)),
data = motor,
family = poisson(link = "log")
)
summary(fit_poisson)##
## Call:
## glm(formula = ClaimNb ~ Area + VehPower + VehAge + DrivAge +
## BonusMalus + VehBrand + VehGas + Density + offset(log(Exposure)),
## family = poisson(link = "log"), data = motor)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.945e+00 4.093e-02 -96.392 < 2e-16 ***
## AreaB 4.474e-02 2.149e-02 2.082 0.0374 *
## AreaC 7.494e-02 1.740e-02 4.307 1.65e-05 ***
## AreaD 1.460e-01 1.841e-02 7.930 2.20e-15 ***
## AreaE 1.903e-01 2.435e-02 7.816 5.46e-15 ***
## AreaF 1.362e-01 8.810e-02 1.546 0.1220
## VehPower 1.565e-02 2.753e-03 5.687 1.30e-08 ***
## VehAge -3.875e-02 1.165e-03 -33.278 < 2e-16 ***
## DrivAge 6.549e-03 3.936e-04 16.640 < 2e-16 ***
## BonusMalus 2.254e-02 3.070e-04 73.420 < 2e-16 ***
## VehBrandB10 -6.149e-03 3.681e-02 -0.167 0.8673
## VehBrandB11 7.308e-02 3.960e-02 1.845 0.0650 .
## VehBrandB12 1.266e-01 1.671e-02 7.576 3.56e-14 ***
## VehBrandB13 3.014e-02 4.092e-02 0.737 0.4614
## VehBrandB14 -1.609e-01 7.842e-02 -2.052 0.0401 *
## VehBrandB2 -1.105e-02 1.530e-02 -0.722 0.4701
## VehBrandB3 -4.151e-03 2.187e-02 -0.190 0.8494
## VehBrandB4 -3.927e-02 2.973e-02 -1.321 0.1865
## VehBrandB5 6.250e-02 2.475e-02 2.526 0.0116 *
## VehBrandB6 -4.562e-02 2.836e-02 -1.609 0.1077
## VehGasRegular 5.295e-02 1.094e-02 4.842 1.28e-06 ***
## Density 1.626e-06 3.651e-06 0.445 0.6560
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 223632 on 678003 degrees of freedom
## Residual deviance: 216567 on 677982 degrees of freedom
## AIC: 285898
##
## Number of Fisher Scoring iterations: 6
tabel_irr <- broom::tidy(fit_poisson, exponentiate=TRUE, conf.int=TRUE) |>
dplyr::filter(term != "(Intercept)") |>
dplyr::transmute(
Variabel = term,
IRR = round(estimate, 4),
`CI 95% Bawah` = round(conf.low, 4),
`CI 95% Atas` = round(conf.high, 4),
`p-value` = round(p.value, 4),
Signifikan = ifelse(p.value < 0.05, "✔", "")
) |>
head(15)
kable(tabel_irr, caption="Incidence Rate Ratio — Model Poisson (15 teratas)") |>
kable_styling(bootstrap_options=c("striped","hover","condensed")) |>
row_spec(which(tabel_irr$Signifikan=="✔"), bold=TRUE, background="#f5ede4")| Variabel | IRR | CI 95% Bawah | CI 95% Atas | p-value | Signifikan |
|---|---|---|---|---|---|
| AreaB | 1.0458 | 1.0026 | 1.0907 | 0.0374 | ✔ |
| AreaC | 1.0778 | 1.0417 | 1.1153 | 0.0000 | ✔ |
| AreaD | 1.1572 | 1.1162 | 1.1998 | 0.0000 | ✔ |
| AreaE | 1.2096 | 1.1532 | 1.2687 | 0.0000 | ✔ |
| AreaF | 1.1459 | 0.9631 | 1.3603 | 0.1220 | |
| VehPower | 1.0158 | 1.0103 | 1.0213 | 0.0000 | ✔ |
| VehAge | 0.9620 | 0.9598 | 0.9642 | 0.0000 | ✔ |
| DrivAge | 1.0066 | 1.0058 | 1.0073 | 0.0000 | ✔ |
| BonusMalus | 1.0228 | 1.0222 | 1.0234 | 0.0000 | ✔ |
| VehBrandB10 | 0.9939 | 0.9241 | 1.0675 | 0.8673 | |
| VehBrandB11 | 1.0758 | 0.9947 | 1.1617 | 0.0650 | |
| VehBrandB12 | 1.1350 | 1.0984 | 1.1727 | 0.0000 | ✔ |
| VehBrandB13 | 1.0306 | 0.9503 | 1.1157 | 0.4614 | |
| VehBrandB14 | 0.8513 | 0.7272 | 0.9891 | 0.0401 | ✔ |
| VehBrandB2 | 0.9890 | 0.9598 | 1.0191 | 0.4701 |
Template interpretasi: “Setiap kenaikan 1 unit [variabel], rata-rata jumlah klaim [naik/turun] sebesar [(IRR−1)×100]%, dengan variabel lain dikontrol.”
motor |>
dplyr::mutate(rate_pred = fitted(fit_poisson) / Exposure) |>
dplyr::sample_n(3000) |>
ggplot(aes(x=DrivAge, y=rate_pred)) +
geom_point(alpha=0.15, color="#a0704a", size=0.7) +
geom_smooth(method="loess", color="#5c3d2e",
fill="#e8c9a0", se=TRUE) +
labs(title="Rate Klaim Prediksi vs Usia Pengemudi",
x="Usia Pengemudi (tahun)", y="Rate Klaim (klaim/tahun)") +
theme_minimal(base_size=12) +
theme(plot.title=element_text(color="#5c3d2e", face="bold"))## Rasio Deviance/df: 0.319
##
## Overdispersion test
##
## data: fit_poisson
## z = 8.0719, p-value = 3.459e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 1.179784
Cara baca: Jika p-value < 0.05 → terjadi overdispersion → gunakan Negative Binomial (
MASS::glm.nb()).
# Jalankan ini jika terjadi overdispersion:
fit_nb <- MASS::glm.nb(
ClaimNb ~ Area + VehPower + VehAge + DrivAge +
BonusMalus + VehBrand + VehGas + Density +
offset(log(Exposure)),
data = motor
)
summary(fit_nb)| Aspek | Biner | Multinomial | Ordinal | Poisson |
|---|---|---|---|---|
| Tipe Respons | 2 kategori | ≥3 nominal | ≥3 berurutan | Count (0,1,2,…) |
| Distribusi | Binomial | Multinomial | Multinomial kumulatif | Poisson |
| Link | Logit | Log-odds vs ref | Cumulative logit | Log |
| Fungsi R | glm(…, family=binomial) | nnet::multinom() | MASS::polr() | glm(…, family=poisson) |
| Koefisien dibaca sebagai | Odds Ratio (OR) | Relative Risk Ratio (RRR) | Odds Ratio (OR) | Incidence Rate Ratio (IRR) |
| Asumsi kritis | Linearitas log-odds | IIA | Proportional Odds | Equidispersion |
| Dataset | Stroke Prediction | Obesity Levels | Obesity Levels (ordered) | French Motor Claims |