Analisis Ordinal Logistic Regression digunakan untuk memodelkan hubungan antara variabel independen dengan variabel dependen yang bersifat ordinal (bertingkat). Metode ini membantu mengetahui pengaruh faktor-faktor tertentu terhadap kemungkinan suatu kategori hasil.
Sebelum analisis dilakukan, terdapat beberapa asumsi yang perlu dipenuhi, seperti tidak adanya multikolinearitas dan terpenuhinya asumsi proportional odds.
Dataset yang digunakan adalah Factors Affecting Children Anemia Level dari Kaggle (https://www.kaggle.com/datasets/adeolaadesina/factors-affecting-children-anemia-level/data),
## Age.in.5.year.groups Type.of.place.of.residence Highest.educational.level
## 1 40-44 Urban Higher
## 2 35-39 Urban Higher
## 3 25-29 Urban Higher
## 4 25-29 Urban Secondary
## 5 20-24 Urban Secondary
## Wealth.index.combined Births.in.last.five.years
## 1 Richest 1
## 2 Richest 1
## 3 Richest 1
## 4 Richest 1
## 5 Richest 1
## Age.of.respondent.at.1st.birth
## 1 22
## 2 28
## 3 26
## 4 25
## 5 21
## Hemoglobin.level.adjusted.for.altitude.and.smoking..g.dl...1.decimal.
## 1 NA
## 2 NA
## 3 NA
## 4 95
## 5 NA
## Anemia.level
## 1
## 2
## 3
## 4 Moderate
## 5
## Have.mosquito.bed.net.for.sleeping..from.household.questionnaire.
## 1 Yes
## 2 Yes
## 3 No
## 4 Yes
## 5 Yes
## Smokes.cigarettes Current.marital.status
## 1 No Living with partner
## 2 No Married
## 3 No Married
## 4 No Married
## 5 No No longer living together/separated
## Currently.residing.with.husband.partner When.child.put.to.breast
## 1 Staying elsewhere Immediately
## 2 Living with her Hours: 1
## 3 Living with her Immediately
## 4 Living with her 105.0
## 5 Immediately
## Had.fever.in.last.two.weeks
## 1 No
## 2 No
## 3 No
## 4 No
## 5 No
## Hemoglobin.level.adjusted.for.altitude..g.dl...1.decimal. Anemia.level.1
## 1 NA
## 2 NA
## 3 NA
## 4 114 Not anemic
## 5 NA
## Taking.iron.pills..sprinkles.or.syrup
## 1 Yes
## 2 No
## 3 No
## 4 No
## 5 No
# Rename kolom agar lebih mudah dipakai
colnames(df) <- c(
"age_group", "residence", "education", "wealth",
"births_5yr", "age_1st_birth", "hb_alt_smoke",
"anemia_level",
"mosquito_net", "smokes", "marital_status",
"residing_partner", "breastfeed_time",
"had_fever", "hb_alt", "anemia_level2",
"iron_pills"
)df_clean <- df %>%
select(
anemia_level, # Variabel Dependen (Y)
residence, # Prediktor kategorik
education,
wealth,
mosquito_net,
smokes,
had_fever,
iron_pills,
births_5yr, # Prediktor numerik
age_1st_birth,
hb_alt
) %>%
na.omit()## anemia_level residence education wealth mosquito_net smokes had_fever
## 4 Moderate Urban Secondary Richest Yes No No
## 6 Mild Urban Higher Richest Yes No No
## 7 Not anemic Urban Secondary Richest Yes No No
## 10 Moderate Urban Secondary Richest Yes No No
## 13 Mild Urban Higher Richest Yes No No
## iron_pills births_5yr age_1st_birth hb_alt
## 4 No 1 25 114
## 6 No 1 30 119
## 7 Yes 2 32 102
## 10 Yes 1 19 113
## 13 No 1 24 109
df_clean$anemia_level <- factor(
df_clean$anemia_level,
levels = c("Not anemic", "Mild", "Moderate", "Severe"),
ordered = TRUE
)
cat_vars <- c("residence", "education", "wealth",
"mosquito_net", "smokes", "had_fever", "iron_pills")
num_vars <- c("births_5yr", "age_1st_birth", "hb_alt")
df_clean[cat_vars] <- lapply(df_clean[cat_vars], as.factor)
str(df_clean)## 'data.frame': 10182 obs. of 11 variables:
## $ anemia_level : Ord.factor w/ 4 levels "Not anemic"<"Mild"<..: 3 2 1 3 2 2 2 1 3 3 ...
## $ residence : Factor w/ 2 levels "Rural","Urban": 2 2 2 2 2 2 2 2 2 2 ...
## $ education : Factor w/ 4 levels "Higher","No education",..: 4 1 4 4 1 1 1 1 4 4 ...
## $ wealth : Factor w/ 5 levels "Middle","Poorer",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ mosquito_net : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ smokes : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ had_fever : Factor w/ 3 levels "Don't know","No",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ iron_pills : Factor w/ 3 levels "Don't know","No",..: 2 2 3 3 2 3 2 2 3 3 ...
## $ births_5yr : int 1 1 2 1 1 2 2 2 2 2 ...
## $ age_1st_birth: int 25 30 32 19 24 19 19 22 22 22 ...
## $ hb_alt : num 114 119 102 113 109 96 111 117 96 106 ...
## - attr(*, "na.action")= 'omit' Named int [1:23742] 1 2 3 5 8 9 11 12 16 17 ...
## ..- attr(*, "names")= chr [1:23742] "1" "2" "3" "5" ...
##
## Not anemic Mild Moderate Severe
## 4163 2728 3022 149
##
## Not anemic Mild Moderate Severe
## 0.41373484 0.27111906 0.30033790 0.01480819
ggplot(df_clean, aes(x = anemia_level)) +
geom_bar(fill = "skyblue", color = "black", width = 0.7) + # tambah border hitam
labs(title = "Distribusi Anemia Level",
x = "Kategori Anemia",
y = "Frekuensi") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)num_vars <- c("births_5yr", "age_1st_birth", "hb_alt")
par(mfrow = c(1, 3))
for (var in num_vars) {
print(
ggplot(df, aes(x = .data[[var]])) +
geom_histogram(fill = "skyblue", color = "black", bins = 20) + # ← ini outline
ggtitle(paste("Distribusi", var)) +
theme_minimal() +
theme(
panel.grid.minor = element_blank()
)
)
}## Warning: Removed 23742 rows containing non-finite outside the scale range
## (`stat_bin()`).
{r}
## anemia_level residence education wealth mosquito_net smokes had_fever
## 4 Moderate Urban Secondary Richest Yes No No
## 6 Mild Urban Higher Richest Yes No No
## 7 Not anemic Urban Secondary Richest Yes No No
## 10 Moderate Urban Secondary Richest Yes No No
## 13 Mild Urban Higher Richest Yes No No
## iron_pills births_5yr age_1st_birth hb_alt
## 4 No 1 25 114
## 6 No 1 30 119
## 7 Yes 2 32 102
## 10 Yes 1 19 113
## 13 No 1 24 109
Uji Chi-Square dilakukan untuk mengetahui apakah terdapat hubungan yang signifikan antara masing-masing variabel kategorik dengan variabel dependen anemia_level. Hipotesis yang digunakan: H₀: Tidak ada hubungan antara variabel prediktor dengan anemia_level H₁: Terdapat hubungan antara variabel prediktor dengan anemia_level Keputusan: Tolak H₀ jika p-value < 0,05.
## 'data.frame': 10182 obs. of 11 variables:
## $ anemia_level : Ord.factor w/ 4 levels "Not anemic"<"Mild"<..: 3 2 1 3 2 2 2 1 3 3 ...
## $ residence : Factor w/ 2 levels "Rural","Urban": 2 2 2 2 2 2 2 2 2 2 ...
## $ education : Factor w/ 4 levels "Higher","No education",..: 4 1 4 4 1 1 1 1 4 4 ...
## $ wealth : Factor w/ 5 levels "Middle","Poorer",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ mosquito_net : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ smokes : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ had_fever : Factor w/ 3 levels "Don't know","No",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ iron_pills : Factor w/ 3 levels "Don't know","No",..: 2 2 3 3 2 3 2 2 3 3 ...
## $ births_5yr : int 1 1 2 1 1 2 2 2 2 2 ...
## $ age_1st_birth: int 25 30 32 19 24 19 19 22 22 22 ...
## $ hb_alt : num 114 119 102 113 109 96 111 117 96 106 ...
## - attr(*, "na.action")= 'omit' Named int [1:23742] 1 2 3 5 8 9 11 12 16 17 ...
## ..- attr(*, "names")= chr [1:23742] "1" "2" "3" "5" ...
for (var in cat_vars) {
tbl <- table(df_clean[[var]], df_clean$anemia_level)
chi_result <- chisq.test(tbl, simulate.p.value = TRUE, B = 2000)
cat(sprintf(
"\n%-20s | Chi-sq = %8.4f | df = %2d | p-value = %.4f %s\n",
var,
chi_result$statistic,
chi_result$parameter,
chi_result$p.value,
ifelse(chi_result$p.value < 0.05, "SIGNIFIKAN (ada hubungan)", "Tidak Signifikan")
))
}##
## residence | Chi-sq = 68.3951 | df = NA | p-value = 0.0005 SIGNIFIKAN (ada hubungan)
##
## education | Chi-sq = 97.2525 | df = NA | p-value = 0.0005 SIGNIFIKAN (ada hubungan)
##
## wealth | Chi-sq = 148.7527 | df = NA | p-value = 0.0005 SIGNIFIKAN (ada hubungan)
##
## mosquito_net | Chi-sq = 2.9287 | df = NA | p-value = 0.4183 Tidak Signifikan
##
## smokes | Chi-sq = 2.3223 | df = NA | p-value = 0.4543 Tidak Signifikan
##
## had_fever | Chi-sq = 17.9261 | df = NA | p-value = 0.0295 SIGNIFIKAN (ada hubungan)
##
## iron_pills | Chi-sq = 14.6927 | df = NA | p-value = 0.0305 SIGNIFIKAN (ada hubungan)
Matriks korelasi Pearson digunakan untuk melihat kekuatan hubungan linear antar variabel numerik. Hasil menunjukkan tidak adanya korelasi tinggi antar variabel numerik (nilai korelasi tertinggi hanya 0,149 antara hb_alt dan age_1st_birth), sehingga tidak terdapat masalah multikolinearitas di antara variabel numerik.
num_vars <- c("births_5yr", "age_1st_birth", "hb_alt")
cor_matrix <- cor(df_clean[, num_vars], method = "pearson")
cat("\n--- a) Matriks Korelasi Pearson (Variabel Numerik) ---\n")##
## --- a) Matriks Korelasi Pearson (Variabel Numerik) ---
## births_5yr age_1st_birth hb_alt
## births_5yr 1.000 -0.033 -0.026
## age_1st_birth -0.033 1.000 0.149
## hb_alt -0.026 0.149 1.000
df_long_num <- df_clean %>%
select(all_of(num_vars)) %>%
pivot_longer(cols = everything(), names_to = "variabel", values_to = "nilai")
ggplot(df_long_num, aes(x = variabel, y = nilai, fill = variabel)) +
geom_boxplot(outlier.colour = "red", outlier.size = 1.2) +
facet_wrap(~variabel, scales = "free") +
theme_bw() +
labs(title = "Boxplot Variabel Numerik (Deteksi Outlier)",
x = NULL, y = "Nilai") +
theme(legend.position = "none")num_data <- df_clean[, num_vars]
mah_dist <- mahalanobis(num_data,
center = colMeans(num_data),
cov = cov(num_data))
cutoff <- qchisq(0.999, df = length(num_vars))
n_mah_out <- sum(mah_dist > cutoff)
pct_mah <- round(n_mah_out / nrow(df_clean) * 100, 2)
cat(sprintf(" Cut-off Chi-square (p=0.001, df=%d): %.4f\n",
length(num_vars), cutoff))## Cut-off Chi-square (p=0.001, df=3): 16.2662
## Jumlah outlier multivariat : 59 (0.58%)
df_long_num <- df_no_outlier %>%
select(all_of(num_vars)) %>%
pivot_longer(cols = everything(), names_to = "variabel", values_to = "nilai")
ggplot(df_long_num, aes(x = variabel, y = nilai, fill = variabel)) +
geom_boxplot(outlier.colour = "red", outlier.size = 1.2) +
facet_wrap(~variabel, scales = "free") +
theme_bw() +
labs(title = "Boxplot Variabel Numerik (Deteksi Outlier)",
x = NULL, y = "Nilai") +
theme(legend.position = "none")df_iqr <- df_clean
for(col in num_vars){
Q1 <- quantile(df_iqr[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df_iqr[[col]], 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val
df_iqr <- df_iqr[
df_iqr[[col]] >= lower &
df_iqr[[col]] <= upper, ]
}
nrow(df_iqr)## [1] 9556
df_long_num <- df_iqr %>%
select(all_of(num_vars)) %>%
pivot_longer(cols = everything(), names_to = "variabel", values_to = "nilai")
ggplot(df_long_num, aes(x = variabel, y = nilai, fill = variabel)) +
geom_boxplot(outlier.colour = "red", outlier.size = 1.2) +
facet_wrap(~variabel, scales = "free") +
theme_bw() +
labs(title = "Boxplot Variabel Numerik (Deteksi Outlier)",
x = NULL, y = "Nilai") +
theme(legend.position = "none")Uji independensi digunakan untuk mengetahui ada tidaknya hubungan antara variabel independen dan dependen sebelum dilakukan pemodelan regresi logistik ordinal. Pengujian ini menggunakan uji Chi-Square dengan hipotesis: - H0: tidak ada hubungan - H1: terdapat hubungan Keputusan: H0 ditolak jika p-value < 0,05 yang berarti terdapat hubungan signifikan, dan sebaliknya jika p-value > 0,05 maka tidak signifikan.
for (var in cat_vars) {
tbl <- table(df_iqr[[var]], df_iqr$anemia_level)
chi_result <- chisq.test(tbl, simulate.p.value = TRUE, B = 2000)
cat(sprintf(
"\n%-20s | Chi-sq = %8.4f | df = %2d | p-value = %.4f %s\n",
var,
chi_result$statistic,
chi_result$parameter,
chi_result$p.value,
ifelse(chi_result$p.value < 0.05, "SIGNIFIKAN (ada hubungan)", "Tidak Signifikan")
))
}##
## residence | Chi-sq = 44.1499 | df = NA | p-value = 0.0005 SIGNIFIKAN (ada hubungan)
##
## education | Chi-sq = 68.6751 | df = NA | p-value = 0.0005 SIGNIFIKAN (ada hubungan)
##
## wealth | Chi-sq = 114.1504 | df = NA | p-value = 0.0005 SIGNIFIKAN (ada hubungan)
##
## mosquito_net | Chi-sq = 3.0267 | df = NA | p-value = 0.3848 Tidak Signifikan
##
## smokes | Chi-sq = 2.5594 | df = NA | p-value = 0.4268 Tidak Signifikan
##
## had_fever | Chi-sq = 14.6603 | df = NA | p-value = 0.0380 SIGNIFIKAN (ada hubungan)
##
## iron_pills | Chi-sq = 12.0207 | df = NA | p-value = 0.0660 Tidak Signifikan
Uji Variance Inflation Factor (VIF) dilakukan pada model regresi OLS proxy untuk mendeteksi multikolinearitas antar prediktor. Hipotesis:
- H₀: Tidak terdapat multikolinearitas (VIF < 10) - H₁: Terdapat multikolinearitas (VIF ≥ 10) Seluruh nilai GVIF1/(2·Df) berada di bawah 2 (nilai tertinggi 1,174 pada variabel residence), yang jauh di bawah ambang batas 10. Dengan demikian, tidak terdapat masalah multikolinearitas pada model dan asumsi ini terpenuhi.
ols_proxy <- lm(
as.numeric(anemia_level) ~ residence + education + wealth + had_fever + iron_pills +
births_5yr + age_1st_birth + hb_alt,
data = df_clean
)
vif_result <- vif(ols_proxy)
cat("VIF per Variabel:\n")## VIF per Variabel:
## GVIF Df GVIF^(1/(2*Df))
## residence 1.377 1 1.174
## education 2.118 3 1.133
## wealth 2.287 4 1.109
## had_fever 1.046 2 1.011
## iron_pills 1.059 2 1.015
## births_5yr 1.005 1 1.003
## age_1st_birth 1.353 1 1.163
## hb_alt 1.090 1 1.044
model_olr <- polr(
anemia_level ~ residence + education + wealth + had_fever +
births_5yr + age_1st_birth + hb_alt,
data = df_model,
Hess = TRUE,
method = "logistic"
)
cat("Jumlah observasi dalam model:", nrow(model_olr$model), "\n")## Jumlah observasi dalam model: 9442
## Call:
## polr(formula = anemia_level ~ residence + education + wealth +
## had_fever + births_5yr + age_1st_birth + hb_alt, data = df_model,
## Hess = TRUE, method = "logistic")
##
## Coefficients:
## residenceUrban educationNo education educationPrimary
## -0.070697657 0.123784826 0.072717435
## educationSecondary wealthPoorer wealthPoorest
## 0.108035382 0.006636448 0.122370261
## wealthRicher wealthRichest had_feverNo
## -0.010587232 -0.244232107 0.547344589
## had_feverYes births_5yr age_1st_birth
## 0.559700154 0.026155743 0.004013780
## hb_alt
## -0.021324616
##
## Intercepts:
## Not anemic|Mild Mild|Moderate Moderate|Severe
## -1.8089853 -0.6453711 2.8896855
##
## Residual Deviance: 21183.59
## AIC: 21215.59
## (114 observations deleted due to missingness)
Uji serentak digunakan untuk menguji secara bersama-sama apakah seluruh variabel prediktor berpengaruh terhadap tingkat anemia pada anak.
Hipotesis:
- H0 : Semua koefisien β = 0 (semua prediktor tidak berpengaruh)
- H1 : Minimal ada satu β ≠ 0
Keputusan: Tolak H0 jika p-value < 0,05
## Likelihood ratio tests of ordinal regression models
##
## Response: anemia_level
## Model
## 1 1
## 2 residence + education + wealth + had_fever + births_5yr + age_1st_birth + hb_alt
## Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 9439 21542.52
## 2 9426 21183.59 1 vs 2 13 358.9286 0
lrt_stat <- lrt$`LR stat.`[2]
lrt_df <- lrt$` Df`[2]
lrt_p <- lrt$`Pr(Chi)`[2]
cat(sprintf("\nG² = %.4f\n", lrt_stat))##
## G² = 358.9286
## df = 13
## p = 0.000000
cat(sprintf("Keputusan: %s\n",
ifelse(lrt_p < 0.05,
"Tolak H0 — model signifikan secara serentak",
"Gagal Tolak H0 — model tidak signifikan")))## Keputusan: Tolak H0 — model signifikan secara serentak
Uji parsial digunakan untuk menguji pengaruh masing-masing variabel prediktor secara individu terhadap tingkat anemia pada anak. Hipotesis:
- H0 : βk = 0 (prediktor ke-k tidak berpengaruh)
- H1 : βk ≠ 0 (prediktor ke-k berpengaruh)
Keputusan: Tolak H0 jika p-value < 0,05
coef_tbl <- coef(summary(model_olr))
# Pisahkan koefisien prediktor dan threshold
coef_pred <- coef_tbl[!rownames(coef_tbl) %in% names(model_olr$zeta), ]
coef_threshold <- coef_tbl[rownames(coef_tbl) %in% names(model_olr$zeta), ]
# Hitung p-value
p_values <- 2 * pnorm(abs(coef_pred[, "t value"]), lower.tail = FALSE)
# Gabungkan ke tabel hasil
hasil_parsial <- cbind(
round(coef_pred, 4),
"p value" = round(p_values, 4),
"keputusan" = ifelse(p_values < 0.05, "Signifikan", "Tidak Signifikan")
)
print(hasil_parsial)## Value Std. Error t value p value
## residenceUrban "-0.0707" "0.0462" "-1.5311" "0.1257"
## educationNo education "0.1238" "0.0946" "1.3079" "0.1909"
## educationPrimary "0.0727" "0.0947" "0.7678" "0.4426"
## educationSecondary "0.108" "0.0822" "1.3142" "0.1888"
## wealthPoorer "0.0066" "0.0608" "0.1091" "0.9132"
## wealthPoorest "0.1224" "0.0648" "1.8886" "0.0589"
## wealthRicher "-0.0106" "0.0597" "-0.1774" "0.8592"
## wealthRichest "-0.2442" "0.0714" "-3.4192" "6e-04"
## had_feverNo "0.5473" "0.0789" "6.9403" "0"
## had_feverYes "0.5597" "0.077" "7.2725" "0"
## births_5yr "0.0262" "0.0298" "0.8778" "0.3801"
## age_1st_birth "0.004" "0.0058" "0.6912" "0.4894"
## hb_alt "-0.0213" "0.0014" "-15.2388" "0"
## keputusan
## residenceUrban "Tidak Signifikan"
## educationNo education "Tidak Signifikan"
## educationPrimary "Tidak Signifikan"
## educationSecondary "Tidak Signifikan"
## wealthPoorer "Tidak Signifikan"
## wealthPoorest "Tidak Signifikan"
## wealthRicher "Tidak Signifikan"
## wealthRichest "Signifikan"
## had_feverNo "Signifikan"
## had_feverYes "Signifikan"
## births_5yr "Tidak Signifikan"
## age_1st_birth "Tidak Signifikan"
## hb_alt "Signifikan"
n <- nrow(df_model)
ll_full <- as.numeric(logLik(model_olr))
ll_null <- as.numeric(logLik(model_null))
r2_mcf <- round(1 - (ll_full / ll_null), 4)
r2_cox <- round(1 - exp((2 / n) * (ll_null - ll_full)), 4)
r2_max <- 1 - exp((2 / n) * ll_null)
r2_nagel <- round(r2_cox / r2_max, 4)
cat("McFadden R² :", r2_mcf, "\n")## McFadden R² : 0.0167
## Cox-Snell R² : 0.0369
## Nagelkerke R² : 0.0412
Nilai Nagelkerke R² menunjukkan bahwa variabel-variabel prediktor dalam model hanya mampu menjelaskan sekitar 4,12% variasi pada variabel respon. Hal ini mengindikasikan bahwa kemampuan model dalam menjelaskan data masih relatif rendah, sehingga kemungkinan terdapat faktor lain di luar model yang lebih berpengaruh terhadap variabel respon.