Small Area Estimation (SAE) adalah teknik statistik yang dirancang untuk menghasilkan estimasi yang handal pada domain kecil (area kecil) di mana ukuran sampel terlalu kecil untuk menghasilkan estimasi langsung yang presisi. Pendekatan ini sangat relevan dalam konteks statistik resmi (official statistics), di mana data survei nasional perlu “diperkecil” ke level kabupaten/kota atau bahkan kecamatan.
Ketika peubah respons bukan data kontinu dan tidak berdistribusi normal — misalnya data cacahan (count data) seperti jumlah kejadian penyakit, jumlah penduduk miskin, atau jumlah kelahiran — model regresi linier biasa tidak lagi tepat. Pada kondisi ini, Generalized Linear Mixed Model (GLMM) dengan distribusi Poisson menjadi alat yang sesuai.
Pendekatan Empirical Bayes (EB) dalam SAE memanfaatkan model GLMM untuk “meminjam kekuatan” (borrowing strength) dari area-area lain sehingga estimasi untuk area dengan sampel kecil dapat diperbaiki kualitasnya.
Dokumen ini bertujuan untuk:
Model Linier (LM) adalah model regresi untuk peubah respons yang berdistribusi normal. Generalized Linear Model (GLM) merupakan pengembangan LM untuk peubah respons yang tidak berdistribusi normal, selama distribusinya termasuk dalam keluarga eksponensial.
GLM memiliki tiga komponen utama:
| Komponen | Keterangan |
|---|---|
| Komponen Acak | Peubah respons \(Y\) berdistribusi dari keluarga eksponensial |
| Komponen Sistematik | Prediktor linier \(\eta = \mathbf{X}\boldsymbol{\beta}\) |
| Fungsi Penghubung | Menghubungkan \(\mu = E(Y)\) dengan \(\eta\): \(g(\mu) = \eta\) |
Beberapa distribusi dan fungsi penghubung yang umum digunakan:
| Distribusi | Fungsi Penghubung | Persamaan |
|---|---|---|
| Normal | Identity | \(\mu = \eta\) |
| Binomial | Logit | \(\log\frac{\mu}{1-\mu} = \eta\) |
| Poisson | Log | \(\log(\mu) = \eta\) |
| Gamma | Inverse | \(1/\mu = \eta\) |
GLMM diperoleh dengan menambahkan efek acak (random effect) pada GLM (Datta & Ghosh, 1991). Model ini menggabungkan:
Secara umum, model GLMM dapat ditulis sebagai:
\[g(\mu_{ij}) = \mathbf{x}_{ij}^T \boldsymbol{\beta} + \mathbf{z}_{ij}^T \mathbf{u}_i\]
di mana \(\mathbf{u}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{G})\) adalah vektor efek acak untuk area \(i\).
Untuk data cacahan \(Y_{ij}\) (observasi \(j\) pada area \(i\)), GLMM Poisson dengan random intercept didefinisikan sebagai:
\[Y_{ij} \mid u_i \sim \text{Poisson}(\lambda_{ij})\]
\[\log(\lambda_{ij}) = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + u_i\]
\[u_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma_u^2), \quad i = 1, \ldots, m\]
Model ini memiliki dua level:
Metode Empirical Bayes (EB) mengasumsikan bahwa distribusi prior tidak diketahui sepenuhnya, kemudian data digunakan untuk menduga parameter prior (Rao, 2003).
Langkah-langkah EB menurut Rao (2003):
Untuk model GLMM Poisson, estimasi EB untuk area \(i\) adalah:
\[\hat{\theta}_i^{EB} = \frac{1}{n_i} \sum_{j=1}^{n_i} \hat{\lambda}_{ij}\]
di mana \(\hat{\lambda}_{ij} = \exp(\hat{\beta}_0 + \hat{\beta}_1 x_{1ij} + \hat{\beta}_2 x_{2ij} + \hat{u}_i)\) adalah prediksi dengan efek acak.
Mengapa EB lebih baik dari estimasi langsung?
Estimasi langsung (rata-rata sampel) hanya menggunakan data dari area tersebut, sehingga pada area dengan sampel kecil hasilnya tidak presisi (RSE tinggi). Estimasi EB “meminjam kekuatan” dari model yang dibangun menggunakan seluruh data, sehingga menghasilkan estimasi lebih stabil.
MSE dari estimasi EB mengukur akurasi estimasi. Dalam praktik, MSE sulit dihitung secara analitik sehingga digunakan pendekatan numerik:
Prosedur Jackknife Leave-One-Area-Out:
\[\widehat{\text{MSE}}_j^{JK} = \frac{m-1}{m} \sum_{i^*=1}^{m} \left(\hat{\theta}_j^{(-i^*)} - \bar{\theta}_j^{(-\cdot)}\right)^2\]
Prosedur Bootstrap:
\[\text{RSE}_i^{Boot} = 100 \times \frac{\text{sd}(\hat{\theta}_i^{*(b)})}{\bar{\hat{\theta}}_i^{*(b)}}\]
Data disimulasikan untuk mencerminkan skenario riil SAE:
kategori (A, B, C — misalnya strata wilayah)set.seed(123)
n_area <- 20
obs_per_area <- 10
n <- n_area * obs_per_area
# Variabel penjelas
area <- factor(rep(1:n_area, each = obs_per_area))
x1 <- runif(n, 0, 10)
x2 <- runif(n, 5, 15)
kategori <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
# Random intercept per area
area_effect <- rnorm(n_area, mean = 0, sd = 0.5)
random_intercept <- area_effect[as.integer(area)]
# True lambda dan respon Poisson
lambda <- exp(1 + 0.3 * x1 - 0.2 * x2 + random_intercept)
y <- rpois(n, lambda)
data_sae <- data.frame(area, x1, x2, kategori, y)summary_df <- data_sae %>%
summarise(
N_total = n(),
N_area = n_distinct(area),
Mean_y = round(mean(y), 3),
SD_y = round(sd(y), 3),
Min_y = min(y),
Max_y = max(y),
Mean_x1 = round(mean(x1), 3),
Mean_x2 = round(mean(x2), 3)
)
kable(summary_df, caption = "Statistik Deskriptif Data Simulasi",
booktabs = TRUE, align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| N_total | N_area | Mean_y | SD_y | Min_y | Max_y | Mean_x1 | Mean_x2 |
|---|---|---|---|---|---|---|---|
| 200 | 20 | 2.68 | 3.299 | 0 | 20 | 5.064 | 9.891 |
p1 <- ggplot(data_sae, aes(x = y)) +
geom_histogram(aes(y = after_stat(density)), bins = 20,
fill = "steelblue", alpha = 0.7, color = "white") +
geom_density(color = "darkred", linewidth = 1) +
labs(title = "Distribusi Variabel Respons Y",
x = "Y (Count)", y = "Densitas") +
theme_minimal(base_size = 12)
p2 <- data_sae %>%
group_by(area) %>%
summarise(mean_y = mean(y), .groups = "drop") %>%
ggplot(aes(x = area, y = mean_y)) +
geom_col(fill = "steelblue", alpha = 0.8) +
geom_hline(yintercept = mean(data_sae$y), color = "tomato",
linetype = "dashed", linewidth = 1) +
labs(title = "Rata-rata Y per Area",
subtitle = "Garis merah = rata-rata keseluruhan",
x = "Area", y = "Rata-rata Y") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p1 | p2Figure 3.1: Distribusi variabel respons Y per area. Variasi antar area terlihat jelas, mencerminkan adanya random effect.
p3 <- ggplot(data_sae, aes(x = x1, y = y, color = kategori)) +
geom_point(alpha = 0.6, size = 1.5) +
geom_smooth(method = "loess", se = FALSE, linewidth = 1) +
scale_color_brewer(palette = "Set1") +
labs(title = "Y vs x1 per Kategori",
x = "x1", y = "Y", color = "Kategori") +
theme_minimal(base_size = 11)
p4 <- ggplot(data_sae, aes(x = x2, y = y, color = kategori)) +
geom_point(alpha = 0.6, size = 1.5) +
geom_smooth(method = "loess", se = FALSE, linewidth = 1) +
scale_color_brewer(palette = "Set1") +
labs(title = "Y vs x2 per Kategori",
x = "x2", y = "Y", color = "Kategori") +
theme_minimal(base_size = 11)
p3 | p4Figure 3.2: Hubungan antara kovariat (x1, x2) dengan variabel respons Y, diwarnai berdasarkan kategori.
Interpretasi Eksplorasi Data:
Histogram menunjukkan distribusi Y yang menjulur ke kanan (right-skewed), konsisten dengan data cacahan Poisson. Rata-rata Y antar area bervariasi cukup signifikan (terlihat dari bar plot), mengindikasikan adanya random effect antar area yang perlu dimodelkan. Hubungan Y dengan \(x_1\) cenderung positif sedangkan dengan \(x_2\) cenderung negatif, sesuai dengan parameter simulasi yang digunakan.
Model GLMM Poisson yang diestimasi adalah:
\[\log(\lambda_{ij}) = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + \beta_3 \cdot \mathbb{1}[\text{kat}=B] + \beta_4 \cdot \mathbb{1}[\text{kat}=C] + u_i\]
\[u_i \sim \mathcal{N}(0, \sigma_u^2)\]
Estimasi dilakukan menggunakan paket glmmTMB yang menggunakan pendekatan Maximum Likelihood dengan integrasi numerik (Laplace approximation).
## Family: poisson ( log )
## Formula: y ~ x1 + x2 + kategori + (1 | area)
## Data: data_sae
##
## AIC BIC logLik -2*log(L) df.resid
## 677.3 697.1 -332.7 665.3 194
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## area (Intercept) 0.09663 0.3108
## Number of obs: 200, groups: area, 20
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.55314 0.24211 2.285 0.0223 *
## x1 0.30276 0.01992 15.198 <2e-16 ***
## x2 -0.16822 0.01818 -9.254 <2e-16 ***
## kategoriB 0.13909 0.11583 1.201 0.2298
## kategoriC 0.05749 0.11648 0.494 0.6216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef_df <- as.data.frame(smry$coefficients$cond)
coef_df$Parameter <- rownames(coef_df)
rownames(coef_df) <- NULL
coef_df <- coef_df[, c("Parameter", "Estimate", "Std. Error", "z value", "Pr(>|z|)")]
coef_df[, 2:5] <- round(coef_df[, 2:5], 4)
colnames(coef_df) <- c("Parameter", "Estimasi", "Std. Error", "z-value", "p-value")
kable(coef_df, caption = "Estimasi Koefisien Fixed Effect GLMM Poisson",
booktabs = TRUE, align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which(coef_df$`p-value` < 0.05), bold = TRUE, color = "white",
background = "#2c7bb6")| Parameter | Estimasi | Std. Error | z-value | p-value |
|---|---|---|---|---|
| (Intercept) | 0.5531 | 0.2421 | 2.2846 | 0.0223 |
| x1 | 0.3028 | 0.0199 | 15.1981 | 0.0000 |
| x2 | -0.1682 | 0.0182 | -9.2543 | 0.0000 |
| kategoriB | 0.1391 | 0.1158 | 1.2008 | 0.2298 |
| kategoriC | 0.0575 | 0.1165 | 0.4936 | 0.6216 |
re_var <- VarCorr(model_glmm)
sigma_u <- attr(re_var$cond$area, "stddev")
cat("Varians Random Effect (σ²_u) :", round(sigma_u^2, 4), "\n")## Varians Random Effect (σ²_u) : 0.0966
## Std. Dev. Random Effect (σ_u) : 0.3108
## Nilai σ_u simulasi (true) : 0.5
## Selisih (|estimasi - true|) : 0.1892
re_vals <- ranef(model_glmm)$cond$area
re_df <- data.frame(
area = rownames(re_vals),
BLUP = re_vals[, 1],
true_u = area_effect
)
ggplot(re_df, aes(x = reorder(area, BLUP))) +
geom_hline(yintercept = 0, color = "gray50", linetype = "dashed") +
geom_segment(aes(xend = reorder(area, BLUP), y = 0, yend = BLUP),
color = "steelblue", linewidth = 0.8) +
geom_point(aes(y = BLUP), color = "steelblue", size = 2.5) +
geom_point(aes(y = true_u), color = "tomato", size = 2, shape = 17) +
labs(title = "BLUP Random Intercept per Area",
subtitle = "Biru = estimasi BLUP | Merah (▲) = nilai true (simulasi)",
x = "Area", y = "Random Intercept (u_i)") +
theme_minimal(base_size = 11) +
coord_flip()Figure 4.1: BLUP (Best Linear Unbiased Prediction) random intercept per area. Area di atas nol memiliki kecenderungan count lebih tinggi dari rata-rata.
b <- fixef(model_glmm)$cond
IRR <- exp(b) # Incidence Rate Ratio
irr_df <- data.frame(
Parameter = names(IRR),
Estimasi = round(b, 4),
IRR = round(IRR, 4),
CI_lower = round(exp(b - 1.96 * coef_df$`Std. Error`), 4),
CI_upper = round(exp(b + 1.96 * coef_df$`Std. Error`), 4)
)
kable(irr_df, caption = "Incidence Rate Ratio (IRR) Koefisien Fixed Effect",
booktabs = TRUE, align = "c",
col.names = c("Parameter", "Estimasi (log)", "IRR", "CI Lower 95%", "CI Upper 95%")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Parameter | Estimasi (log) | IRR | CI Lower 95% | CI Upper 95% | |
|---|---|---|---|---|---|
| (Intercept) | (Intercept) | 0.5531 | 1.7387 | 1.0818 | 2.7945 |
| x1 | x1 | 0.3028 | 1.3536 | 1.3018 | 1.4074 |
| x2 | x2 | -0.1682 | 0.8452 | 0.8156 | 0.8759 |
| kategoriB | kategoriB | 0.1391 | 1.1492 | 0.9159 | 1.4420 |
| kategoriC | kategoriC | 0.0575 | 1.0592 | 0.8429 | 1.3309 |
Interpretasi IRR:
- Intercept: Ketika semua kovariat = 0, ekspektasi count adalah \(\exp(\hat{\beta}_0)\).
- x1 (IRR ≈ 1.30): Setiap kenaikan 1 unit \(x_1\) meningkatkan ekspektasi count sekitar 30%, ceteris paribus (IRR > 1 → efek positif).
- x2 (IRR ≈ 0.82): Setiap kenaikan 1 unit \(x_2\) menurunkan ekspektasi count sekitar 18%, ceteris paribus (IRR < 1 → efek negatif).
- kategoriB & kategoriC: Perbedaan rata-rata count relatif terhadap kategori A (referensi).
data_sae$pred_lambda <- predict(model_glmm, type = "response", re.form = NULL)
EB_main <- data_sae %>%
group_by(area) %>%
summarise(EB_estimate = mean(pred_lambda), .groups = "drop")
kable(EB_main, caption = "Estimasi Empirical Bayes (EB) per Area",
booktabs = TRUE, align = "c",
col.names = c("Area", "EB Estimate (λ̂)"),
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Area | EB Estimate (λ̂) |
|---|---|
| 1 | 2.1601 |
| 2 | 3.3671 |
| 3 | 3.0399 |
| 4 | 3.5204 |
| 5 | 1.1504 |
| 6 | 3.6833 |
| 7 | 2.4561 |
| 8 | 1.4332 |
| 9 | 3.9729 |
| 10 | 1.7978 |
| 11 | 2.6098 |
| 12 | 4.8768 |
| 13 | 1.6754 |
| 14 | 2.9117 |
| 15 | 1.1161 |
| 16 | 2.5394 |
| 17 | 2.2991 |
| 18 | 2.5243 |
| 19 | 2.9766 |
| 20 | 3.2867 |
direct_est <- data_sae %>%
group_by(area) %>%
summarise(
y_direct = mean(y),
SE_direct = sd(y) / sqrt(n()),
.groups = "drop"
) %>%
mutate(RSE_direct = 100 * SE_direct / y_direct)
kable(direct_est, caption = "Estimasi Langsung (Direct Estimate) per Area",
booktabs = TRUE, align = "c",
col.names = c("Area", "Rata-rata Y", "SE Direct", "RSE Direct (%)"),
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
column_spec(4, color = ifelse(direct_est$RSE_direct > 25, "red", "black"),
bold = ifelse(direct_est$RSE_direct > 25, TRUE, FALSE))| Area | Rata-rata Y | SE Direct | RSE Direct (%) |
|---|---|---|---|
| 1 | 1.8 | 0.5538 | 30.7653 |
| 2 | 3.7 | 1.0960 | 29.6203 |
| 3 | 3.1 | 1.2423 | 40.0745 |
| 4 | 3.6 | 1.2927 | 35.9087 |
| 5 | 1.2 | 0.4422 | 36.8514 |
| 6 | 4.3 | 1.4457 | 33.6205 |
| 7 | 2.3 | 0.8570 | 37.2608 |
| 8 | 1.1 | 0.3786 | 34.4176 |
| 9 | 3.8 | 1.3565 | 35.6965 |
| 10 | 1.9 | 0.7810 | 41.1066 |
| 11 | 2.4 | 0.9452 | 39.3818 |
| 12 | 5.1 | 2.1158 | 41.4865 |
| 13 | 1.3 | 0.6675 | 51.3461 |
| 14 | 2.9 | 0.7371 | 25.4176 |
| 15 | 0.8 | 0.4163 | 52.0416 |
| 16 | 3.0 | 1.0954 | 36.5148 |
| 17 | 2.5 | 0.5821 | 23.2857 |
| 18 | 2.6 | 0.9684 | 37.2457 |
| 19 | 2.8 | 0.8537 | 30.4911 |
| 20 | 3.4 | 1.1372 | 33.4485 |
Catatan: Area dengan RSE > 25% (ditampilkan merah & tebal) mengindikasikan estimasi langsung yang tidak presisi dan merupakan kandidat utama yang diuntungkan oleh pendekatan EB.
compare_df <- left_join(direct_est, EB_main, by = "area") %>%
mutate(area_num = as.numeric(as.character(area)))
ggplot(compare_df, aes(x = area_num)) +
geom_line(aes(y = y_direct, color = "Direct"), linewidth = 1) +
geom_line(aes(y = EB_estimate, color = "Empirical Bayes"), linewidth = 1,
linetype = "dashed") +
geom_point(aes(y = y_direct, color = "Direct"), size = 2) +
geom_point(aes(y = EB_estimate, color = "Empirical Bayes"), size = 2) +
scale_color_manual(values = c("Direct" = "tomato",
"Empirical Bayes" = "steelblue")) +
labs(title = "Perbandingan Estimasi: Direct vs Empirical Bayes",
subtitle = "Per area (n_obs = 10 per area)",
x = "Area", y = "Estimasi λ",
color = "Metode") +
theme_minimal(base_size = 12)Figure 5.1: Perbandingan estimasi EB vs estimasi langsung per area. Kedua estimasi umumnya konsisten, namun EB cenderung lebih stabil.
n_area <- length(unique(data_sae$area))
area_levels <- levels(data_sae$area)
EB_jack <- matrix(NA, nrow = n_area, ncol = n_area)
for (i in seq_len(n_area)) {
area_out <- area_levels[i]
data_jack <- filter(data_sae, area != area_out)
model_jack <- glmmTMB(
y ~ x1 + x2 + kategori + (1 | area),
data = data_jack,
family = poisson
)
data_jack$pred_lambda <- predict(model_jack, type = "response", re.form = NULL)
pred_jack <- data_jack %>%
group_by(area) %>%
summarise(EB_jack = mean(pred_lambda), .groups = "drop")
for (j in seq_len(n_area)) {
if (area_levels[j] != area_out) {
idx <- pred_jack$area == area_levels[j]
if (any(idx)) EB_jack[i, j] <- pred_jack$EB_jack[idx]
}
}
}
# Hitung MSE Jackknife
jackknife_mse <- rep(NA_real_, n_area)
for (j in seq_len(n_area)) {
est_jack <- EB_jack[, j]
valid_est <- est_jack[!is.na(est_jack)]
theta_bar <- mean(valid_est)
jackknife_mse[j] <- ((length(valid_est) - 1) / length(valid_est)) *
mean((valid_est - theta_bar)^2)
}
EB_main$MSE_jack <- jackknife_mse
EB_main$SE_jack <- sqrt(jackknife_mse)
EB_main$RSE_jack <- 100 * EB_main$SE_jack / EB_main$EB_estimatekable(EB_main, caption = "Hasil Estimasi EB dengan MSE Jackknife",
booktabs = TRUE, align = "c", digits = 4,
col.names = c("Area", "EB Estimate", "MSE (Jack)", "SE (Jack)", "RSE Jack (%)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
column_spec(5, color = ifelse(EB_main$RSE_jack > 5, "darkorange", "darkgreen"),
bold = TRUE)| Area | EB Estimate | MSE (Jack) | SE (Jack) | RSE Jack (%) |
|---|---|---|---|---|
| 1 | 2.1601 | 0.0007 | 0.0261 | 1.2085 |
| 2 | 3.3671 | 0.0015 | 0.0387 | 1.1497 |
| 3 | 3.0399 | 0.0004 | 0.0190 | 0.6244 |
| 4 | 3.5204 | 0.0005 | 0.0215 | 0.6101 |
| 5 | 1.1504 | 0.0002 | 0.0140 | 1.2135 |
| 6 | 3.6833 | 0.0010 | 0.0323 | 0.8770 |
| 7 | 2.4561 | 0.0003 | 0.0180 | 0.7334 |
| 8 | 1.4332 | 0.0005 | 0.0223 | 1.5587 |
| 9 | 3.9729 | 0.0006 | 0.0255 | 0.6411 |
| 10 | 1.7978 | 0.0003 | 0.0172 | 0.9585 |
| 11 | 2.6098 | 0.0004 | 0.0198 | 0.7591 |
| 12 | 4.8768 | 0.0009 | 0.0302 | 0.6202 |
| 13 | 1.6754 | 0.0007 | 0.0262 | 1.5656 |
| 14 | 2.9117 | 0.0003 | 0.0163 | 0.5593 |
| 15 | 1.1161 | 0.0003 | 0.0177 | 1.5823 |
| 16 | 2.5394 | 0.0020 | 0.0444 | 1.7489 |
| 17 | 2.2991 | 0.0006 | 0.0255 | 1.1074 |
| 18 | 2.5243 | 0.0004 | 0.0209 | 0.8292 |
| 19 | 2.9766 | 0.0004 | 0.0197 | 0.6609 |
| 20 | 3.2867 | 0.0006 | 0.0238 | 0.7235 |
bootstrap_rse_area <- function(data, n_bootstrap = 100) {
area_list <- unique(data$area)
results <- data.frame(area = area_list, RSE_bootstrap = NA_real_)
for (a in area_list) {
bootstrap_est <- numeric(n_bootstrap)
for (i in seq_len(n_bootstrap)) {
idx <- sample(seq_len(nrow(data)), replace = TRUE)
data_boot <- data[idx, ]
model_boot <- tryCatch(
glmmTMB(y ~ x1 + x2 + kategori + (1 | area),
data = data_boot, family = poisson),
error = function(e) NULL
)
if (is.null(model_boot)) {
bootstrap_est[i] <- NA_real_
next
}
data_boot$pred_lambda <- predict(model_boot, type = "response",
re.form = NULL)
eb_boot <- data_boot %>%
group_by(area) %>%
summarise(EB_boot = mean(pred_lambda), .groups = "drop")
boot_val <- eb_boot$EB_boot[eb_boot$area == a]
bootstrap_est[i] <- if (length(boot_val) == 0) NA_real_ else boot_val
}
valid_boot <- bootstrap_est[!is.na(bootstrap_est)]
mean_eb <- mean(valid_boot)
se_eb <- sd(valid_boot)
rse <- 100 * se_eb / mean_eb
results$RSE_bootstrap[results$area == a] <- rse
}
return(results)
}
bootstrap_result <- bootstrap_rse_area(data_sae, n_bootstrap = 100)final_result <- left_join(direct_est, EB_main, by = "area") %>%
left_join(bootstrap_result, by = "area") %>%
mutate(area_num = as.numeric(as.character(area)))
display_final <- final_result %>%
select(area, y_direct, EB_estimate, SE_direct, SE_jack,
RSE_direct, RSE_jack, RSE_bootstrap) %>%
mutate(across(where(is.numeric), ~ round(.x, 3)))
kable(display_final,
caption = "Tabel Lengkap: Direct Estimate vs Empirical Bayes per Area",
booktabs = TRUE, align = "c",
col.names = c("Area", "Direct Est.", "EB Est.", "SE Direct",
"SE Jack", "RSE Direct(%)", "RSE Jack(%)", "RSE Boot(%)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
add_header_above(c(" " = 1, "Estimasi Titik" = 2,
"Standar Error" = 2, "RSE (%)" = 3))| Area | Direct Est. | EB Est. | SE Direct | SE Jack | RSE Direct(%) | RSE Jack(%) | RSE Boot(%) |
|---|---|---|---|---|---|---|---|
| 1 | 1.8 | 2.160 | 0.554 | 0.026 | 30.765 | 1.208 | 21.086 |
| 2 | 3.7 | 3.367 | 1.096 | 0.039 | 29.620 | 1.150 | 30.504 |
| 3 | 3.1 | 3.040 | 1.242 | 0.019 | 40.075 | 0.624 | 32.790 |
| 4 | 3.6 | 3.520 | 1.293 | 0.021 | 35.909 | 0.610 | 31.462 |
| 5 | 1.2 | 1.150 | 0.442 | 0.014 | 36.851 | 1.213 | 34.571 |
| 6 | 4.3 | 3.683 | 1.446 | 0.032 | 33.621 | 0.877 | 33.653 |
| 7 | 2.3 | 2.456 | 0.857 | 0.018 | 37.261 | 0.733 | 29.043 |
| 8 | 1.1 | 1.433 | 0.379 | 0.022 | 34.418 | 1.559 | 25.658 |
| 9 | 3.8 | 3.973 | 1.356 | 0.025 | 35.696 | 0.641 | 30.778 |
| 10 | 1.9 | 1.798 | 0.781 | 0.017 | 41.107 | 0.958 | 35.080 |
| 11 | 2.4 | 2.610 | 0.945 | 0.020 | 39.382 | 0.759 | 34.508 |
| 12 | 5.1 | 4.877 | 2.116 | 0.030 | 41.487 | 0.620 | 39.725 |
| 13 | 1.3 | 1.675 | 0.667 | 0.026 | 51.346 | 1.566 | 48.498 |
| 14 | 2.9 | 2.912 | 0.737 | 0.016 | 25.418 | 0.559 | 20.978 |
| 15 | 0.8 | 1.116 | 0.416 | 0.018 | 52.042 | 1.582 | 33.523 |
| 16 | 3.0 | 2.539 | 1.095 | 0.044 | 36.515 | 1.749 | 44.210 |
| 17 | 2.5 | 2.299 | 0.582 | 0.025 | 23.286 | 1.107 | 22.984 |
| 18 | 2.6 | 2.524 | 0.968 | 0.021 | 37.246 | 0.829 | 33.429 |
| 19 | 2.8 | 2.977 | 0.854 | 0.020 | 30.491 | 0.661 | 25.086 |
| 20 | 3.4 | 3.287 | 1.137 | 0.024 | 33.448 | 0.723 | 38.973 |
rse_summary <- final_result %>%
summarise(
`RSE Direct - Mean` = round(mean(RSE_direct, na.rm = TRUE), 2),
`RSE Direct - Median` = round(median(RSE_direct, na.rm = TRUE), 2),
`RSE Direct - Max` = round(max(RSE_direct, na.rm = TRUE), 2),
`RSE Jack - Mean` = round(mean(RSE_jack, na.rm = TRUE), 2),
`RSE Jack - Median` = round(median(RSE_jack, na.rm = TRUE), 2),
`RSE Jack - Max` = round(max(RSE_jack, na.rm = TRUE), 2),
`RSE Boot - Mean` = round(mean(RSE_bootstrap, na.rm = TRUE), 2),
`RSE Boot - Median` = round(median(RSE_bootstrap, na.rm = TRUE), 2),
`RSE Boot - Max` = round(max(RSE_bootstrap, na.rm = TRUE), 2)
) %>%
pivot_longer(everything(), names_to = "Statistik", values_to = "Nilai") %>%
mutate(Metode = sub(" - .*", "", Statistik),
Ukuran = sub(".* - ", "", Statistik)) %>%
pivot_wider(names_from = Ukuran, values_from = Nilai) %>%
select(Metode, Mean, Median, Max)
kable(rse_summary, caption = "Ringkasan RSE per Metode (dalam %)",
booktabs = TRUE, align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#2c7bb6", color = "white")| Metode | Mean | Median | Max |
|---|---|---|---|
| RSE Direct | 36.30 | NA | NA |
| RSE Direct | NA | 36.21 | NA |
| RSE Direct | NA | NA | 52.04 |
| RSE Jack | 0.99 | NA | NA |
| RSE Jack | NA | 0.85 | NA |
| RSE Jack | NA | NA | 1.75 |
| RSE Boot | 32.33 | NA | NA |
| RSE Boot | NA | 33.11 | NA |
| RSE Boot | NA | NA | 48.50 |
rse_long <- final_result %>%
select(area, area_num, RSE_direct, RSE_jack, RSE_bootstrap) %>%
pivot_longer(cols = c(RSE_direct, RSE_jack, RSE_bootstrap),
names_to = "Method",
values_to = "RSE") %>%
mutate(Method = factor(Method,
levels = c("RSE_direct", "RSE_jack", "RSE_bootstrap"),
labels = c("Direct", "Jackknife", "Bootstrap")))
ggplot(rse_long, aes(x = area_num, y = RSE, color = Method)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2.5) +
geom_hline(yintercept = 25, linetype = "dashed",
color = "gray40", linewidth = 0.8) +
annotate("text", x = 18, y = 26.5, label = "RSE = 25%",
color = "gray40", size = 3.5) +
scale_color_manual(values = c("Direct" = "tomato",
"Jackknife" = "steelblue",
"Bootstrap" = "darkgreen")) +
scale_x_continuous(breaks = 1:20) +
labs(
title = "Perbandingan RSE: Direct, Jackknife, dan Bootstrap",
subtitle = "RSE ditampilkan untuk masing-masing metode estimasi per area",
x = "Area",
y = "Relative Standard Error (%)",
color = "Metode Estimasi"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top",
panel.grid.minor = element_blank())Figure 9.1: Perbandingan RSE per area untuk tiga metode estimasi. RSE Jackknife dan Bootstrap umumnya jauh lebih rendah dibandingkan RSE Direct, menunjukkan keunggulan estimasi EB.
ggplot(rse_long, aes(x = Method, y = RSE, fill = Method)) +
geom_boxplot(alpha = 0.7, outlier.shape = 21, outlier.size = 3) +
geom_jitter(width = 0.15, alpha = 0.5, size = 1.5) +
scale_fill_manual(values = c("Direct" = "tomato",
"Jackknife" = "steelblue",
"Bootstrap" = "darkgreen")) +
geom_hline(yintercept = 25, linetype = "dashed", color = "gray40") +
labs(
title = "Distribusi RSE per Metode Estimasi",
subtitle = "Titik-titik mewakili nilai RSE per area",
x = "Metode",
y = "RSE (%)"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")Figure 9.2: Distribusi RSE per metode. Jackknife dan Bootstrap menunjukkan dispersi RSE yang jauh lebih kecil dibandingkan estimasi langsung.
ggplot(final_result, aes(x = y_direct, y = EB_estimate)) +
geom_abline(slope = 1, intercept = 0, color = "gray60",
linetype = "dashed", linewidth = 1) +
geom_point(aes(color = RSE_direct), size = 3) +
geom_text(aes(label = area), vjust = -0.7, size = 2.8) +
scale_color_gradient2(low = "darkgreen", mid = "orange", high = "red",
midpoint = 25,
name = "RSE Direct (%)") +
labs(
title = "EB Estimate vs Direct Estimate per Area",
subtitle = "Garis putus-putus = garis y = x (estimasi identik)",
x = "Direct Estimate (Rata-rata Y)",
y = "Empirical Bayes Estimate"
) +
theme_minimal(base_size = 12)Figure 9.3: Scatter plot estimasi EB vs estimasi langsung. Titik di atas garis y=x berarti EB menghasilkan estimasi lebih tinggi dari Direct, dan sebaliknya.
final_result <- final_result %>%
mutate(
gain_jack = RSE_direct - RSE_jack,
gain_boot = RSE_direct - RSE_bootstrap
)
gain_long <- final_result %>%
select(area_num, gain_jack, gain_boot) %>%
pivot_longer(cols = c(gain_jack, gain_boot),
names_to = "method",
values_to = "gain") %>%
mutate(method = case_when(
method == "gain_jack" ~ "Jackknife",
method == "gain_boot" ~ "Bootstrap",
TRUE ~ method
))
ggplot(gain_long, aes(x = area_num, y = gain, fill = gain > 0)) +
geom_col(alpha = 0.8) +
geom_hline(yintercept = 0, color = "black") +
facet_wrap(~method, nrow = 2) +
scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato"),
labels = c("TRUE" = "EB lebih baik", "FALSE" = "Direct lebih baik"),
name = "") +
scale_x_continuous(breaks = 1:20) +
labs(
title = "Gain RSE: Direct − EB per Area",
subtitle = "Biru = EB lebih presisi | Merah = Direct lebih presisi",
x = "Area",
y = "Selisih RSE (Direct − EB)"
) +
theme_minimal(base_size = 11) +
theme(legend.position = "top",
panel.grid.minor = element_blank())Figure 9.4: Gain (penurunan RSE) dari penggunaan metode EB (Jackknife) dibandingkan estimasi langsung. Gain positif berarti EB memberikan estimasi lebih presisi.
final_result <- final_result %>%
mutate(
EB_lower = EB_estimate - 1.96 * SE_jack,
EB_upper = EB_estimate + 1.96 * SE_jack
)
ggplot(final_result %>% arrange(EB_estimate) %>%
mutate(area_ord = factor(area, levels = area)),
aes(y = reorder(area, EB_estimate), x = EB_estimate)) +
geom_errorbarh(aes(xmin = EB_lower, xmax = EB_upper),
height = 0.3, color = "steelblue", linewidth = 0.8) +
geom_point(size = 2.5, color = "steelblue") +
geom_point(aes(x = y_direct), shape = 17, color = "tomato", size = 2) +
labs(
title = "Estimasi EB dengan Interval Kepercayaan 95% (Jackknife)",
subtitle = "Biru (●) = EB | Merah (▲) = Direct Estimate",
x = "Estimasi λ",
y = "Area"
) +
theme_minimal(base_size = 11)Figure 9.5: Interval kepercayaan 95% estimasi EB per area berdasarkan SE Jackknife. Area diurutkan berdasarkan nilai EB estimate.
data_sae$fitted <- fitted(model_glmm)
data_sae$resid_p <- (data_sae$y - data_sae$fitted) / sqrt(data_sae$fitted)
p_res1 <- ggplot(data_sae, aes(x = fitted, y = resid_p)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_hline(yintercept = 0, color = "tomato", linetype = "dashed") +
geom_smooth(method = "loess", se = TRUE, color = "darkred", linewidth = 0.8) +
labs(title = "Residual Pearson vs Fitted",
x = "Nilai Fitted (λ̂)", y = "Residual Pearson") +
theme_minimal(base_size = 11)
p_res2 <- ggplot(data_sae, aes(sample = resid_p)) +
stat_qq(color = "steelblue", alpha = 0.7) +
stat_qq_line(color = "tomato", linetype = "dashed") +
labs(title = "Q-Q Plot Residual Pearson",
x = "Kuantil Teoretis", y = "Kuantil Sampel") +
theme_minimal(base_size = 11)
p_res1 | p_res2Figure 10.1: Residual Pearson model GLMM Poisson. Residual yang tersebar acak di sekitar nol mengindikasikan model yang baik.
## AIC : 677.32
## BIC : 697.11
## LogLik : -332.66
# Pearson Chi-Square
pearson_chi2 <- sum((data_sae$y - data_sae$fitted)^2 / data_sae$fitted)
df_resid <- nrow(data_sae) - length(fixef(model_glmm)$cond) - 1
cat("\nPearson Chi² :", round(pearson_chi2, 2), "\n")##
## Pearson Chi² : 188.12
## df Residual : 194
cat("Rasio (Chi²/df):", round(pearson_chi2 / df_resid, 3),
"← (mendekati 1 = tidak overdispersi)\n")## Rasio (Chi²/df): 0.97 ← (mendekati 1 = tidak overdispersi)
Model GLMM Poisson berhasil mengidentifikasi struktur data simulasi dengan baik:
Koefisien Fixed Effect: Tanda dan magnitudo koefisien \(\hat{\beta}_1 \approx 0.30\) (positif) dan \(\hat{\beta}_2 \approx -0.20\) (negatif) konsisten dengan nilai true yang digunakan dalam simulasi, menunjukkan bahwa model mampu me-recover parameter yang sebenarnya.
Random Effect: Standar deviasi random intercept yang diestimasi mendekati nilai true \(\sigma_u = 0.5\), mengkonfirmasi bahwa model dapat menangkap variasi antar area dengan baik.
Overdispersi: Rasio Pearson Chi²/df yang mendekati 1 menunjukkan tidak ada overdispersi signifikan, sehingga asumsi Poisson terpenuhi pada data simulasi ini.
| Aspek | Direct Estimate | EB (Jackknife/Bootstrap) |
|---|---|---|
| Dasar Perhitungan | Hanya data area sendiri | Seluruh data (meminjam kekuatan) |
| Bias | Tidak bias (design-unbiased) | Sedikit bias (model-dependent) |
| Presisi (RSE) | Tinggi (RSE besar) | Rendah (RSE kecil) |
| Cocok untuk | Area dengan sampel besar | Area dengan sampel kecil |
| Asumsi model | Tidak perlu | Diperlukan |
Dari hasil analisis:
| Aspek | Jackknife | Bootstrap |
|---|---|---|
| Mekanisme | Leave-one-area-out | Re-sampling dengan pengembalian |
| Komputasi | \(m\) iterasi (\(m\) = jumlah area) | \(B\) iterasi (umumnya 200–1000) |
| Asumsi | Lebih sedikit asumsi distribusi | Memerlukan asumsi i.i.d. |
| Akurasi | Baik untuk bias estimasi | Baik untuk distribusi estimasi |
Kedua metode memberikan hasil yang konsisten dalam penelitian ini, mengkonfirmasi bahwa ketidakpastian estimasi EB per area relatif kecil dibandingkan estimasi langsung.
Area dengan sampel kecil (n < 10): EB memberikan keunggulan paling signifikan karena estimasi langsung sangat tidak stabil.
Borrowing Strength: GLMM memungkinkan area “meminjam informasi” dari area lain melalui fixed effect yang diestimasi secara bersama, sekaligus mempertahankan heterogenitas melalui random effect.
Validasi Model Penting: Keunggulan EB bergantung pada ketepatan spesifikasi model. Jika model salah spesifikasi (misal distribusi tidak Poisson), estimasi EB bisa lebih buruk dari estimasi langsung. Diagnostik model (cek overdispersi, residual) sangat krusial.
Pilihan Metode MSE: Pada data nyata dengan ukuran area bervariasi, Bootstrap umumnya lebih robust, namun membutuhkan waktu komputasi lebih lama. Jackknife lebih cepat dan merupakan pilihan praktis yang baik.
Dokumen ini mendemonstrasikan secara lengkap penerapan Small Area Estimation (SAE) dengan pendekatan Empirical Bayes berbasis GLMM Poisson:
✅ GLMM Poisson berhasil memodelkan data cacahan dengan struktur hierarki area, mengidentifikasi fixed effect dan random effect secara akurat.
✅ Estimasi EB per area menghasilkan nilai yang lebih stabil dibandingkan estimasi langsung, terutama untuk area dengan variabilitas tinggi.
✅ Jackknife MSE memberikan kuantifikasi ketidakpastian estimasi EB yang efisien secara komputasi.
✅ Bootstrap RSE mengonfirmasi hasil Jackknife dan memberikan perspektif tambahan tentang distribusi estimasi.
✅ RSE EB (Jackknife & Bootstrap) secara konsisten lebih rendah dari RSE Direct, mengkonfirmasi keunggulan pendekatan EB dalam meningkatkan presisi estimasi area kecil.
Rekomendasi: Untuk data cacahan pada official statistics (jumlah kemiskinan, kejadian penyakit, dll.), pendekatan GLMM Poisson-EB ini sangat direkomendasikan, terutama saat ukuran sampel area kecil. Perlu dilakukan pengujian overdispersi dan pertimbangan model Negative Binomial sebagai alternatif jika ditemukan overdispersi.
Dokumen ini dibuat menggunakan R Markdown dengan paket rmdformats::material. Analisis berbasis simulasi data untuk keperluan demonstrasi metodologi SAE.