Dokumen ini menyajikan materi Statistika Penarikan Contoh secara komprehensif: review statistika dasar, konsep survei, Simple Random Sampling, Stratified, Systematic, hingga Cluster Sampling — dilengkapi penurunan rumus, contoh soal tingkat lanjut, dan visualisasi menggunakan R & ggplot2.
Peta Materi:
| Bab | Topik | Konsep Kunci |
|---|---|---|
| P01 | Review Statistika Dasar | Parameter, statistik, bias, MSE |
| P02 | Konsep Dasar Survei | Kerangka sampel, sampling/non-sampling error |
| P03 | SRS | FPC, penduga rataan, total, proporsi |
| P04 | Stratified Sampling | Alokasi proporsional, Neyman, optimum |
| P05 | Systematic Sampling | Periodisitas, korelasi intrakelas \(\rho\) |
| P06 | Cluster Sampling | Ratio estimator, ICC, ANOVA |
Mengapa Sampling Diperlukan?
| Alasan | Penjelasan |
|---|---|
| Keterbatasan sumber daya | Sensus mahal dan butuh waktu panjang |
| Kecepatan | Sampel dapat dianalisis lebih cepat |
| Akurasi lebih tinggi | Pengamatan terfokus mengurangi measurement error |
| Sifat destruktif | Uji ketahanan bahan, rasa makanan, dll. |
| Inaccessibility | Populasi tidak dapat dijangkau seluruhnya |
| Populasi (Parameter) | Sampel (Statistik) | |
|---|---|---|
| Rataan | \(\mu = \frac{1}{N}\sum Y_i\) | \(\bar{y} = \frac{1}{n}\sum y_i\) |
| Varians | \(\sigma^2 = \frac{\sum(Y_i-\mu)^2}{N}\) | \(s^2 = \frac{\sum(y_i-\bar{y})^2}{n-1}\) |
| Proporsi | \(P\) | \(\hat{p} = a/n\) |
| Total | \(\tau = N\mu\) | \(\hat{\tau} = N\bar{y}\) |
Penduga \(\hat{\theta}\) dikatakan baik bila memenuhi:
Penurunan MSE:
\(MSE(\hat\theta) = E[(\hat\theta-\theta)^2] = E[(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2]\)
\(= E[(\hat\theta - E\hat\theta)^2] + 2\underbrace{E[(\hat\theta-E\hat\theta)]}_{=0}(E\hat\theta-\theta) + (E\hat\theta-\theta)^2\)
\(\therefore\quad MSE(\hat\theta) = Var(\hat\theta) + Bias^2(\hat\theta)\)
Soal P01.1 — Analisis Bias dan MSE
Populasi \(N=3\): \(\{2,4,6\}\), \(\mu=4\). Ambil \(n=2\) tanpa pengembalian. Evaluasi \(\hat\mu_1=\bar{y}\) dan \(\hat\mu_2=\frac{y_{(1)}+y_{(2)}}{3}\).
populasi <- c(2, 4, 6)
mu <- mean(populasi)
combs <- combn(populasi, 2, simplify=FALSE)
df_est <- do.call(rbind, lapply(combs, function(s){
data.frame(Sampel=paste0("{",s[1],",",s[2],"}"),
mu1=mean(s), mu2=(s[1]+s[2])/3)
}))
e1 <- mean(df_est$mu1); e2 <- mean(df_est$mu2)
b1 <- e1-mu; b2 <- e2-mu
v1 <- mean((df_est$mu1-e1)^2); v2 <- mean((df_est$mu2-e2)^2)
mse1 <- v1+b1^2; mse2 <- v2+b2^2
cat("===== Semua Kemungkinan Sampel =====\n"); print(df_est)## ===== Semua Kemungkinan Sampel =====
## Sampel mu1 mu2
## 1 {2,4} 3 2.000000
## 2 {2,6} 4 2.666667
## 3 {4,6} 5 3.333333
##
## μ populasi = 4.00
##
## --- μ̂₁ = ȳ ---
## E=4.0000 | Bias=0.0000 | Var=0.6667 | MSE=0.6667
##
## --- μ̂₂ = (y₁+y₂)/3 ---
## E=2.6667 | Bias=-1.3333 | Var=0.2963 | MSE=2.0741
df_mse <- data.frame(
Penduga = rep(c("μ̂₁ = ȳ (Tak Bias)","μ̂₂ = (y₁+y₂)/3 (Bias)"), each=3),
Komponen = rep(c("Varians","Bias²","MSE"), 2),
Nilai = c(v1, b1^2, mse1, v2, b2^2, mse2)
)
df_mse$Komponen <- factor(df_mse$Komponen, levels=c("Varians","Bias²","MSE"))
ggplot(df_mse, aes(x=Komponen, y=Nilai, fill=Komponen)) +
geom_col(color="white", linewidth=.7, width=.6) +
geom_text(aes(label=round(Nilai,3)), vjust=-.4, fontface="bold", size=4) +
facet_wrap(~Penduga) +
scale_fill_manual(values=c(pal["teal"],pal["coral"],pal["navy"])) +
labs(title="Dekomposisi MSE: Varians + Bias²",
subtitle="Penduga tak bias: Bias²=0, sehingga MSE = Varians",
x="", y="Nilai") +
th() + theme(legend.position="none",
strip.text=element_text(face="bold", size=11))Soal P01.2 — Robustness: Rataan vs Median Terhadap Outlier
Simulasi 100 data pendapatan (log-normal) dengan 5 outlier ekstrem.
set.seed(123)
bersih <- rlnorm(95, meanlog=2, sdlog=.5)
outlier <- c(500, 600, 700, 800, 1000)
kotor <- c(bersih, outlier)
cat(sprintf("%-18s %12s %12s\n","Ukuran","Data Bersih","Data Kotor"))## Ukuran Data Bersih Data Kotor
## Rataan 8.454 44.031
## Median 7.654 7.872
## Std Dev 3.862 160.628
## IQR 4.505 5.319
df_dens <- rbind(
data.frame(x=bersih, status="Data Bersih"),
data.frame(x=kotor, status="Data + Outlier")
)
ggplot(df_dens, aes(x=x, fill=status, color=status)) +
geom_density(alpha=.3, linewidth=1) +
geom_vline(xintercept=mean(bersih), color=pal["teal"], linetype="dashed", linewidth=1) +
geom_vline(xintercept=mean(kotor), color=pal["coral"], linetype="dashed", linewidth=1) +
geom_vline(xintercept=median(bersih),color=pal["navy"], linetype="solid", linewidth=.9) +
annotate("text",x=mean(bersih)+1.5,y=.06,label="Mean Bersih",color=pal["teal"],size=3.5) +
annotate("text",x=mean(kotor)+1.5, y=.04,label="Mean+Outlier",color=pal["coral"],size=3.5) +
scale_fill_manual(values=c(pal["teal"],pal["coral"])) +
scale_color_manual(values=c(pal["teal"],pal["coral"])) +
coord_cartesian(xlim=c(0,30)) +
labs(title="Efek Outlier: Rataan vs Median",
subtitle="Median jauh lebih robust — tidak bergeser drastis akibat outlier",
x="Pendapatan (juta Rp)", y="Kepadatan", fill="", color="") +
th()💡 Breakdown Point: Median tahan hingga 50% data diganti outlier. Rataan memiliki breakdown point 0% — satu outlier saja sudah menggeser nilainya.
| Istilah | Definisi | Contoh |
|---|---|---|
| Unsur (Element) | Objek tempat pengukuran diambil | Individu, rumah tangga |
| Satuan PC | Kumpulan unsur tidak tumpang-tindih | RT, desa |
| Kerangka (Frame) | Daftar satuan PC sebagai dasar sampling | Buku alamat, daftar KK |
| Target Populasi | Populasi yang ingin digeneralisasi | Semua mahasiswa aktif |
| Studi Populasi | Populasi yang benar-benar tercakup frame | Mahasiswa terdaftar |
| Probability Sampling | Non-Probability Sampling | |
|---|---|---|
| Peluang terpilih | Diketahui, > 0 | Tidak diketahui |
| Inferensi statistik | Valid | Tidak valid secara formal |
| Contoh | SRS, Stratified, Cluster | Convenience, Snowball, Quota |
| Kelemahan | Butuh kerangka sampling | Rentan selection bias |
\[\text{Total Error} = \text{Sampling Error} + \text{Non-sampling Error}\]
Sampling Error — dapat dikurangi dengan memperbesar \(n\): \[SE(\bar{y}) = \sqrt{\frac{s^2}{n}\left(1-\frac{n}{N}\right)}\]
Non-sampling Error — tidak hilang meski \(n\) diperbesar:
Soal P02.1 — Non-Response Bias (Worst-Case Analysis)
Dari 200 sampel, 120 merespon (rata-rata kepuasan = 4.2/5). Perkirakan kisaran nilai populasi bila non-responden sangat tidak puas.
n_resp <- 120; n_nonresp <- 80; mean_resp <- 4.2
nonresp_range <- seq(1.0, 4.2, by=0.1)
mean_pop <- (n_resp*mean_resp + n_nonresp*nonresp_range) / (n_resp+n_nonresp)
df_nr <- data.frame(mean_nonresp=nonresp_range, mean_pop=mean_pop)
cat("=== Skenario Non-Response ===\n")## === Skenario Non-Response ===
for(k in c(1.5, 2.0, 2.5, 3.0, 4.2))
cat(sprintf("Non-resp mean=%.1f → Est.populasi=%.3f\n",
k, (n_resp*mean_resp+n_nonresp*k)/(n_resp+n_nonresp)))## Non-resp mean=1.5 → Est.populasi=3.120
## Non-resp mean=2.0 → Est.populasi=3.320
## Non-resp mean=2.5 → Est.populasi=3.520
## Non-resp mean=3.0 → Est.populasi=3.720
## Non-resp mean=4.2 → Est.populasi=4.200
worst_y <- (n_resp*mean_resp + n_nonresp*1.5)/(n_resp+n_nonresp)
ggplot(df_nr, aes(mean_nonresp, mean_pop)) +
geom_ribbon(aes(ymin=mean_pop, ymax=4.2), fill=pal["coral"], alpha=.15) +
geom_line(color=pal["teal"], linewidth=1.4) +
geom_point(data=data.frame(x=1.5, y=worst_y), aes(x=x,y=y),
color=pal["coral"], size=4) +
annotate("text",x=1.7,y=worst_y-.07,label="Skenario terburuk",
color=pal["coral"],size=3.5) +
geom_hline(yintercept=4.2, linetype="dashed", color=pal["amber"], linewidth=1) +
annotate("text",x=3.8,y=4.25,label="Jika semua merespon (4.2)",
color=pal["amber"],size=3.5) +
labs(title="Dampak Non-Response Bias pada Estimasi Kepuasan",
subtitle="Area merah = rentang bias yang mungkin terjadi",
x="Rata-rata Kepuasan Non-Responden",
y="Estimasi Rata-rata Populasi") +
th()Soal P02.2 — Coverage Error: Survei Internet Pedesaan
Frame = daftar telepon rumah. Analisis arah bias estimasi proporsi pengguna internet.
df_pop <- data.frame(
Kelompok = c("Tel. Rumah\n(Tercakup)","Hanya HP\n(Parsial)","Tanpa Telp\n(Excluded)"),
Proporsi = c(0.30, 0.50, 0.20),
PenggunaNet = c(0.75, 0.60, 0.25),
Cakupan = c("Ya","Parsial","Tidak")
)
p_true <- sum(df_pop$Proporsi * df_pop$PenggunaNet)
p_frame <- df_pop$PenggunaNet[1]
ggplot(df_pop, aes(x=Kelompok, y=PenggunaNet, fill=Cakupan)) +
geom_col(color="white", linewidth=.7, width=.6) +
geom_hline(yintercept=p_true, color=pal["teal"], linetype="dashed", linewidth=1.2) +
geom_hline(yintercept=p_frame, color=pal["coral"],linetype="dashed", linewidth=1.2) +
annotate("text",x=3.5,y=p_true+.02,
label=paste0("P populasi = ",round(p_true,2)),
color=pal["teal"],size=3.5,hjust=1) +
annotate("text",x=3.5,y=p_frame-.03,
label=paste0("P dari frame = ",p_frame," (overestimate)"),
color=pal["coral"],size=3.5,hjust=1) +
scale_fill_manual(values=c("Tidak"=pal["coral"],"Parsial"=pal["amber"],"Ya"=pal["teal"])) +
scale_y_continuous(labels=percent_format()) +
labs(title="Ilustrasi Coverage Error pada Survei Internet",
subtitle=sprintf("Bias = %.2f — overestimate karena kelompok tanpa akses tidak tercakup",
p_frame-p_true),
x="", y="Proporsi Pengguna Internet", fill="Cakupan Frame") +
th()SRS adalah metode di mana setiap kemungkinan sampel berukuran \(n\) dari populasi \(N\) memiliki peluang terpilih yang sama: \(\binom{N}{n}^{-1}\).
| SRS-WR (dengan kembali) | SRS-WOR (tanpa kembali) | |
|---|---|---|
| \(Var(\bar{y})\) | \(\sigma^2/n\) | \(\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}\) |
| FPC | Tidak ada | \(\frac{N-n}{N-1}\) |
| Efisiensi | Lebih rendah | Lebih tinggi |
1. Rataan: \[\bar{y} = \frac{1}{n}\sum y_i \qquad \widehat{Var}(\bar{y}) = \frac{s^2}{n}\!\left(1-\frac{n}{N}\right) \qquad B = 2\sqrt{\widehat{Var}(\bar{y})}\]
2. Total: \[\hat{\tau} = N\bar{y} \qquad \widehat{Var}(\hat\tau) = N^2\widehat{Var}(\bar{y})\]
3. Proporsi: \[\hat{p}=\frac{a}{n} \qquad \widehat{Var}(\hat{p})=\frac{\hat{p}(1-\hat{p})}{n-1}\!\left(1-\frac{n}{N}\right)\]
Dalam SRS-WOR unit-unit tidak independen. Kovarians antar unit: \[Cov(y_i,y_j) = -\frac{\sigma^2}{N-1} \quad (i\neq j)\]
Ekspansi \(Var(\bar{y}) = \frac{1}{n^2}\!\left[n\sigma^2 + n(n-1)\!\left(-\frac{\sigma^2}{N-1}\right)\right] = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}\)
Jika \(n\to N\): FPC \(\to 0\) → sensus, tidak ada
ketidakpastian.
Jika \(n/N \leq 5\%\): FPC \(\approx 1\), dapat diabaikan.
Untuk menduga \(\mu\) dengan batas kesalahan \(B\) (kepercayaan 95%): \[n = \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad D = \frac{B^2}{4}\]
Untuk menduga \(p\): \[n = \frac{Np(1-p)}{(N-1)D + p(1-p)}\]
Bila \(N\to\infty\): \(n \approx \sigma^2/D\) atau \(n \approx p(1-p)/D\).
Soal P03.1 — Pendugaan Total Produksi Kopi
\(N=1500\) pohon, SRS \(n=50\). Estimasi total produksi dan 95% CI.
set.seed(456); N <- 1500; n <- 50
data_kopi <- rnorm(n, mean=15, sd=3)
y_bar <- mean(data_kopi); s_sq <- var(data_kopi)
var_ybar <- (s_sq/n)*(1-n/N)
tau_hat <- N*y_bar
var_tau <- N^2*var_ybar
B_tau <- 2*sqrt(var_tau)
ci_low <- tau_hat - qnorm(.975)*sqrt(var_tau)
ci_high <- tau_hat + qnorm(.975)*sqrt(var_tau)
cat(sprintf("ȳ = %.3f kg | s² = %.3f\n", y_bar, s_sq))## ȳ = 15.442 kg | s² = 10.046
## Var(ȳ) = 0.19422 | SE(ȳ) = 0.4407
## τ̂ = 23163.2 kg | SE(τ̂) = 661.05
## B = 1322.10 kg
## CI 95%: [21867.6, 24458.9] kg
# Distribusi sampling simulasi
set.seed(789)
pop_kopi <- rnorm(N, 15, 3)
tau_sim <- replicate(3000, N*mean(sample(pop_kopi, n)))
ggplot(data.frame(tau=tau_sim), aes(x=tau)) +
geom_histogram(aes(y=after_stat(density)), bins=50,
fill=pal["teal"], color="white", alpha=.85) +
geom_vline(xintercept=tau_hat, color=pal["coral"], linewidth=1.3, linetype="dashed") +
geom_vline(xintercept=c(ci_low,ci_high), color=pal["amber"],
linewidth=1, linetype="dotted") +
annotate("text",x=tau_hat+60,y=.007,
label=sprintf("τ̂ = %.0f kg",tau_hat),
color=pal["coral"],hjust=0,size=4) +
labs(title="Distribusi Sampling τ̂ (Simulasi 3000 sampel)",
subtitle="Garis merah = estimasi titik | Garis kuning = batas CI 95%",
x="Estimasi Total Produksi (kg)", y="Densitas") + th()Soal P03.2 — Visualisasi Efek FPC
Bandingkan SE dengan dan tanpa FPC pada berbagai fraksi sampling.
N <- 1000; sigma <- 50
n_seq <- 1:(N-1)
se_wr <- sigma/sqrt(n_seq)
se_wor <- sigma/sqrt(n_seq)*sqrt((N-n_seq)/(N-1))
df_fpc <- data.frame(n=n_seq,
se_wr=se_wr, se_wor=se_wor,
frac=n_seq/N)
p1 <- ggplot(df_fpc, aes(x=n)) +
geom_line(aes(y=se_wr, color="SRS-WR"), linewidth=1.2) +
geom_line(aes(y=se_wor,color="SRS-WOR"), linewidth=1.2) +
geom_vline(xintercept=0.05*N, linetype="dashed", color="#aaa") +
annotate("text",x=60,y=12,label="5% N",color="#777",size=3.5) +
scale_color_manual(values=c(pal["coral"],pal["teal"])) +
labs(title="SE: WR vs WOR", x="n", y="SE(ȳ)", color="") + th()
p2 <- ggplot(df_fpc, aes(x=frac, y=sqrt((N-n_seq)/(N-1)))) +
geom_line(color=pal["purple"], linewidth=1.3) +
geom_hline(yintercept=1, linetype="dashed", color=pal["coral"]) +
scale_x_continuous(labels=percent_format()) +
labs(title="Nilai FPC vs Fraksi Sampling",
x="Fraksi Sampling (n/N)", y="Nilai FPC") + th()
grid.arrange(p1, p2, ncol=2)Soal P03.3 — Ukuran Sampel untuk Proporsi
Proporsi persetujuan \(p=0.6\), \(B=0.03\), kepercayaan 95%. Berapa \(n\) minimum?
p_prior <- 0.6; B <- 0.03; D <- B^2/4
N_vals <- c(500, 1000, 5000, 10000)
n_vals <- ceiling(N_vals*p_prior*(1-p_prior) /
((N_vals-1)*D + p_prior*(1-p_prior)))
n_inf <- ceiling(p_prior*(1-p_prior)/D)
n_cons <- ceiling(0.5*0.5/D)
cat("=== Ukuran Sampel Minimum (p=0.6, B=0.03) ===\n")## === Ukuran Sampel Minimum (p=0.6, B=0.03) ===
df_n <- data.frame(N=format(N_vals,big.mark=","), n_min=n_vals)
df_n <- rbind(df_n, data.frame(N="∞", n_min=n_inf))
print(df_n)## N n_min
## 1 500 341
## 2 1,000 517
## 3 5,000 880
## 4 10,000 964
## 5 ∞ 1067
##
## Konservatif p=0.5: n = 1112
Stratified Random Sampling (PCAB) membagi populasi menjadi \(L\) strata saling lepas, lalu menerapkan SRS dalam setiap strata.
Prinsip: Homogen dalam strata, heterogen antar strata.
Rataan berlapis: \[\bar{y}_{st} = \sum_{h=1}^L W_h\bar{y}_h, \quad W_h = N_h/N\]
Variansi: \[\widehat{Var}(\bar{y}_{st}) = \sum_{h=1}^L W_h^2\frac{s_h^2}{n_h}\!\left(1-\frac{n_h}{N_h}\right)\]
Total: \(\hat\tau_{st} = N\bar{y}_{st} = \sum_{h=1}^L N_h\bar{y}_h\)
| Alokasi | Rumus \(n_h\) | Keunggulan |
|---|---|---|
| Proporsional | \(n_h = n\,\dfrac{N_h}{N}\) | Sederhana, menjamin representasi |
| Neyman | \(n_h = n\,\dfrac{N_h s_h}{\sum N_j s_j}\) | Meminimasi \(Var\) bila biaya sama |
| Optimum | \(n_h = n\,\dfrac{N_h s_h/\sqrt{c_h}}{\sum N_j s_j/\sqrt{c_j}}\) | Meminimasi \(Var\) dengan biaya berbeda |
Hierarki: \(Var_{Opt} \leq Var_{Neyman} \leq Var_{Prop} \leq Var_{SRS}\)
Pembuktian Stratified ≥ SRS:
\[Var_{SRS}(\bar{y}) \approx Var_{Prop}(\bar{y}_{st}) + \frac{1}{n}\sum_{h=1}^L W_h(\mu_h-\mu)^2\]
Karena \(\sum W_h(\mu_h-\mu)^2 \geq
0\), maka \(Var_{SRS} \geq
Var_{Prop}\).
Keduanya sama bila \(\mu_h = \mu\)
untuk semua \(h\) (strata tidak
informatif).
Soal P04.1 — Survei Konsumsi Listrik (3 Strata)
\(N=6200\) pelanggan. Bandingkan alokasi Proporsional, Neyman, dan Optimum dengan \(n=300\).
N_h <- c(5000, 1000, 200)
s_h <- c(50, 200, 500)
c_h <- c(9, 25, 100)
N <- sum(N_h); n_tot <- 300
n_prop <- round(n_tot * N_h/N)
n_neyman <- round(n_tot * (N_h*s_h)/sum(N_h*s_h))
n_optim <- round(n_tot * (N_h*s_h/sqrt(c_h))/sum(N_h*s_h/sqrt(c_h)))
df_alok <- data.frame(
Strata = rep(c("RT\n(N=5000,s=50)","Bisnis\n(N=1000,s=200)","Industri\n(N=200,s=500)"),3),
Alokasi = rep(c("Proporsional","Neyman","Optimum"), each=3),
n_h = c(n_prop, n_neyman, n_optim)
)
df_alok$Alokasi <- factor(df_alok$Alokasi, levels=c("Proporsional","Neyman","Optimum"))
ggplot(df_alok, aes(x=Strata, y=n_h, fill=Alokasi)) +
geom_col(position="dodge", color="white", linewidth=.6) +
geom_text(aes(label=n_h), position=position_dodge(.9),
vjust=-.4, size=3.8, fontface="bold") +
scale_fill_manual(values=c(pal["teal"],pal["coral"],pal["purple"])) +
labs(title="Perbandingan Tiga Skema Alokasi Sampel",
subtitle="Neyman & Optimum mengalokasikan lebih banyak ke strata bervariansi tinggi",
x="", y="Jumlah Sampel per Strata (nₕ)", fill="Metode") +
th()##
## === Tabel Alokasi ===
print(data.frame(Strata=c("RT","Bisnis","Industri"),
N_h=N_h, s_h=s_h, c_h=c_h,
Prop=n_prop, Neyman=n_neyman, Optimum=n_optim))## Strata N_h s_h c_h Prop Neyman Optimum
## 1 RT 5000 50 9 242 136 188
## 2 Bisnis 1000 200 25 48 109 90
## 3 Industri 200 500 100 10 55 23
var_st <- function(nh){ sum((N_h/N)^2 * s_h^2/nh * (1-nh/N_h)) }
cat("=== Variansi per Skema Alokasi ===\n")## === Variansi per Skema Alokasi ===
## Proporsional : Var=51.7455202 | SE=7.19344
## Neyman : Var=23.5652831 | SE=4.85441
## Optimum : Var=28.8546790 | SE=5.37166
Soal P04.2 — Estimasi Proporsi Berlapis
\(N_1=3000\) (Eksakta), \(N_2=2000\) (Non-Eksakta). Sampel \(n_1=150\) (90 setuju), \(n_2=100\) (80 setuju). Hitung \(\hat{p}_{st}\) dan CI 95%.
N_s <- c(3000,2000); N_tot <- sum(N_s)
n_s <- c(150,100); setuju <- c(90,80)
p_h <- setuju/n_s
p_st <- sum(N_s/N_tot * p_h)
var_h <- p_h*(1-p_h)/(n_s-1) * (1-n_s/N_s)
var_stp <- sum((N_s/N_tot)^2 * var_h)
B_p <- 2*sqrt(var_stp)
cat(sprintf("p̂₁ = %d/%d = %.4f | p̂₂ = %d/%d = %.4f\n",setuju[1],n_s[1],p_h[1],setuju[2],n_s[2],p_h[2]))## p̂₁ = 90/150 = 0.6000 | p̂₂ = 80/100 = 0.8000
## p̂_st = 0.6800 | SE = 0.0282 | B = 0.0564
## CI 95%: [0.6236, 0.7364]
df_ci <- data.frame(
Strata=c("Eksakta","Non-Eksakta","Gabungan (st)"),
p=c(p_h,p_st), se=c(sqrt(var_h),sqrt(var_stp))
)
ggplot(df_ci, aes(y=Strata, x=p, color=Strata)) +
geom_errorbarh(aes(xmin=p-2*se, xmax=p+2*se), height=.25, linewidth=1.4) +
geom_point(size=4) +
scale_color_manual(values=c(pal["teal"],pal["coral"],pal["navy"])) +
scale_x_continuous(labels=percent_format(), limits=c(.45,1)) +
labs(title="CI 95% Proporsi Setuju Kuliah Hybrid per Strata",
x="Proporsi (%)", y="") +
th() + theme(legend.position="none")Systematic Sampling memilih unit awal secara acak (\(m\), \(1\leq m\leq k\)) kemudian mengambil setiap unit ke-\(k\): urutan \(m,\, m+k,\, m+2k,\,\ldots\)
Interval: \(k = \lfloor N/n \rfloor\)
Variansi teoritis: \[Var(\bar{y}_{sys}) = \frac{\sigma^2}{n}[1+(n-1)\rho]\]
\(\rho\) = korelasi intrakelas antar unit dalam satu sampel sistematik.
| Tipe | \(\rho\) | Efisiensi vs SRS |
|---|---|---|
| Acak | \(\approx 0\) | Sama |
| Terurut (ada tren) | \(< 0\) | Lebih baik |
| Periodik (\(k\) = periode) | \(> 0\) | Lebih buruk (berbahaya) |
Soal P05.1 — Simulasi Tiga Tipe Populasi
\(N=500\), \(n=50\). Bandingkan distribusi \(\bar{y}_{sys}\) pada populasi acak, terurut, dan periodik.
set.seed(2024)
N <- 500; n <- 50; k <- N/n
pop_acak <- rnorm(N, 50, 10)
pop_terurut <- seq(10, 90, length=N) + rnorm(N, 0, 3)
pop_periodik <- 50 + 30*sin(2*pi*(1:N)/k) + rnorm(N, 0, 2)
sim_sys <- function(pop){
sapply(1:k, function(m) mean(pop[seq(m, by=k, length.out=n)]))
}
r_acak <- sim_sys(pop_acak)
r_urut <- sim_sys(pop_terurut)
r_peri <- sim_sys(pop_periodik)
df_sim <- rbind(
data.frame(xbar=r_acak, tipe="Populasi Acak\n(ρ ≈ 0)"),
data.frame(xbar=r_urut, tipe="Populasi Terurut\n(ρ < 0)"),
data.frame(xbar=r_peri, tipe="Populasi Periodik\n(ρ > 0, berbahaya)")
)
df_sim$tipe <- factor(df_sim$tipe,
levels=c("Populasi Acak\n(ρ ≈ 0)","Populasi Terurut\n(ρ < 0)",
"Populasi Periodik\n(ρ > 0, berbahaya)"))
ggplot(df_sim, aes(x=xbar, fill=tipe)) +
geom_histogram(aes(y=after_stat(density)), bins=20, alpha=.8, color="white") +
geom_vline(xintercept=50, color=pal["navy"], linetype="dashed", linewidth=1) +
facet_wrap(~tipe, ncol=1, scales="free_y") +
scale_fill_manual(values=c(pal["teal"],pal["sage"],pal["coral"])) +
labs(title="Distribusi Sampling ȳ_sys pada Tiga Tipe Populasi",
subtitle="Lebar distribusi mencerminkan variansi estimasi",
x="Nilai ȳ_sys", y="Densitas") +
th() + theme(legend.position="none",
strip.text=element_text(face="bold", size=10))var_tbl <- data.frame(
Tipe = c("Acak","Terurut","Periodik"),
Var_sys = round(c(var(r_acak), var(r_urut), var(r_peri)), 5),
Var_SRS = round(rep(var(pop_acak)/n*(1-n/N),3), 5)
)
var_tbl$RE <- round(var_tbl$Var_SRS/var_tbl$Var_sys, 3)
var_tbl$Status <- c("Sama","Lebih Efisien","Lebih Buruk")
cat("=== Variansi: Sistematik vs SRS ===\n"); print(var_tbl)## === Variansi: Sistematik vs SRS ===
## Tipe Var_sys Var_SRS RE Status
## 1 Acak 1.68427 1.69849 1.008 Sama
## 2 Terurut 0.58364 1.69849 2.910 Lebih Efisien
## 3 Periodik 504.00735 1.69849 0.003 Lebih Buruk
Soal P05.2 — Bahaya Periodisitas: Omzet Supermarket
Sistematik dengan \(k=7\) pada data omzet mingguan → setiap sampel hanya mencakup satu hari.
set.seed(99)
hari <- 1:365
siklus <- c(80, 75, 85, 90, 95, 150, 160)
omzet <- siklus[((hari-1)%%7)+1] + rnorm(365, 0, 8)
mu_true <- mean(omzet)
means7 <- sapply(1:7, function(m) mean(omzet[seq(m,365,7)]))
df_sys7 <- data.frame(
Start=1:7,
Hari =c("Senin","Selasa","Rabu","Kamis","Jumat","Sabtu","Minggu"),
Mean =round(means7,2),
Bias =round(means7-mu_true,2)
)
cat("=== Estimasi per Start Day ===\n"); print(df_sys7)## === Estimasi per Start Day ===
## Start Hari Mean Bias
## 1 1 Senin 80.56 -23.75
## 2 2 Selasa 73.24 -31.08
## 3 3 Rabu 86.13 -18.18
## 4 4 Kamis 88.19 -16.13
## 5 5 Jumat 93.84 -10.48
## 6 6 Sabtu 149.63 45.32
## 7 7 Minggu 159.07 54.75
##
## Rata-rata Populasi = 104.31
## Rentang estimasi : [73.24, 159.07]
df_omz <- data.frame(hari=hari, omzet=omzet)
ggplot(df_omz, aes(x=hari, y=omzet)) +
geom_line(color="#ccc", linewidth=.4) +
geom_point(data=df_omz[seq(1,365,7),],color=pal["teal"], size=1.5) +
geom_point(data=df_omz[seq(7,365,7),],color=pal["coral"],size=1.5) +
geom_hline(yintercept=mu_true, linetype="dashed",color=pal["navy"], linewidth=1) +
geom_hline(yintercept=means7[1], linetype="dotted",color=pal["teal"], linewidth=1) +
geom_hline(yintercept=means7[7], linetype="dotted",color=pal["coral"], linewidth=1) +
annotate("text",x=355,y=mu_true+4, label="Populasi", color=pal["navy"], size=3.5,hjust=1) +
annotate("text",x=355,y=means7[1]-4, label="Senin (under)", color=pal["teal"], size=3.5,hjust=1) +
annotate("text",x=355,y=means7[7]+4, label="Minggu (over)", color=pal["coral"],size=3.5,hjust=1) +
labs(title="Periodisitas k=7 pada Data Omzet Mingguan",
subtitle="Sampel hari Senin under-estimate; hari Minggu over-estimate",
x="Hari ke-", y="Omzet (juta Rp)") + th()⚠️ Aturan: Jangan gunakan \(k =\) kelipatan periode siklus. Solusi: gunakan stratified dengan strata = hari dalam minggu, atau SRS penuh.
Soal P05.3 — Korelasi Intrakelas: Bukti \(Var=0\) saat \(\rho=-1/(n-1)\)
n_val <- 10; sigma2 <- 100
rho_kritis <- -1/(n_val-1)
var_kritis <- sigma2/n_val*(1+(n_val-1)*rho_kritis)
cat(sprintf("n=%d, σ²=%.0f\n", n_val, sigma2))## n=10, σ²=100
## ρ kritis = -1/(n-1) = -0.111111
cat(sprintf("Var(ȳ_sys) = %.0f/%.0f × [1+%d×(%.6f)] = %.8f ≈ 0 ✓\n",
sigma2,n_val,n_val-1,rho_kritis,var_kritis))## Var(ȳ_sys) = 100/10 × [1+9×(-0.111111)] = 0.00000000 ≈ 0 ✓
# Plot Var vs rho
rho_seq <- seq(rho_kritis, 0.9, by=0.01)
df_icc <- data.frame(rho=rho_seq,
var=sigma2/n_val*(1+(n_val-1)*rho_seq))
ggplot(df_icc, aes(rho, var)) +
geom_line(color=pal["teal"], linewidth=1.3) +
geom_hline(yintercept=sigma2/n_val, color=pal["amber"], linetype="dashed") +
geom_vline(xintercept=0, color="#aaa", linetype="dotted") +
geom_point(x=rho_kritis, y=0, size=4, color=pal["coral"]) +
annotate("text",x=rho_kritis+.05,y=2,
label=sprintf("ρ=%.4f\nVar=0",rho_kritis),
color=pal["coral"],size=3.5) +
annotate("text",x=.7,y=sigma2/n_val+1.5,
label="Var_SRS",color=pal["amber"],size=3.5) +
labs(title="Var(ȳ_sys) sebagai Fungsi Korelasi Intrakelas ρ",
x="ρ (korelasi intrakelas)", y="Var(ȳ_sys)") + th()Cluster Sampling (PCG) menggunakan gerombol (kumpulan unsur) sebagai satuan penarikan contoh. Semua unsur dalam gerombol terpilih wajib diamati.
Prinsip: Heterogen dalam gerombol, homogen antar gerombol — berlawanan dengan Stratified.
\(N\) = total gerombol, \(n\) = gerombol sampel, \(m_i\) = ukuran gerombol ke-\(i\), \(y_i\) = total nilai gerombol ke-\(i\):
\[\hat{\bar{Y}} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n m_i}\]
\[\widehat{Var}(\hat{\bar{Y}}) = \left(1-\frac{n}{N}\right)\frac{1}{n\bar{M}^2}\cdot\frac{\sum_{i=1}^n(y_i-\hat{\bar{Y}}\cdot m_i)^2}{n-1}\]
\[\rho_{ICC} \approx \frac{MSB - MSW}{MSB + (M-1)MSW}\]
\[RE = \frac{Var_{SRS}}{Var_{cluster}} = \frac{1}{1+(M-1)\rho}\]
Soal P06.1 — Estimasi Rasio: Nilai Ujian Siswa SD
\(N=100\) sekolah, \(n=5\) sekolah terpilih, \(\bar{M}=45\). Estimasi rata-rata nilai per siswa.
df_cl <- data.frame(
Sekolah = paste0("S",1:5),
m_i = c(50, 20, 100, 30, 40),
y_i = c(3500, 1600, 6500, 2400, 3000)
)
n_cl <- 5; N_cl <- 100; M_bar <- 45
ydb <- sum(df_cl$y_i)/sum(df_cl$m_i)
df_cl$rata_i <- df_cl$y_i/df_cl$m_i
df_cl$diff_sq <- (df_cl$y_i - ydb*df_cl$m_i)^2
var_ydb <- (1-n_cl/N_cl)/(n_cl*M_bar^2)*sum(df_cl$diff_sq)/(n_cl-1)
B_cl <- 2*sqrt(var_ydb)
cat("=== Data Gerombol ===\n"); print(df_cl)## === Data Gerombol ===
## Sekolah m_i y_i rata_i diff_sq
## 1 S1 50 3500 70 1736.111
## 2 S2 20 1600 80 33611.111
## 3 S3 100 6500 65 340277.778
## 4 S4 30 2400 80 75625.000
## 5 S5 40 3000 75 27777.778
##
## ȳ̄ = Σy_i/Σm_i = 17000/240 = 70.83
## Var(ȳ̄) = 11.236454 | SE = 3.3521 | B = 6.7042
## CI 95%: [64.13, 77.54]
p1 <- ggplot(df_cl, aes(x=reorder(Sekolah,-m_i), y=rata_i, fill=Sekolah)) +
geom_col(color="white",linewidth=.6,width=.6) +
geom_hline(yintercept=ydb,linetype="dashed",color=pal["navy"],linewidth=1) +
annotate("text",x=5,y=ydb+.8,label=sprintf("ȳ̄=%.1f",ydb),
color=pal["navy"],hjust=1,size=4) +
scale_fill_manual(values=unname(pal[1:5])) +
labs(title="Rata-rata Nilai per Sekolah",x="Sekolah",y="Rata-rata Nilai") +
th() + theme(legend.position="none")
p2 <- ggplot(df_cl, aes(x=Sekolah, y=m_i, fill=Sekolah)) +
geom_col(color="white",linewidth=.6,width=.6) +
scale_fill_manual(values=unname(pal[1:5])) +
labs(title="Ukuran Gerombol (Jumlah Siswa)",x="Sekolah",y="mᵢ") +
th() + theme(legend.position="none")
grid.arrange(p1, p2, ncol=2)Soal P06.2 — Efisiensi Relatif vs ICC
Visualisasikan RE cluster sampling pada berbagai nilai \(\rho\) dan ukuran gerombol \(M\).
M_vals <- c(5, 10, 20, 50)
rho_seq <- seq(-0.2, 0.8, by=0.01)
df_re <- expand.grid(M=M_vals, rho=rho_seq) |>
mutate(RE=1/(1+(M-1)*rho), M_lab=paste0("M=",M))
ggplot(df_re, aes(x=rho, y=RE, color=factor(M), group=factor(M))) +
geom_line(linewidth=1.2) +
geom_hline(yintercept=1, linetype="dashed", color="#aaa") +
geom_vline(xintercept=0, linetype="dotted", color="#ccc") +
annotate("text",x=.65,y=1.08,label="RE=1 (≡ SRS)",color="#777",size=3.5) +
annotate("rect",xmin=0,xmax=.8,ymin=-Inf,ymax=1,
fill=pal["coral"],alpha=.07) +
annotate("text",x=.4,y=.3,label="Cluster kurang\nefisien dari SRS",
color=pal["coral"],size=3.5) +
scale_color_manual(values=c(pal["sage"],pal["teal"],pal["amber"],pal["coral"])) +
labs(title="Efisiensi Relatif (RE) Cluster vs SRS",
x="Korelasi Intrakelas (ρ)", y="RE = Var(SRS)/Var(Cluster)",
color="Ukuran Gerombol") + th()Soal P06.3 — ANOVA untuk Menghitung ICC
Simulasikan dua kasus: gerombol homogen dalam (ICC tinggi) vs heterogen dalam (ICC rendah).
set.seed(2025)
N_cl2 <- 20; M2 <- 10
means_c <- rnorm(N_cl2, 70, 15)
pop1 <- unlist(lapply(means_c, function(m) rnorm(M2, m, 3)))
pop2 <- unlist(lapply(seq(60,80,length=N_cl2), function(m) rnorm(M2, 70, 12)))
grp <- rep(1:N_cl2, each=M2)
hitung_icc <- function(y, grp){
ms <- summary(aov(y~factor(grp)))[[1]][,"Mean Sq"]
MSB <- ms[1]; MSW <- ms[2]; M <- length(y)/length(unique(grp))
rho <- (MSB-MSW)/(MSB+(M-1)*MSW)
list(MSB=MSB, MSW=MSW, rho=rho, RE=1/(1+(M-1)*rho))
}
r1 <- hitung_icc(pop1, grp); r2 <- hitung_icc(pop2, grp)
cat("=== Kasus 1: Gerombol Homogen dalam, Heterogen antar ===\n")## === Kasus 1: Gerombol Homogen dalam, Heterogen antar ===
cat(sprintf("MSB=%.1f | MSW=%.1f | ρ=%.4f | RE=%.4f → %s\n\n",
r1$MSB,r1$MSW,r1$rho,r1$RE,
ifelse(r1$RE<1,"KURANG EFISIEN dari SRS","LEBIH EFISIEN dari SRS")))## MSB=2010.3 | MSW=8.8 | ρ=0.9577 | RE=0.1040 → KURANG EFISIEN dari SRS
## === Kasus 2: Gerombol Heterogen dalam, Homogen antar ===
cat(sprintf("MSB=%.1f | MSW=%.1f | ρ=%.4f | RE=%.4f → %s\n",
r2$MSB,r2$MSW,r2$rho,r2$RE,
ifelse(r2$RE<1,"KURANG EFISIEN dari SRS","LEBIH EFISIEN dari SRS")))## MSB=95.4 | MSW=142.1 | ρ=-0.0340 | RE=1.4415 → LEBIH EFISIEN dari SRS
df_vis <- rbind(
data.frame(nilai=pop1, gerombol=grp, kasus="Kasus 1: ICC Tinggi (ρ>0)"),
data.frame(nilai=pop2, gerombol=grp, kasus="Kasus 2: ICC Rendah (ρ≈0)")
)
ggplot(df_vis, aes(x=factor(gerombol), y=nilai, color=kasus)) +
geom_jitter(width=.2, alpha=.5, size=.8) +
stat_summary(fun=mean, geom="point", shape=18, size=4, color=pal["navy"]) +
facet_wrap(~kasus, scales="free") +
scale_color_manual(values=c(pal["coral"],pal["teal"])) +
labs(title="Struktur Gerombol dan Pengaruhnya pada ICC",
subtitle="Berlian navy = rata-rata gerombol | Semakin seragam isi gerombol → ICC makin tinggi",
x="Gerombol ke-", y="Nilai") +
th() + theme(legend.position="none",
axis.text.x=element_text(size=7),
strip.text=element_text(face="bold"))tbl <- data.frame(
Metode = c("SRS","Stratified","Systematic","Cluster"),
Frame = c("Daftar unsur","Daftar unsur per strata",
"Daftar unsur terurut","Daftar gerombol"),
Prinsip = c("Semua sampel-n peluang sama",
"Homogen dalam, heterogen antar strata",
"Interval tetap dari start acak",
"Heterogen dalam, homogen antar gerombol"),
Efisiensi = c("Baseline","≥ SRS (bila strata relevan)",
"≥ SRS (terurut); ≤ SRS (periodik)",
"≤ SRS (bila ICC tinggi)"),
Biaya = c("Sedang","Tinggi","Rendah","Sangat Rendah"),
Risiko = c("—","Stratifikasi tidak relevan",
"Periodisitas = k","ICC tinggi")
)
knitr::kable(tbl, caption="Perbandingan Empat Metode Probability Sampling",
col.names=c("Metode","Kebutuhan Frame","Prinsip Desain",
"Efisiensi Relatif","Biaya","Risiko Utama"))| Metode | Kebutuhan Frame | Prinsip Desain | Efisiensi Relatif | Biaya | Risiko Utama |
|---|---|---|---|---|---|
| SRS | Daftar unsur | Semua sampel-n peluang sama | Baseline | Sedang | — |
| Stratified | Daftar unsur per strata | Homogen dalam, heterogen antar strata | ≥ SRS (bila strata relevan) | Tinggi | Stratifikasi tidak relevan |
| Systematic | Daftar unsur terurut | Interval tetap dari start acak | ≥ SRS (terurut); ≤ SRS (periodik) | Rendah | Periodisitas = k |
| Cluster | Daftar gerombol | Heterogen dalam, homogen antar gerombol | ≤ SRS (bila ICC tinggi) | Sangat Rendah | ICC tinggi |
set.seed(2025)
N <- 600; n <- 60; n_sim <- 2000
# Populasi dengan 3 strata (tiap 200 unit)
mu_strata <- c(40, 70, 100)
pop <- unlist(lapply(seq_along(mu_strata), function(h) rnorm(200, mu_strata[h], 8)))
strata_idx <- rep(1:3, each=200)
cluster_idx <- rep(1:60, each=10) # 60 gerombol @10 unit
mu_pop <- mean(pop)
sim_SRS <- replicate(n_sim, mean(sample(pop, n)))
sim_STR <- replicate(n_sim, {
ys <- tapply(seq_along(pop), strata_idx, function(idx)
mean(pop[sample(idx, 20)]))
sum(200/N * ys)
})
sim_SYS <- replicate(n_sim, {
k_sys <- N/n; m <- sample(1:k_sys,1)
mean(pop[seq(m, N, k_sys)])
})
sim_CLU <- replicate(n_sim, {
cl_sel <- sample(1:60, 6)
mean(pop[cluster_idx %in% cl_sel])
})
df_final <- rbind(
data.frame(xbar=sim_SRS, metode="SRS"),
data.frame(xbar=sim_STR, metode="Stratified"),
data.frame(xbar=sim_SYS, metode="Systematic"),
data.frame(xbar=sim_CLU, metode="Cluster")
)
ggplot(df_final, aes(x=xbar, fill=metode, color=metode)) +
geom_density(alpha=.3, linewidth=1.1) +
geom_vline(xintercept=mu_pop, color=pal["navy"], linetype="dashed", linewidth=1.2) +
scale_fill_manual(values=c(pal["teal"],pal["sage"],pal["amber"],pal["coral"])) +
scale_color_manual(values=c(pal["teal"],pal["sage"],pal["amber"],pal["coral"])) +
labs(title="Distribusi Sampling ȳ: Perbandingan Empat Metode",
subtitle="Garis navy = μ populasi | Distribusi lebih sempit = lebih efisien",
x="Nilai ȳ", y="Densitas", fill="Metode", color="Metode") + th()##
## === SE Tiap Metode (Simulasi 2000 kali) ===
se_tbl <- df_final |> group_by(metode) |>
summarise(SE=round(sd(xbar),4), RE_vs_SRS=round(sd(sim_SRS)/sd(xbar),3))
print(as.data.frame(se_tbl))## metode SE RE_vs_SRS
## 1 Cluster 9.7766 0.321
## 2 SRS 3.1360 1.000
## 3 Stratified 0.9636 3.254
## 4 Systematic 0.9116 3.440
- Cochran, W.G. (1977). Sampling Techniques (3rd ed.). Wiley.
- Scheaffer, R.L., Mendenhall, W., & Ott, R.L. (2012). Elementary Survey Sampling (7th ed.). Cengage.
- Lohr, S.L. (2021). Sampling: Design and Analysis (3rd ed.). CRC Press.
- Sukhatme, P.V. (1984). Sampling Theory of Surveys with Applications. Iowa State University.
Seluruh kode R ini dibuat oleh Nida Khairunnissa dan dapat
dijalankan ulang. Pastikan paket ggplot2,
dplyr, tidyr, scales,
gridExtra, dan knitr sudah
terinstall.