1 Pendahuluan

Dokumen ini menyajikan materi Statistika Penarikan Contoh secara komprehensif: review statistika dasar, konsep survei, Simple Random Sampling, Stratified, Systematic, hingga Cluster Sampling — dilengkapi penurunan rumus, contoh soal tingkat lanjut, dan visualisasi menggunakan R & ggplot2.

Peta Materi:

Bab Topik Konsep Kunci
P01 Review Statistika Dasar Parameter, statistik, bias, MSE
P02 Konsep Dasar Survei Kerangka sampel, sampling/non-sampling error
P03 SRS FPC, penduga rataan, total, proporsi
P04 Stratified Sampling Alokasi proporsional, Neyman, optimum
P05 Systematic Sampling Periodisitas, korelasi intrakelas \(\rho\)
P06 Cluster Sampling Ratio estimator, ICC, ANOVA

2 P01: Review Statistika Dasar

2.1 Populasi vs Contoh

  • Populasi — keseluruhan objek yang menjadi perhatian, dicirikan oleh parameter (\(\mu, \sigma^2, p, \tau\)).
  • Contoh (sampel) — subset populasi yang dipilih secara terencana; karakteristiknya disebut statistik (\(\bar{y}, s^2, \hat{p}\)).

Mengapa Sampling Diperlukan?

Alasan Penjelasan
Keterbatasan sumber daya Sensus mahal dan butuh waktu panjang
Kecepatan Sampel dapat dianalisis lebih cepat
Akurasi lebih tinggi Pengamatan terfokus mengurangi measurement error
Sifat destruktif Uji ketahanan bahan, rasa makanan, dll.
Inaccessibility Populasi tidak dapat dijangkau seluruhnya

2.2 Parameter vs Statistik

Populasi (Parameter) Sampel (Statistik)
Rataan \(\mu = \frac{1}{N}\sum Y_i\) \(\bar{y} = \frac{1}{n}\sum y_i\)
Varians \(\sigma^2 = \frac{\sum(Y_i-\mu)^2}{N}\) \(s^2 = \frac{\sum(y_i-\bar{y})^2}{n-1}\)
Proporsi \(P\) \(\hat{p} = a/n\)
Total \(\tau = N\mu\) \(\hat{\tau} = N\bar{y}\)

2.3 Sifat Penduga yang Baik

Penduga \(\hat{\theta}\) dikatakan baik bila memenuhi:

  1. Tak Bias (Unbiased): \(E(\hat{\theta}) = \theta\)
  2. Efisien: \(Var(\hat{\theta})\) terkecil di antara semua penduga tak bias
  3. Konsisten: \(\hat{\theta} \xrightarrow{p} \theta\) saat \(n \to \infty\)
  4. MSE: \(MSE(\hat{\theta}) = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2\)

Penurunan MSE:

\(MSE(\hat\theta) = E[(\hat\theta-\theta)^2] = E[(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2]\)

\(= E[(\hat\theta - E\hat\theta)^2] + 2\underbrace{E[(\hat\theta-E\hat\theta)]}_{=0}(E\hat\theta-\theta) + (E\hat\theta-\theta)^2\)

\(\therefore\quad MSE(\hat\theta) = Var(\hat\theta) + Bias^2(\hat\theta)\)

Soal P01.1 — Analisis Bias dan MSE

Populasi \(N=3\): \(\{2,4,6\}\), \(\mu=4\). Ambil \(n=2\) tanpa pengembalian. Evaluasi \(\hat\mu_1=\bar{y}\) dan \(\hat\mu_2=\frac{y_{(1)}+y_{(2)}}{3}\).

populasi <- c(2, 4, 6)
mu <- mean(populasi)

combs <- combn(populasi, 2, simplify=FALSE)
df_est <- do.call(rbind, lapply(combs, function(s){
  data.frame(Sampel=paste0("{",s[1],",",s[2],"}"),
             mu1=mean(s), mu2=(s[1]+s[2])/3)
}))

e1 <- mean(df_est$mu1); e2 <- mean(df_est$mu2)
b1 <- e1-mu;             b2 <- e2-mu
v1 <- mean((df_est$mu1-e1)^2); v2 <- mean((df_est$mu2-e2)^2)
mse1 <- v1+b1^2;         mse2 <- v2+b2^2

cat("===== Semua Kemungkinan Sampel =====\n"); print(df_est)
## ===== Semua Kemungkinan Sampel =====
##   Sampel mu1      mu2
## 1  {2,4}   3 2.000000
## 2  {2,6}   4 2.666667
## 3  {4,6}   5 3.333333
cat(sprintf("\nμ populasi = %.2f\n",mu))
## 
## μ populasi = 4.00
cat(sprintf("\n--- μ̂₁ = ȳ ---\nE=%.4f | Bias=%.4f | Var=%.4f | MSE=%.4f\n",e1,b1,v1,mse1))
## 
## --- μ̂₁ = ȳ ---
## E=4.0000 | Bias=0.0000 | Var=0.6667 | MSE=0.6667
cat(sprintf("\n--- μ̂₂ = (y₁+y₂)/3 ---\nE=%.4f | Bias=%.4f | Var=%.4f | MSE=%.4f\n",e2,b2,v2,mse2))
## 
## --- μ̂₂ = (y₁+y₂)/3 ---
## E=2.6667 | Bias=-1.3333 | Var=0.2963 | MSE=2.0741
df_mse <- data.frame(
  Penduga  = rep(c("μ̂₁ = ȳ (Tak Bias)","μ̂₂ = (y₁+y₂)/3 (Bias)"), each=3),
  Komponen = rep(c("Varians","Bias²","MSE"), 2),
  Nilai    = c(v1, b1^2, mse1, v2, b2^2, mse2)
)
df_mse$Komponen <- factor(df_mse$Komponen, levels=c("Varians","Bias²","MSE"))

ggplot(df_mse, aes(x=Komponen, y=Nilai, fill=Komponen)) +
  geom_col(color="white", linewidth=.7, width=.6) +
  geom_text(aes(label=round(Nilai,3)), vjust=-.4, fontface="bold", size=4) +
  facet_wrap(~Penduga) +
  scale_fill_manual(values=c(pal["teal"],pal["coral"],pal["navy"])) +
  labs(title="Dekomposisi MSE: Varians + Bias²",
       subtitle="Penduga tak bias: Bias²=0, sehingga MSE = Varians",
       x="", y="Nilai") +
  th() + theme(legend.position="none",
               strip.text=element_text(face="bold", size=11))

Soal P01.2 — Robustness: Rataan vs Median Terhadap Outlier

Simulasi 100 data pendapatan (log-normal) dengan 5 outlier ekstrem.

set.seed(123)
bersih  <- rlnorm(95, meanlog=2, sdlog=.5)
outlier <- c(500, 600, 700, 800, 1000)
kotor   <- c(bersih, outlier)

cat(sprintf("%-18s %12s %12s\n","Ukuran","Data Bersih","Data Kotor"))
## Ukuran              Data Bersih   Data Kotor
cat(sprintf("%-18s %12.3f %12.3f\n","Rataan",   mean(bersih),  mean(kotor)))
## Rataan                    8.454       44.031
cat(sprintf("%-18s %12.3f %12.3f\n","Median", median(bersih),median(kotor)))
## Median                    7.654        7.872
cat(sprintf("%-18s %12.3f %12.3f\n","Std Dev",    sd(bersih),    sd(kotor)))
## Std Dev                   3.862      160.628
cat(sprintf("%-18s %12.3f %12.3f\n","IQR",       IQR(bersih),   IQR(kotor)))
## IQR                       4.505        5.319
df_dens <- rbind(
  data.frame(x=bersih, status="Data Bersih"),
  data.frame(x=kotor,  status="Data + Outlier")
)
ggplot(df_dens, aes(x=x, fill=status, color=status)) +
  geom_density(alpha=.3, linewidth=1) +
  geom_vline(xintercept=mean(bersih),  color=pal["teal"],  linetype="dashed", linewidth=1) +
  geom_vline(xintercept=mean(kotor),   color=pal["coral"], linetype="dashed", linewidth=1) +
  geom_vline(xintercept=median(bersih),color=pal["navy"],  linetype="solid",  linewidth=.9) +
  annotate("text",x=mean(bersih)+1.5,y=.06,label="Mean Bersih",color=pal["teal"],size=3.5) +
  annotate("text",x=mean(kotor)+1.5, y=.04,label="Mean+Outlier",color=pal["coral"],size=3.5) +
  scale_fill_manual(values=c(pal["teal"],pal["coral"])) +
  scale_color_manual(values=c(pal["teal"],pal["coral"])) +
  coord_cartesian(xlim=c(0,30)) +
  labs(title="Efek Outlier: Rataan vs Median",
       subtitle="Median jauh lebih robust — tidak bergeser drastis akibat outlier",
       x="Pendapatan (juta Rp)", y="Kepadatan", fill="", color="") +
  th()

💡 Breakdown Point: Median tahan hingga 50% data diganti outlier. Rataan memiliki breakdown point 0% — satu outlier saja sudah menggeser nilainya.


3 P02: Konsep Dasar Survei

3.1 Sensus vs Survei

  • Sensus: Data dari seluruh populasi. Tidak ada sampling error, tapi mahal dan lambat.
  • Survei: Data dari sebagian populasi. Ada sampling error, namun lebih efisien.

3.2 Istilah Teknis Survei

Istilah Definisi Contoh
Unsur (Element) Objek tempat pengukuran diambil Individu, rumah tangga
Satuan PC Kumpulan unsur tidak tumpang-tindih RT, desa
Kerangka (Frame) Daftar satuan PC sebagai dasar sampling Buku alamat, daftar KK
Target Populasi Populasi yang ingin digeneralisasi Semua mahasiswa aktif
Studi Populasi Populasi yang benar-benar tercakup frame Mahasiswa terdaftar

3.3 Probability vs Non-Probability Sampling

Probability Sampling Non-Probability Sampling
Peluang terpilih Diketahui, > 0 Tidak diketahui
Inferensi statistik Valid Tidak valid secara formal
Contoh SRS, Stratified, Cluster Convenience, Snowball, Quota
Kelemahan Butuh kerangka sampling Rentan selection bias

3.4 Sumber Kesalahan Survei

\[\text{Total Error} = \text{Sampling Error} + \text{Non-sampling Error}\]

Sampling Error — dapat dikurangi dengan memperbesar \(n\): \[SE(\bar{y}) = \sqrt{\frac{s^2}{n}\left(1-\frac{n}{N}\right)}\]

Non-sampling Errortidak hilang meski \(n\) diperbesar:

  • Coverage Error: frame tidak mencakup populasi target
  • Non-response Error: responden tidak menjawab
  • Measurement Error: instrumen atau pewawancara bias

Soal P02.1 — Non-Response Bias (Worst-Case Analysis)

Dari 200 sampel, 120 merespon (rata-rata kepuasan = 4.2/5). Perkirakan kisaran nilai populasi bila non-responden sangat tidak puas.

n_resp <- 120; n_nonresp <- 80; mean_resp <- 4.2
nonresp_range <- seq(1.0, 4.2, by=0.1)
mean_pop <- (n_resp*mean_resp + n_nonresp*nonresp_range) / (n_resp+n_nonresp)
df_nr <- data.frame(mean_nonresp=nonresp_range, mean_pop=mean_pop)

cat("=== Skenario Non-Response ===\n")
## === Skenario Non-Response ===
for(k in c(1.5, 2.0, 2.5, 3.0, 4.2))
  cat(sprintf("Non-resp mean=%.1f → Est.populasi=%.3f\n",
              k, (n_resp*mean_resp+n_nonresp*k)/(n_resp+n_nonresp)))
## Non-resp mean=1.5 → Est.populasi=3.120
## Non-resp mean=2.0 → Est.populasi=3.320
## Non-resp mean=2.5 → Est.populasi=3.520
## Non-resp mean=3.0 → Est.populasi=3.720
## Non-resp mean=4.2 → Est.populasi=4.200
worst_y <- (n_resp*mean_resp + n_nonresp*1.5)/(n_resp+n_nonresp)

ggplot(df_nr, aes(mean_nonresp, mean_pop)) +
  geom_ribbon(aes(ymin=mean_pop, ymax=4.2), fill=pal["coral"], alpha=.15) +
  geom_line(color=pal["teal"], linewidth=1.4) +
  geom_point(data=data.frame(x=1.5, y=worst_y), aes(x=x,y=y),
             color=pal["coral"], size=4) +
  annotate("text",x=1.7,y=worst_y-.07,label="Skenario terburuk",
           color=pal["coral"],size=3.5) +
  geom_hline(yintercept=4.2, linetype="dashed", color=pal["amber"], linewidth=1) +
  annotate("text",x=3.8,y=4.25,label="Jika semua merespon (4.2)",
           color=pal["amber"],size=3.5) +
  labs(title="Dampak Non-Response Bias pada Estimasi Kepuasan",
       subtitle="Area merah = rentang bias yang mungkin terjadi",
       x="Rata-rata Kepuasan Non-Responden",
       y="Estimasi Rata-rata Populasi") +
  th()

Soal P02.2 — Coverage Error: Survei Internet Pedesaan

Frame = daftar telepon rumah. Analisis arah bias estimasi proporsi pengguna internet.

df_pop <- data.frame(
  Kelompok    = c("Tel. Rumah\n(Tercakup)","Hanya HP\n(Parsial)","Tanpa Telp\n(Excluded)"),
  Proporsi    = c(0.30, 0.50, 0.20),
  PenggunaNet = c(0.75, 0.60, 0.25),
  Cakupan     = c("Ya","Parsial","Tidak")
)
p_true  <- sum(df_pop$Proporsi * df_pop$PenggunaNet)
p_frame <- df_pop$PenggunaNet[1]

ggplot(df_pop, aes(x=Kelompok, y=PenggunaNet, fill=Cakupan)) +
  geom_col(color="white", linewidth=.7, width=.6) +
  geom_hline(yintercept=p_true,  color=pal["teal"], linetype="dashed", linewidth=1.2) +
  geom_hline(yintercept=p_frame, color=pal["coral"],linetype="dashed", linewidth=1.2) +
  annotate("text",x=3.5,y=p_true+.02,
           label=paste0("P populasi = ",round(p_true,2)),
           color=pal["teal"],size=3.5,hjust=1) +
  annotate("text",x=3.5,y=p_frame-.03,
           label=paste0("P dari frame = ",p_frame," (overestimate)"),
           color=pal["coral"],size=3.5,hjust=1) +
  scale_fill_manual(values=c("Tidak"=pal["coral"],"Parsial"=pal["amber"],"Ya"=pal["teal"])) +
  scale_y_continuous(labels=percent_format()) +
  labs(title="Ilustrasi Coverage Error pada Survei Internet",
       subtitle=sprintf("Bias = %.2f — overestimate karena kelompok tanpa akses tidak tercakup",
                        p_frame-p_true),
       x="", y="Proporsi Pengguna Internet", fill="Cakupan Frame") +
  th()


4 P03: Penarikan Contoh Acak Sederhana (SRS)

SRS adalah metode di mana setiap kemungkinan sampel berukuran \(n\) dari populasi \(N\) memiliki peluang terpilih yang sama: \(\binom{N}{n}^{-1}\).

4.1 SRS-WR vs SRS-WOR

SRS-WR (dengan kembali) SRS-WOR (tanpa kembali)
\(Var(\bar{y})\) \(\sigma^2/n\) \(\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}\)
FPC Tidak ada \(\frac{N-n}{N-1}\)
Efisiensi Lebih rendah Lebih tinggi

4.2 Rumus Pendugaan Parameter

1. Rataan: \[\bar{y} = \frac{1}{n}\sum y_i \qquad \widehat{Var}(\bar{y}) = \frac{s^2}{n}\!\left(1-\frac{n}{N}\right) \qquad B = 2\sqrt{\widehat{Var}(\bar{y})}\]

2. Total: \[\hat{\tau} = N\bar{y} \qquad \widehat{Var}(\hat\tau) = N^2\widehat{Var}(\bar{y})\]

3. Proporsi: \[\hat{p}=\frac{a}{n} \qquad \widehat{Var}(\hat{p})=\frac{\hat{p}(1-\hat{p})}{n-1}\!\left(1-\frac{n}{N}\right)\]

4.3 Finite Population Correction (FPC)

Dalam SRS-WOR unit-unit tidak independen. Kovarians antar unit: \[Cov(y_i,y_j) = -\frac{\sigma^2}{N-1} \quad (i\neq j)\]

Ekspansi \(Var(\bar{y}) = \frac{1}{n^2}\!\left[n\sigma^2 + n(n-1)\!\left(-\frac{\sigma^2}{N-1}\right)\right] = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}\)

Jika \(n\to N\): FPC \(\to 0\) → sensus, tidak ada ketidakpastian.
Jika \(n/N \leq 5\%\): FPC \(\approx 1\), dapat diabaikan.

4.4 Penentuan Ukuran Contoh

Untuk menduga \(\mu\) dengan batas kesalahan \(B\) (kepercayaan 95%): \[n = \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad D = \frac{B^2}{4}\]

Untuk menduga \(p\): \[n = \frac{Np(1-p)}{(N-1)D + p(1-p)}\]

Bila \(N\to\infty\): \(n \approx \sigma^2/D\) atau \(n \approx p(1-p)/D\).

Soal P03.1 — Pendugaan Total Produksi Kopi

\(N=1500\) pohon, SRS \(n=50\). Estimasi total produksi dan 95% CI.

set.seed(456); N <- 1500; n <- 50
data_kopi <- rnorm(n, mean=15, sd=3)

y_bar    <- mean(data_kopi); s_sq <- var(data_kopi)
var_ybar <- (s_sq/n)*(1-n/N)
tau_hat  <- N*y_bar
var_tau  <- N^2*var_ybar
B_tau    <- 2*sqrt(var_tau)
ci_low   <- tau_hat - qnorm(.975)*sqrt(var_tau)
ci_high  <- tau_hat + qnorm(.975)*sqrt(var_tau)

cat(sprintf("ȳ      = %.3f kg | s² = %.3f\n", y_bar, s_sq))
## ȳ      = 15.442 kg | s² = 10.046
cat(sprintf("Var(ȳ) = %.5f   | SE(ȳ) = %.4f\n", var_ybar, sqrt(var_ybar)))
## Var(ȳ) = 0.19422   | SE(ȳ) = 0.4407
cat(sprintf("τ̂      = %.1f kg | SE(τ̂) = %.2f\n", tau_hat, sqrt(var_tau)))
## τ̂      = 23163.2 kg | SE(τ̂) = 661.05
cat(sprintf("B      = %.2f kg\nCI 95%%: [%.1f, %.1f] kg\n", B_tau, ci_low, ci_high))
## B      = 1322.10 kg
## CI 95%: [21867.6, 24458.9] kg
# Distribusi sampling simulasi
set.seed(789)
pop_kopi <- rnorm(N, 15, 3)
tau_sim  <- replicate(3000, N*mean(sample(pop_kopi, n)))
ggplot(data.frame(tau=tau_sim), aes(x=tau)) +
  geom_histogram(aes(y=after_stat(density)), bins=50,
                 fill=pal["teal"], color="white", alpha=.85) +
  geom_vline(xintercept=tau_hat, color=pal["coral"], linewidth=1.3, linetype="dashed") +
  geom_vline(xintercept=c(ci_low,ci_high), color=pal["amber"],
             linewidth=1, linetype="dotted") +
  annotate("text",x=tau_hat+60,y=.007,
           label=sprintf("τ̂ = %.0f kg",tau_hat),
           color=pal["coral"],hjust=0,size=4) +
  labs(title="Distribusi Sampling τ̂ (Simulasi 3000 sampel)",
       subtitle="Garis merah = estimasi titik | Garis kuning = batas CI 95%",
       x="Estimasi Total Produksi (kg)", y="Densitas") + th()

Soal P03.2 — Visualisasi Efek FPC

Bandingkan SE dengan dan tanpa FPC pada berbagai fraksi sampling.

N <- 1000; sigma <- 50
n_seq  <- 1:(N-1)
se_wr  <- sigma/sqrt(n_seq)
se_wor <- sigma/sqrt(n_seq)*sqrt((N-n_seq)/(N-1))

df_fpc <- data.frame(n=n_seq,
                     se_wr=se_wr, se_wor=se_wor,
                     frac=n_seq/N)

p1 <- ggplot(df_fpc, aes(x=n)) +
  geom_line(aes(y=se_wr, color="SRS-WR"),  linewidth=1.2) +
  geom_line(aes(y=se_wor,color="SRS-WOR"), linewidth=1.2) +
  geom_vline(xintercept=0.05*N, linetype="dashed", color="#aaa") +
  annotate("text",x=60,y=12,label="5% N",color="#777",size=3.5) +
  scale_color_manual(values=c(pal["coral"],pal["teal"])) +
  labs(title="SE: WR vs WOR", x="n", y="SE(ȳ)", color="") + th()

p2 <- ggplot(df_fpc, aes(x=frac, y=sqrt((N-n_seq)/(N-1)))) +
  geom_line(color=pal["purple"], linewidth=1.3) +
  geom_hline(yintercept=1, linetype="dashed", color=pal["coral"]) +
  scale_x_continuous(labels=percent_format()) +
  labs(title="Nilai FPC vs Fraksi Sampling",
       x="Fraksi Sampling (n/N)", y="Nilai FPC") + th()

grid.arrange(p1, p2, ncol=2)

Soal P03.3 — Ukuran Sampel untuk Proporsi

Proporsi persetujuan \(p=0.6\), \(B=0.03\), kepercayaan 95%. Berapa \(n\) minimum?

p_prior <- 0.6; B <- 0.03; D <- B^2/4

N_vals <- c(500, 1000, 5000, 10000)
n_vals <- ceiling(N_vals*p_prior*(1-p_prior) /
                  ((N_vals-1)*D + p_prior*(1-p_prior)))
n_inf  <- ceiling(p_prior*(1-p_prior)/D)
n_cons <- ceiling(0.5*0.5/D)

cat("=== Ukuran Sampel Minimum (p=0.6, B=0.03) ===\n")
## === Ukuran Sampel Minimum (p=0.6, B=0.03) ===
df_n <- data.frame(N=format(N_vals,big.mark=","), n_min=n_vals)
df_n <- rbind(df_n, data.frame(N="∞", n_min=n_inf))
print(df_n)
##        N n_min
## 1    500   341
## 2  1,000   517
## 3  5,000   880
## 4 10,000   964
## 5      ∞  1067
cat(sprintf("\nKonservatif p=0.5: n = %d\n",n_cons))
## 
## Konservatif p=0.5: n = 1112

5 P04: Penarikan Contoh Acak Berlapis (Stratified)

Stratified Random Sampling (PCAB) membagi populasi menjadi \(L\) strata saling lepas, lalu menerapkan SRS dalam setiap strata.

Prinsip: Homogen dalam strata, heterogen antar strata.

5.1 Rumus Pendugaan

Rataan berlapis: \[\bar{y}_{st} = \sum_{h=1}^L W_h\bar{y}_h, \quad W_h = N_h/N\]

Variansi: \[\widehat{Var}(\bar{y}_{st}) = \sum_{h=1}^L W_h^2\frac{s_h^2}{n_h}\!\left(1-\frac{n_h}{N_h}\right)\]

Total: \(\hat\tau_{st} = N\bar{y}_{st} = \sum_{h=1}^L N_h\bar{y}_h\)

5.2 Tiga Skema Alokasi

Alokasi Rumus \(n_h\) Keunggulan
Proporsional \(n_h = n\,\dfrac{N_h}{N}\) Sederhana, menjamin representasi
Neyman \(n_h = n\,\dfrac{N_h s_h}{\sum N_j s_j}\) Meminimasi \(Var\) bila biaya sama
Optimum \(n_h = n\,\dfrac{N_h s_h/\sqrt{c_h}}{\sum N_j s_j/\sqrt{c_j}}\) Meminimasi \(Var\) dengan biaya berbeda

Hierarki: \(Var_{Opt} \leq Var_{Neyman} \leq Var_{Prop} \leq Var_{SRS}\)

Pembuktian Stratified ≥ SRS:

\[Var_{SRS}(\bar{y}) \approx Var_{Prop}(\bar{y}_{st}) + \frac{1}{n}\sum_{h=1}^L W_h(\mu_h-\mu)^2\]

Karena \(\sum W_h(\mu_h-\mu)^2 \geq 0\), maka \(Var_{SRS} \geq Var_{Prop}\).
Keduanya sama bila \(\mu_h = \mu\) untuk semua \(h\) (strata tidak informatif).

Soal P04.1 — Survei Konsumsi Listrik (3 Strata)

\(N=6200\) pelanggan. Bandingkan alokasi Proporsional, Neyman, dan Optimum dengan \(n=300\).

N_h <- c(5000, 1000, 200)
s_h <- c(50,   200,  500)
c_h <- c(9,    25,   100)
N <- sum(N_h); n_tot <- 300

n_prop   <- round(n_tot * N_h/N)
n_neyman <- round(n_tot * (N_h*s_h)/sum(N_h*s_h))
n_optim  <- round(n_tot * (N_h*s_h/sqrt(c_h))/sum(N_h*s_h/sqrt(c_h)))

df_alok <- data.frame(
  Strata  = rep(c("RT\n(N=5000,s=50)","Bisnis\n(N=1000,s=200)","Industri\n(N=200,s=500)"),3),
  Alokasi = rep(c("Proporsional","Neyman","Optimum"), each=3),
  n_h     = c(n_prop, n_neyman, n_optim)
)
df_alok$Alokasi <- factor(df_alok$Alokasi, levels=c("Proporsional","Neyman","Optimum"))

ggplot(df_alok, aes(x=Strata, y=n_h, fill=Alokasi)) +
  geom_col(position="dodge", color="white", linewidth=.6) +
  geom_text(aes(label=n_h), position=position_dodge(.9),
            vjust=-.4, size=3.8, fontface="bold") +
  scale_fill_manual(values=c(pal["teal"],pal["coral"],pal["purple"])) +
  labs(title="Perbandingan Tiga Skema Alokasi Sampel",
       subtitle="Neyman & Optimum mengalokasikan lebih banyak ke strata bervariansi tinggi",
       x="", y="Jumlah Sampel per Strata (nₕ)", fill="Metode") +
  th()

cat("\n=== Tabel Alokasi ===\n")
## 
## === Tabel Alokasi ===
print(data.frame(Strata=c("RT","Bisnis","Industri"),
                 N_h=N_h, s_h=s_h, c_h=c_h,
                 Prop=n_prop, Neyman=n_neyman, Optimum=n_optim))
##     Strata  N_h s_h c_h Prop Neyman Optimum
## 1       RT 5000  50   9  242    136     188
## 2   Bisnis 1000 200  25   48    109      90
## 3 Industri  200 500 100   10     55      23
var_st <- function(nh){ sum((N_h/N)^2 * s_h^2/nh * (1-nh/N_h)) }
cat("=== Variansi per Skema Alokasi ===\n")
## === Variansi per Skema Alokasi ===
cat(sprintf("Proporsional : Var=%.7f | SE=%.5f\n",var_st(n_prop),  sqrt(var_st(n_prop))))
## Proporsional : Var=51.7455202 | SE=7.19344
cat(sprintf("Neyman       : Var=%.7f | SE=%.5f\n",var_st(n_neyman),sqrt(var_st(n_neyman))))
## Neyman       : Var=23.5652831 | SE=4.85441
cat(sprintf("Optimum      : Var=%.7f | SE=%.5f\n",var_st(n_optim), sqrt(var_st(n_optim))))
## Optimum      : Var=28.8546790 | SE=5.37166

Soal P04.2 — Estimasi Proporsi Berlapis

\(N_1=3000\) (Eksakta), \(N_2=2000\) (Non-Eksakta). Sampel \(n_1=150\) (90 setuju), \(n_2=100\) (80 setuju). Hitung \(\hat{p}_{st}\) dan CI 95%.

N_s <- c(3000,2000); N_tot <- sum(N_s)
n_s <- c(150,100); setuju <- c(90,80)
p_h <- setuju/n_s

p_st    <- sum(N_s/N_tot * p_h)
var_h   <- p_h*(1-p_h)/(n_s-1) * (1-n_s/N_s)
var_stp <- sum((N_s/N_tot)^2 * var_h)
B_p     <- 2*sqrt(var_stp)

cat(sprintf("p̂₁ = %d/%d = %.4f | p̂₂ = %d/%d = %.4f\n",setuju[1],n_s[1],p_h[1],setuju[2],n_s[2],p_h[2]))
## p̂₁ = 90/150 = 0.6000 | p̂₂ = 80/100 = 0.8000
cat(sprintf("p̂_st = %.4f | SE = %.4f | B = %.4f\n",p_st,sqrt(var_stp),B_p))
## p̂_st = 0.6800 | SE = 0.0282 | B = 0.0564
cat(sprintf("CI 95%%: [%.4f, %.4f]\n",p_st-B_p,p_st+B_p))
## CI 95%: [0.6236, 0.7364]
df_ci <- data.frame(
  Strata=c("Eksakta","Non-Eksakta","Gabungan (st)"),
  p=c(p_h,p_st), se=c(sqrt(var_h),sqrt(var_stp))
)
ggplot(df_ci, aes(y=Strata, x=p, color=Strata)) +
  geom_errorbarh(aes(xmin=p-2*se, xmax=p+2*se), height=.25, linewidth=1.4) +
  geom_point(size=4) +
  scale_color_manual(values=c(pal["teal"],pal["coral"],pal["navy"])) +
  scale_x_continuous(labels=percent_format(), limits=c(.45,1)) +
  labs(title="CI 95% Proporsi Setuju Kuliah Hybrid per Strata",
       x="Proporsi (%)", y="") +
  th() + theme(legend.position="none")


6 P05: Penarikan Contoh Sistematik

Systematic Sampling memilih unit awal secara acak (\(m\), \(1\leq m\leq k\)) kemudian mengambil setiap unit ke-\(k\): urutan \(m,\, m+k,\, m+2k,\,\ldots\)

6.1 Prosedur dan Variansi

Interval: \(k = \lfloor N/n \rfloor\)

Variansi teoritis: \[Var(\bar{y}_{sys}) = \frac{\sigma^2}{n}[1+(n-1)\rho]\]

\(\rho\) = korelasi intrakelas antar unit dalam satu sampel sistematik.

6.2 Tiga Tipe Populasi

Tipe \(\rho\) Efisiensi vs SRS
Acak \(\approx 0\) Sama
Terurut (ada tren) \(< 0\) Lebih baik
Periodik (\(k\) = periode) \(> 0\) Lebih buruk (berbahaya)

Soal P05.1 — Simulasi Tiga Tipe Populasi

\(N=500\), \(n=50\). Bandingkan distribusi \(\bar{y}_{sys}\) pada populasi acak, terurut, dan periodik.

set.seed(2024)
N <- 500; n <- 50; k <- N/n

pop_acak     <- rnorm(N, 50, 10)
pop_terurut  <- seq(10, 90, length=N) + rnorm(N, 0, 3)
pop_periodik <- 50 + 30*sin(2*pi*(1:N)/k) + rnorm(N, 0, 2)

sim_sys <- function(pop){
  sapply(1:k, function(m) mean(pop[seq(m, by=k, length.out=n)]))
}

r_acak <- sim_sys(pop_acak)
r_urut <- sim_sys(pop_terurut)
r_peri <- sim_sys(pop_periodik)

df_sim <- rbind(
  data.frame(xbar=r_acak, tipe="Populasi Acak\n(ρ ≈ 0)"),
  data.frame(xbar=r_urut, tipe="Populasi Terurut\n(ρ < 0)"),
  data.frame(xbar=r_peri, tipe="Populasi Periodik\n(ρ > 0, berbahaya)")
)
df_sim$tipe <- factor(df_sim$tipe,
  levels=c("Populasi Acak\n(ρ ≈ 0)","Populasi Terurut\n(ρ < 0)",
           "Populasi Periodik\n(ρ > 0, berbahaya)"))

ggplot(df_sim, aes(x=xbar, fill=tipe)) +
  geom_histogram(aes(y=after_stat(density)), bins=20, alpha=.8, color="white") +
  geom_vline(xintercept=50, color=pal["navy"], linetype="dashed", linewidth=1) +
  facet_wrap(~tipe, ncol=1, scales="free_y") +
  scale_fill_manual(values=c(pal["teal"],pal["sage"],pal["coral"])) +
  labs(title="Distribusi Sampling ȳ_sys pada Tiga Tipe Populasi",
       subtitle="Lebar distribusi mencerminkan variansi estimasi",
       x="Nilai ȳ_sys", y="Densitas") +
  th() + theme(legend.position="none",
               strip.text=element_text(face="bold", size=10))

var_tbl <- data.frame(
  Tipe    = c("Acak","Terurut","Periodik"),
  Var_sys = round(c(var(r_acak), var(r_urut), var(r_peri)), 5),
  Var_SRS = round(rep(var(pop_acak)/n*(1-n/N),3), 5)
)
var_tbl$RE     <- round(var_tbl$Var_SRS/var_tbl$Var_sys, 3)
var_tbl$Status <- c("Sama","Lebih Efisien","Lebih Buruk")
cat("=== Variansi: Sistematik vs SRS ===\n"); print(var_tbl)
## === Variansi: Sistematik vs SRS ===
##       Tipe   Var_sys Var_SRS    RE        Status
## 1     Acak   1.68427 1.69849 1.008          Sama
## 2  Terurut   0.58364 1.69849 2.910 Lebih Efisien
## 3 Periodik 504.00735 1.69849 0.003   Lebih Buruk

Soal P05.2 — Bahaya Periodisitas: Omzet Supermarket

Sistematik dengan \(k=7\) pada data omzet mingguan → setiap sampel hanya mencakup satu hari.

set.seed(99)
hari   <- 1:365
siklus <- c(80, 75, 85, 90, 95, 150, 160)
omzet  <- siklus[((hari-1)%%7)+1] + rnorm(365, 0, 8)
mu_true <- mean(omzet)

means7 <- sapply(1:7, function(m) mean(omzet[seq(m,365,7)]))
df_sys7 <- data.frame(
  Start=1:7,
  Hari =c("Senin","Selasa","Rabu","Kamis","Jumat","Sabtu","Minggu"),
  Mean =round(means7,2),
  Bias =round(means7-mu_true,2)
)
cat("=== Estimasi per Start Day ===\n"); print(df_sys7)
## === Estimasi per Start Day ===
##   Start   Hari   Mean   Bias
## 1     1  Senin  80.56 -23.75
## 2     2 Selasa  73.24 -31.08
## 3     3   Rabu  86.13 -18.18
## 4     4  Kamis  88.19 -16.13
## 5     5  Jumat  93.84 -10.48
## 6     6  Sabtu 149.63  45.32
## 7     7 Minggu 159.07  54.75
cat(sprintf("\nRata-rata Populasi = %.2f\n",mu_true))
## 
## Rata-rata Populasi = 104.31
cat(sprintf("Rentang estimasi  : [%.2f, %.2f]\n",min(means7),max(means7)))
## Rentang estimasi  : [73.24, 159.07]
df_omz <- data.frame(hari=hari, omzet=omzet)
ggplot(df_omz, aes(x=hari, y=omzet)) +
  geom_line(color="#ccc", linewidth=.4) +
  geom_point(data=df_omz[seq(1,365,7),],color=pal["teal"], size=1.5) +
  geom_point(data=df_omz[seq(7,365,7),],color=pal["coral"],size=1.5) +
  geom_hline(yintercept=mu_true,    linetype="dashed",color=pal["navy"],  linewidth=1) +
  geom_hline(yintercept=means7[1],  linetype="dotted",color=pal["teal"],  linewidth=1) +
  geom_hline(yintercept=means7[7],  linetype="dotted",color=pal["coral"], linewidth=1) +
  annotate("text",x=355,y=mu_true+4,   label="Populasi",       color=pal["navy"], size=3.5,hjust=1) +
  annotate("text",x=355,y=means7[1]-4, label="Senin (under)",  color=pal["teal"], size=3.5,hjust=1) +
  annotate("text",x=355,y=means7[7]+4, label="Minggu (over)",  color=pal["coral"],size=3.5,hjust=1) +
  labs(title="Periodisitas k=7 pada Data Omzet Mingguan",
       subtitle="Sampel hari Senin under-estimate; hari Minggu over-estimate",
       x="Hari ke-", y="Omzet (juta Rp)") + th()

⚠️ Aturan: Jangan gunakan \(k =\) kelipatan periode siklus. Solusi: gunakan stratified dengan strata = hari dalam minggu, atau SRS penuh.

Soal P05.3 — Korelasi Intrakelas: Bukti \(Var=0\) saat \(\rho=-1/(n-1)\)

n_val  <- 10; sigma2 <- 100
rho_kritis <- -1/(n_val-1)
var_kritis  <- sigma2/n_val*(1+(n_val-1)*rho_kritis)
cat(sprintf("n=%d, σ²=%.0f\n", n_val, sigma2))
## n=10, σ²=100
cat(sprintf("ρ kritis = -1/(n-1) = %.6f\n", rho_kritis))
## ρ kritis = -1/(n-1) = -0.111111
cat(sprintf("Var(ȳ_sys) = %.0f/%.0f × [1+%d×(%.6f)] = %.8f ≈ 0 ✓\n",
            sigma2,n_val,n_val-1,rho_kritis,var_kritis))
## Var(ȳ_sys) = 100/10 × [1+9×(-0.111111)] = 0.00000000 ≈ 0 ✓
# Plot Var vs rho
rho_seq <- seq(rho_kritis, 0.9, by=0.01)
df_icc  <- data.frame(rho=rho_seq,
                       var=sigma2/n_val*(1+(n_val-1)*rho_seq))
ggplot(df_icc, aes(rho, var)) +
  geom_line(color=pal["teal"], linewidth=1.3) +
  geom_hline(yintercept=sigma2/n_val, color=pal["amber"], linetype="dashed") +
  geom_vline(xintercept=0, color="#aaa", linetype="dotted") +
  geom_point(x=rho_kritis, y=0, size=4, color=pal["coral"]) +
  annotate("text",x=rho_kritis+.05,y=2,
           label=sprintf("ρ=%.4f\nVar=0",rho_kritis),
           color=pal["coral"],size=3.5) +
  annotate("text",x=.7,y=sigma2/n_val+1.5,
           label="Var_SRS",color=pal["amber"],size=3.5) +
  labs(title="Var(ȳ_sys) sebagai Fungsi Korelasi Intrakelas ρ",
       x="ρ (korelasi intrakelas)", y="Var(ȳ_sys)") + th()


7 P06: Penarikan Contoh Gerombol (Cluster Sampling)

Cluster Sampling (PCG) menggunakan gerombol (kumpulan unsur) sebagai satuan penarikan contoh. Semua unsur dalam gerombol terpilih wajib diamati.

Prinsip: Heterogen dalam gerombol, homogen antar gerombol — berlawanan dengan Stratified.

7.1 Pendugaan Rasio (Ukuran Gerombol Tidak Sama)

\(N\) = total gerombol, \(n\) = gerombol sampel, \(m_i\) = ukuran gerombol ke-\(i\), \(y_i\) = total nilai gerombol ke-\(i\):

\[\hat{\bar{Y}} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n m_i}\]

\[\widehat{Var}(\hat{\bar{Y}}) = \left(1-\frac{n}{N}\right)\frac{1}{n\bar{M}^2}\cdot\frac{\sum_{i=1}^n(y_i-\hat{\bar{Y}}\cdot m_i)^2}{n-1}\]

7.2 Korelasi Intrakelas dan Efisiensi Relatif

\[\rho_{ICC} \approx \frac{MSB - MSW}{MSB + (M-1)MSW}\]

\[RE = \frac{Var_{SRS}}{Var_{cluster}} = \frac{1}{1+(M-1)\rho}\]

  • \(\rho > 0\) → PCG kurang efisien (lebih umum terjadi)
  • \(\rho \approx 0\) → PCG \(\equiv\) SRS
  • \(\rho < 0\) → PCG lebih efisien (gerombol sangat heterogen di dalam)

Soal P06.1 — Estimasi Rasio: Nilai Ujian Siswa SD

\(N=100\) sekolah, \(n=5\) sekolah terpilih, \(\bar{M}=45\). Estimasi rata-rata nilai per siswa.

df_cl <- data.frame(
  Sekolah = paste0("S",1:5),
  m_i     = c(50, 20, 100, 30, 40),
  y_i     = c(3500, 1600, 6500, 2400, 3000)
)
n_cl <- 5; N_cl <- 100; M_bar <- 45

ydb    <- sum(df_cl$y_i)/sum(df_cl$m_i)
df_cl$rata_i  <- df_cl$y_i/df_cl$m_i
df_cl$diff_sq <- (df_cl$y_i - ydb*df_cl$m_i)^2
var_ydb <- (1-n_cl/N_cl)/(n_cl*M_bar^2)*sum(df_cl$diff_sq)/(n_cl-1)
B_cl    <- 2*sqrt(var_ydb)

cat("=== Data Gerombol ===\n"); print(df_cl)
## === Data Gerombol ===
##   Sekolah m_i  y_i rata_i    diff_sq
## 1      S1  50 3500     70   1736.111
## 2      S2  20 1600     80  33611.111
## 3      S3 100 6500     65 340277.778
## 4      S4  30 2400     80  75625.000
## 5      S5  40 3000     75  27777.778
cat(sprintf("\nȳ̄ = Σy_i/Σm_i = %d/%d = %.2f\n",
            sum(df_cl$y_i),sum(df_cl$m_i),ydb))
## 
## ȳ̄ = Σy_i/Σm_i = 17000/240 = 70.83
cat(sprintf("Var(ȳ̄) = %.6f | SE = %.4f | B = %.4f\n",var_ydb,sqrt(var_ydb),B_cl))
## Var(ȳ̄) = 11.236454 | SE = 3.3521 | B = 6.7042
cat(sprintf("CI 95%%: [%.2f, %.2f]\n",ydb-B_cl,ydb+B_cl))
## CI 95%: [64.13, 77.54]
p1 <- ggplot(df_cl, aes(x=reorder(Sekolah,-m_i), y=rata_i, fill=Sekolah)) +
  geom_col(color="white",linewidth=.6,width=.6) +
  geom_hline(yintercept=ydb,linetype="dashed",color=pal["navy"],linewidth=1) +
  annotate("text",x=5,y=ydb+.8,label=sprintf("ȳ̄=%.1f",ydb),
           color=pal["navy"],hjust=1,size=4) +
  scale_fill_manual(values=unname(pal[1:5])) +
  labs(title="Rata-rata Nilai per Sekolah",x="Sekolah",y="Rata-rata Nilai") +
  th() + theme(legend.position="none")

p2 <- ggplot(df_cl, aes(x=Sekolah, y=m_i, fill=Sekolah)) +
  geom_col(color="white",linewidth=.6,width=.6) +
  scale_fill_manual(values=unname(pal[1:5])) +
  labs(title="Ukuran Gerombol (Jumlah Siswa)",x="Sekolah",y="mᵢ") +
  th() + theme(legend.position="none")

grid.arrange(p1, p2, ncol=2)

Soal P06.2 — Efisiensi Relatif vs ICC

Visualisasikan RE cluster sampling pada berbagai nilai \(\rho\) dan ukuran gerombol \(M\).

M_vals  <- c(5, 10, 20, 50)
rho_seq <- seq(-0.2, 0.8, by=0.01)

df_re <- expand.grid(M=M_vals, rho=rho_seq) |>
  mutate(RE=1/(1+(M-1)*rho), M_lab=paste0("M=",M))

ggplot(df_re, aes(x=rho, y=RE, color=factor(M), group=factor(M))) +
  geom_line(linewidth=1.2) +
  geom_hline(yintercept=1, linetype="dashed", color="#aaa") +
  geom_vline(xintercept=0, linetype="dotted", color="#ccc") +
  annotate("text",x=.65,y=1.08,label="RE=1 (≡ SRS)",color="#777",size=3.5) +
  annotate("rect",xmin=0,xmax=.8,ymin=-Inf,ymax=1,
           fill=pal["coral"],alpha=.07) +
  annotate("text",x=.4,y=.3,label="Cluster kurang\nefisien dari SRS",
           color=pal["coral"],size=3.5) +
  scale_color_manual(values=c(pal["sage"],pal["teal"],pal["amber"],pal["coral"])) +
  labs(title="Efisiensi Relatif (RE) Cluster vs SRS",
       x="Korelasi Intrakelas (ρ)", y="RE = Var(SRS)/Var(Cluster)",
       color="Ukuran Gerombol") + th()

Soal P06.3 — ANOVA untuk Menghitung ICC

Simulasikan dua kasus: gerombol homogen dalam (ICC tinggi) vs heterogen dalam (ICC rendah).

set.seed(2025)
N_cl2 <- 20; M2 <- 10

means_c <- rnorm(N_cl2, 70, 15)
pop1 <- unlist(lapply(means_c, function(m) rnorm(M2, m, 3)))
pop2 <- unlist(lapply(seq(60,80,length=N_cl2), function(m) rnorm(M2, 70, 12)))
grp  <- rep(1:N_cl2, each=M2)

hitung_icc <- function(y, grp){
  ms  <- summary(aov(y~factor(grp)))[[1]][,"Mean Sq"]
  MSB <- ms[1]; MSW <- ms[2]; M <- length(y)/length(unique(grp))
  rho <- (MSB-MSW)/(MSB+(M-1)*MSW)
  list(MSB=MSB, MSW=MSW, rho=rho, RE=1/(1+(M-1)*rho))
}

r1 <- hitung_icc(pop1, grp); r2 <- hitung_icc(pop2, grp)

cat("=== Kasus 1: Gerombol Homogen dalam, Heterogen antar ===\n")
## === Kasus 1: Gerombol Homogen dalam, Heterogen antar ===
cat(sprintf("MSB=%.1f | MSW=%.1f | ρ=%.4f | RE=%.4f → %s\n\n",
    r1$MSB,r1$MSW,r1$rho,r1$RE,
    ifelse(r1$RE<1,"KURANG EFISIEN dari SRS","LEBIH EFISIEN dari SRS")))
## MSB=2010.3 | MSW=8.8 | ρ=0.9577 | RE=0.1040 → KURANG EFISIEN dari SRS
cat("=== Kasus 2: Gerombol Heterogen dalam, Homogen antar ===\n")
## === Kasus 2: Gerombol Heterogen dalam, Homogen antar ===
cat(sprintf("MSB=%.1f | MSW=%.1f | ρ=%.4f | RE=%.4f → %s\n",
    r2$MSB,r2$MSW,r2$rho,r2$RE,
    ifelse(r2$RE<1,"KURANG EFISIEN dari SRS","LEBIH EFISIEN dari SRS")))
## MSB=95.4 | MSW=142.1 | ρ=-0.0340 | RE=1.4415 → LEBIH EFISIEN dari SRS
df_vis <- rbind(
  data.frame(nilai=pop1, gerombol=grp, kasus="Kasus 1: ICC Tinggi (ρ>0)"),
  data.frame(nilai=pop2, gerombol=grp, kasus="Kasus 2: ICC Rendah (ρ≈0)")
)
ggplot(df_vis, aes(x=factor(gerombol), y=nilai, color=kasus)) +
  geom_jitter(width=.2, alpha=.5, size=.8) +
  stat_summary(fun=mean, geom="point", shape=18, size=4, color=pal["navy"]) +
  facet_wrap(~kasus, scales="free") +
  scale_color_manual(values=c(pal["coral"],pal["teal"])) +
  labs(title="Struktur Gerombol dan Pengaruhnya pada ICC",
       subtitle="Berlian navy = rata-rata gerombol | Semakin seragam isi gerombol → ICC makin tinggi",
       x="Gerombol ke-", y="Nilai") +
  th() + theme(legend.position="none",
               axis.text.x=element_text(size=7),
               strip.text=element_text(face="bold"))


8 Ringkasan Komparasi Metode Sampling

8.1 Tabel Perbandingan Lengkap

tbl <- data.frame(
  Metode     = c("SRS","Stratified","Systematic","Cluster"),
  Frame      = c("Daftar unsur","Daftar unsur per strata",
                 "Daftar unsur terurut","Daftar gerombol"),
  Prinsip    = c("Semua sampel-n peluang sama",
                 "Homogen dalam, heterogen antar strata",
                 "Interval tetap dari start acak",
                 "Heterogen dalam, homogen antar gerombol"),
  Efisiensi  = c("Baseline","≥ SRS (bila strata relevan)",
                 "≥ SRS (terurut); ≤ SRS (periodik)",
                 "≤ SRS (bila ICC tinggi)"),
  Biaya      = c("Sedang","Tinggi","Rendah","Sangat Rendah"),
  Risiko     = c("—","Stratifikasi tidak relevan",
                 "Periodisitas = k","ICC tinggi")
)
knitr::kable(tbl, caption="Perbandingan Empat Metode Probability Sampling",
  col.names=c("Metode","Kebutuhan Frame","Prinsip Desain",
              "Efisiensi Relatif","Biaya","Risiko Utama"))
Perbandingan Empat Metode Probability Sampling
Metode Kebutuhan Frame Prinsip Desain Efisiensi Relatif Biaya Risiko Utama
SRS Daftar unsur Semua sampel-n peluang sama Baseline Sedang
Stratified Daftar unsur per strata Homogen dalam, heterogen antar strata ≥ SRS (bila strata relevan) Tinggi Stratifikasi tidak relevan
Systematic Daftar unsur terurut Interval tetap dari start acak ≥ SRS (terurut); ≤ SRS (periodik) Rendah Periodisitas = k
Cluster Daftar gerombol Heterogen dalam, homogen antar gerombol ≤ SRS (bila ICC tinggi) Sangat Rendah ICC tinggi

8.2 Simulasi Akhir: Perbandingan SE Empat Metode

set.seed(2025)
N <- 600; n <- 60; n_sim <- 2000

# Populasi dengan 3 strata (tiap 200 unit)
mu_strata <- c(40, 70, 100)
pop <- unlist(lapply(seq_along(mu_strata), function(h) rnorm(200, mu_strata[h], 8)))
strata_idx <- rep(1:3, each=200)
cluster_idx <- rep(1:60, each=10)  # 60 gerombol @10 unit
mu_pop <- mean(pop)

sim_SRS <- replicate(n_sim, mean(sample(pop, n)))

sim_STR <- replicate(n_sim, {
  ys <- tapply(seq_along(pop), strata_idx, function(idx)
    mean(pop[sample(idx, 20)]))
  sum(200/N * ys)
})

sim_SYS <- replicate(n_sim, {
  k_sys <- N/n; m <- sample(1:k_sys,1)
  mean(pop[seq(m, N, k_sys)])
})

sim_CLU <- replicate(n_sim, {
  cl_sel <- sample(1:60, 6)
  mean(pop[cluster_idx %in% cl_sel])
})

df_final <- rbind(
  data.frame(xbar=sim_SRS, metode="SRS"),
  data.frame(xbar=sim_STR, metode="Stratified"),
  data.frame(xbar=sim_SYS, metode="Systematic"),
  data.frame(xbar=sim_CLU, metode="Cluster")
)

ggplot(df_final, aes(x=xbar, fill=metode, color=metode)) +
  geom_density(alpha=.3, linewidth=1.1) +
  geom_vline(xintercept=mu_pop, color=pal["navy"], linetype="dashed", linewidth=1.2) +
  scale_fill_manual(values=c(pal["teal"],pal["sage"],pal["amber"],pal["coral"])) +
  scale_color_manual(values=c(pal["teal"],pal["sage"],pal["amber"],pal["coral"])) +
  labs(title="Distribusi Sampling ȳ: Perbandingan Empat Metode",
       subtitle="Garis navy = μ populasi | Distribusi lebih sempit = lebih efisien",
       x="Nilai ȳ", y="Densitas", fill="Metode", color="Metode") + th()

cat("\n=== SE Tiap Metode (Simulasi 2000 kali) ===\n")
## 
## === SE Tiap Metode (Simulasi 2000 kali) ===
se_tbl <- df_final |> group_by(metode) |>
  summarise(SE=round(sd(xbar),4), RE_vs_SRS=round(sd(sim_SRS)/sd(xbar),3))
print(as.data.frame(se_tbl))
##       metode     SE RE_vs_SRS
## 1    Cluster 9.7766     0.321
## 2        SRS 3.1360     1.000
## 3 Stratified 0.9636     3.254
## 4 Systematic 0.9116     3.440

8.3 Referensi

  • Cochran, W.G. (1977). Sampling Techniques (3rd ed.). Wiley.
  • Scheaffer, R.L., Mendenhall, W., & Ott, R.L. (2012). Elementary Survey Sampling (7th ed.). Cengage.
  • Lohr, S.L. (2021). Sampling: Design and Analysis (3rd ed.). CRC Press.
  • Sukhatme, P.V. (1984). Sampling Theory of Surveys with Applications. Iowa State University.

Seluruh kode R ini dibuat oleh Nida Khairunnissa dan dapat dijalankan ulang. Pastikan paket ggplot2, dplyr, tidyr, scales, gridExtra, dan knitr sudah terinstall.