Bootstrap for Regression and Missing Values

In statistical analysis, a central challenge is assessing how reliable the estimates obtained from data are. In real data, classical assumptions such as normality of residuals, linearity, or homogeneity of variance often do not fully hold. When that happens, traditional methods that rely on those assumptions become less reliable. The bootstrap offers a flexible alternative. As a non-parametric approach, it does not require the data to follow a particular distribution, and it lets us estimate uncertainty, such as standard errors and confidence intervals, by resampling the available data. This is especially useful in regression and when dealing with missing values, where classical methods may no longer be reliable enough.

Basic Concept of the Bootstrap

The bootstrap is a resampling technique: it creates many new samples from the existing sample by drawing observations with replacement.

Steps:

– Draw a random sample from the original data, of the same size, with replacement

– Compute the statistic of interest (for example, a regression coefficient)

– Repeat the first two steps R times (for example, 1000 times)

– Use the resulting bootstrap distribution to compute a standard error or confidence interval
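
The four steps can be written out by hand in base R before turning to a package. The sketch below is only an illustration: the toy vector sample_data, the choice of the mean as the statistic, and the seed are assumptions made for this example and are not part of the exercises that follow.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Hypothetical toy data: 50 draws from a skewed (exponential) distribution
sample_data <- rexp(50, rate = 1)

R <- 1000                 # number of bootstrap replicates
n <- length(sample_data)  # each resample has the same size as the data

# Steps 1-3: resample with replacement, compute the statistic, repeat R times
boot_means <- replicate(R, {
  resample <- sample(sample_data, size = n, replace = TRUE)
  mean(resample)
})

# Step 4: summarise the bootstrap distribution
boot_se <- sd(boot_means)                         # bootstrap standard error
ci_perc <- quantile(boot_means, c(0.025, 0.975))  # 95% percentile interval

boot_se
ci_perc
```

The boot package used below automates the same loop and adds several interval types on top of it.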

Lab: Simulated Dataset

set.seed(4)

# Number of observations
n <- 100

# Generate x from a normal distribution (mean = 10, sd = 2)
x <- rnorm(n, mean = 10, sd = 2)
# Generate y as a linear function of x plus normal error (sd = 2)
y <- 3 + 1.5 * x + rnorm(n, mean = 0, sd = 2)

# Combine into a data frame
data <- data.frame(x, y)

# Set x to missing (NA) for 10 randomly chosen observations
data[sample(1:n, 10), "x"] <- NA

# Show the first 6 rows
head(data)

##           x        y
## 1        NA 20.01987
## 2  8.915015 16.14230
## 3 11.782289 19.96048
## 4 11.191961 19.57640
## 5        NA 22.99662
## 6 11.378551 16.61548

Lab 1: Bootstrap for Regression (Complete Cases)

# Drop rows that contain NA
clean_data <- na.omit(data)

# Statistic function for the regression bootstrap
boot_regression <- function(data, indices) {
  # Take the bootstrap sample given by indices
  d <- data[indices, ]
  # Fit the linear regression model
  model <- lm(y ~ x, data = d)
  # Return the model coefficients
  return(coef(model))
}

# Load the boot package
library(boot)

# Run the bootstrap with 1000 replicates
boot_result <- boot(
  data = clean_data,
  statistic = boot_regression,
  R = 1000
)

# Show the results
boot_result

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = clean_data, statistic = boot_regression, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original       bias    std. error
## t1* 4.900169  0.037896455   1.1796225
## t2* 1.291943 -0.003597287   0.1127961

# Plot the bootstrap distribution
plot(boot_result)

# 95% percentile confidence interval for the coefficient of x (index = 2)
boot.ci(boot_result, type = "perc", index = 2)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_result, type = "perc", index = 2)
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 1.069,  1.514 )  
## Calculations and Intervals on Original Scale
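
The percentile interval is only one of the interval types that boot.ci() offers. As a rough sketch, the block below requests several types from the same replicates and sets them next to the classical t-based interval from confint(). The data frame toy, the helper slope_stat, and the seed are assumptions for this illustration; the data here contain no missing values, so the numbers will not match the output above.

```r
library(boot)

set.seed(4)  # arbitrary seed for this sketch
n <- 100
x <- rnorm(n, mean = 10, sd = 2)
y <- 3 + 1.5 * x + rnorm(n, mean = 0, sd = 2)
toy <- data.frame(x, y)  # hypothetical complete data set

# Hypothetical helper: return only the slope of y ~ x for a resample
slope_stat <- function(data, indices) {
  coef(lm(y ~ x, data = data[indices, ]))[2]
}

b <- boot(data = toy, statistic = slope_stat, R = 1000)

# Normal, basic, percentile and BCa intervals from the same replicates
ci <- boot.ci(b, type = c("norm", "basic", "perc", "bca"))

# Classical t-based interval from lm, for reference
classical <- confint(lm(y ~ x, data = toy))["x", ]

ci
classical
```

When the regression assumptions roughly hold, as in this simulation, the bootstrap intervals and the classical interval would be expected to be close to one another.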

Lab 2: Estimation with Missing Values Using the Bootstrap

# Mean of x, ignoring NA
mean_x <- mean(data$x, na.rm = TRUE)

# New variable with missing x replaced by the mean
data$ximp <- ifelse(is.na(data$x), mean_x, data$x)

# Fit the model after imputation
model_imp <- lm(y ~ ximp, data = data)
summary(model_imp)

## 
## Call:
## lm(formula = y ~ ximp, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.888 -1.532 -0.045  1.492  7.533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0560     1.3125   3.852 0.000209 ***
## ximp          1.2919     0.1277  10.115  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.238 on 98 degrees of freedom
## Multiple R-squared:  0.5108, Adjusted R-squared:  0.5058 
## F-statistic: 102.3 on 1 and 98 DF,  p-value: < 2.2e-16

# Bootstrap statistic function for the mean-imputed data
boot_imp <- function(data, indices) {
  d <- data[indices, ]
  model <- lm(y ~ ximp, data = d)
  return(coef(model))
}

# Run the bootstrap
boot_result_imp <- boot(data = data, statistic = boot_imp, R = 1000)

# Results
boot_result_imp

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = boot_imp, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original       bias    std. error
## t1* 5.056018  0.005719605   1.1578607
## t2* 1.291943 -0.000120239   0.1108483

plot(boot_result_imp)

boot.ci(boot_result_imp, type = "perc", index = 2)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_result_imp, type = "perc", index = 2)
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 1.064,  1.516 )  
## Calculations and Intervals on Original Scale
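
In the code above, mean_x is computed once from the full data, so every bootstrap replicate reuses the same imputed values and the imputation step itself is never resampled. One possible variation, sketched here under the assumption that mean imputation is still the method of choice, moves the imputation inside the statistic function. The names toy and boot_impute_then_fit are hypothetical and exist only for this illustration.

```r
library(boot)

set.seed(4)  # arbitrary seed for this sketch
n <- 100
x <- rnorm(n, mean = 10, sd = 2)
y <- 3 + 1.5 * x + rnorm(n, mean = 0, sd = 2)
toy <- data.frame(x, y)
toy[sample(1:n, 10), "x"] <- NA  # 10 values of x set to missing

# Hypothetical helper: impute the mean within each resample, then fit
boot_impute_then_fit <- function(data, indices) {
  d <- data[indices, ]
  # The imputed value is the mean of x observed in this resample
  d$ximp <- ifelse(is.na(d$x), mean(d$x, na.rm = TRUE), d$x)
  coef(lm(y ~ ximp, data = d))
}

res <- boot(data = toy, statistic = boot_impute_then_fit, R = 1000)

res
boot.ci(res, type = "perc", index = 2)
```

With this ordering, each replicate imputes from its own resample, so the spread of the replicates also reflects the uncertainty in the imputed mean.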

Lab 3: Multiple Imputation + Bootstrap

library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

# Run multiple imputation (m = 5) with predictive mean matching
imp <- mice(
 data[, c("x", "y")],
 m = 5,
 method = 'pmm',
 seed = 44
)

## 
##  iter imp variable
##   1   1  x
##   1   2  x
##   1   3  x
##   1   4  x
##   1   5  x
##   2   1  x
##   2   2  x
##   2   3  x
##   2   4  x
##   2   5  x
##   3   1  x
##   3   2  x
##   3   3  x
##   3   4  x
##   3   5  x
##   4   1  x
##   4   2  x
##   4   3  x
##   4   4  x
##   4   5  x
##   5   1  x
##   5   2  x
##   5   3  x
##   5   4  x
##   5   5  x

# Stack the imputed datasets in long format
imp_data <- complete(imp, "long")

# Fit the model on each imputed dataset and pool the results
model_mi <- with(imp, lm(y ~ x))
summary(pool(model_mi))

##          term estimate std.error statistic       df      p.value
## 1 (Intercept) 4.623083 1.1327217  4.081393 87.82467 9.836892e-05
## 2           x 1.325432 0.1092058 12.137015 88.13575 1.666816e-20
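
pool() combines the m fitted models using Rubin's rules: the pooled estimate is the average of the per-imputation estimates, and the pooled variance adds the average within-imputation variance to an inflated between-imputation variance. The sketch below reproduces that arithmetic by hand in base R. The five estimates and standard errors are made-up numbers chosen for illustration, not values taken from the mice run above.

```r
# Hypothetical slope estimates and standard errors from m = 5 imputed datasets
est <- c(1.30, 1.34, 1.32, 1.33, 1.31)
se  <- c(0.11, 0.10, 0.11, 0.10, 0.11)
m   <- length(est)

q_bar <- mean(est)                 # pooled point estimate
u_bar <- mean(se^2)                # average within-imputation variance
b_var <- var(est)                  # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b_var  # total variance under Rubin's rules

pooled_se <- sqrt(t_var)

c(estimate = q_bar, std.error = pooled_se)
```

The between-imputation term is what lets multiple imputation carry the uncertainty about the missing values into the final standard error, which a single imputation cannot do.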

Combined Results

# Load the required packages (they must already be installed)
library(mice)
library(broom)

library(boot)

# 1. Complete-case model
model_clean <- lm(y ~ x, data = clean_data)
clean_summary <- tidy(model_clean, conf.int = TRUE)

# 2. Mean imputation + bootstrap
# Assumes boot_result_imp was created earlier
boot_ci <- boot.ci(boot_result_imp, type = "perc", index = 2)
# Point estimates and model-based SE come from lm; the bootstrap enters only through boot_ci
boot_summary <- tidy(model_imp, conf.int = TRUE)

# 3. MICE model
model_mice <- with(imp, lm(y ~ x))
mice_summary <- summary(pool(model_mice), conf.int = TRUE)

# Collect the results in one data frame
results_table <- data.frame(
  Metode = c("Complete Cases", "Mean Imputation + Bootstrap", "MICE"),
  Intercept = c(
    clean_summary$estimate[1],
    boot_summary$estimate[1],
    mice_summary$estimate[1]
  ),
  Slope = c(
    clean_summary$estimate[2],
    boot_summary$estimate[2],
    mice_summary$estimate[2]
  ),
  # Note: the SE in row 2 is the model-based SE from lm, not the bootstrap SE
  SE_Slope = c(
    clean_summary$std.error[2],
    boot_summary$std.error[2],
    mice_summary$std.error[2]
  ),
  CI_Slope = c(
    sprintf("(%.3f, %.3f)", clean_summary$conf.low[2], clean_summary$conf.high[2]),
    sprintf("(%.3f, %.3f)", boot_ci$percent[4], boot_ci$percent[5]),
    sprintf("(%.3f, %.3f)", mice_summary$`2.5 %`[2], mice_summary$`97.5 %`[2])
  ),
  stringsAsFactors = FALSE
)

# Show the results
print(results_table)

##                        Metode Intercept    Slope  SE_Slope       CI_Slope
## 1              Complete Cases  4.900169 1.291943 0.1128094 (1.068, 1.516)
## 2 Mean Imputation + Bootstrap  5.056018 1.291943 0.1277218 (1.064, 1.516)
## 3                        MICE  4.623083 1.325432 0.1092058 (1.108, 1.542)
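
The first two rows of the table share the same slope (1.291943). This is expected rather than coincidental: a row whose x is imputed at the observed mean has zero deviation from the mean of x, so it adds nothing to the sums that define the least-squares slope. The sketch below checks this on a data set simulated the same way as above; the name toy is a hypothetical stand-in used only here.

```r
set.seed(4)  # arbitrary seed for this sketch
n <- 100
x <- rnorm(n, mean = 10, sd = 2)
y <- 3 + 1.5 * x + rnorm(n, mean = 0, sd = 2)
toy <- data.frame(x, y)
toy[sample(1:n, 10), "x"] <- NA  # 10 values of x set to missing

# Mean imputation of x
toy$ximp <- ifelse(is.na(toy$x), mean(toy$x, na.rm = TRUE), toy$x)

# lm() drops rows with NA by default, so this is the complete-case fit
fit_cc  <- lm(y ~ x, data = toy)
fit_imp <- lm(y ~ ximp, data = toy)

slope_cc  <- coef(fit_cc)[["x"]]
slope_imp <- coef(fit_imp)[["ximp"]]

# Slopes agree up to floating-point tolerance; intercepts generally do not
all.equal(slope_cc, slope_imp)
c(intercept_cc = coef(fit_cc)[[1]], intercept_imp = coef(fit_imp)[[1]])
```

The intercepts differ because the mean of y changes once the rows with imputed x are included in the fit.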

library(ggplot2)

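# Plot data, taken from the objects computed above instead of hard-coded values
results <- data.frame(
  Method = c("Complete Cases", "Mean Imp + Bootstrap", "MICE"),
  Slope = c(clean_summary$estimate[2], boot_summary$estimate[2], mice_summary$estimate[2]),
  SE = c(clean_summary$std.error[2], boot_summary$std.error[2], mice_summary$std.error[2]),
  CI_lower = c(clean_summary$conf.low[2], boot_ci$percent[4], mice_summary$`2.5 %`[2]),
  CI_upper = c(clean_summary$conf.high[2], boot_ci$percent[5], mice_summary$`97.5 %`[2])
)

ggplot(results, aes(x = Method, y = Slope, color = Method)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.2) +
  labs(
    title = "Comparison of Slope Estimates Across Methods",
    y = "Slope estimate (y ~ x)"
  ) +
  theme_minimal()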

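Analysis of the Differences Between Estimates

Slope Estimates (Coefficient of x)

Consistency:

– All three methods give similar slopes: 1.292 for complete cases, 1.292 for mean imputation, and 1.325 for MICE

– The largest difference is 0.033, about 2.6% of the slope

– All three 95% intervals contain the true slope of 1.5 used in the simulation, which suggests that the missingness pattern has little effect on the estimated x-y relationship

Small differences:

– MICE gives the highest slope (1.325)

– Complete-case analysis and mean imputation give the same slope (1.292), because rows imputed at the observed mean of x contribute nothing to the covariance between x and y or to the variance of x

Intercept Estimates

More noticeable variation:

– Range: 4.623 (MICE) to 5.056 (mean imputation), with complete cases at 4.900

– The difference is about 0.433, roughly 9% of the intercept

– Mean imputation gives the highest intercept. Its slope and the mean of x are unchanged from the complete-case fit, so the shift comes entirely from the mean of y, which now includes the 10 rows whose x was imputed.

Standard Error (SE) of the Slope

Consistency:

– The SE for complete cases (0.113) and for MICE (0.109) are very close

– The mean-imputation row shows a larger SE (0.128). This value is the model-based SE from lm(y ~ ximp), not the bootstrap SE reported earlier for the same model (0.111). The model-based SE grows here because the imputed rows add residual variance without adding any spread in x.

Confidence Interval (CI)

CI width:

– Complete cases: 0.448 (1.516 - 1.068)

– Mean imputation + bootstrap: 0.452 (1.516 - 1.064)

– MICE: 0.434 (1.542 - 1.108)

Pattern: the three intervals differ in width by less than 0.02. Theory predicts that single mean imputation understates uncertainty, because imputed values are treated as if they had been observed, but that effect is not visible here. Possible reasons:

– Few values are missing (10 of 100), so the impact of imputation is small

– With R = 1000 bootstrap replicates and m = 5 imputations, Monte Carlo error may be of a similar size to the differences in width

– The data are missing completely at random (MCAR)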
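
In this simulation MCAR holds by construction, since the rows to blank out are chosen with sample(). With real data the mechanism is unknown. One informal probe, sketched below, compares y between rows where x is missing and rows where it is observed. The name toy is a hypothetical stand-in, and a large p-value here is only consistent with MCAR; it does not establish it.

```r
set.seed(4)  # arbitrary seed for this sketch
n <- 100
x <- rnorm(n, mean = 10, sd = 2)
y <- 3 + 1.5 * x + rnorm(n, mean = 0, sd = 2)
toy <- data.frame(x, y)
toy[sample(1:n, 10), "x"] <- NA  # missingness assigned purely at random

# Indicator: TRUE where x is missing
miss <- is.na(toy$x)

# Welch two-sample t-test of y between the two groups
tt <- t.test(toy$y ~ miss)

table(miss)
tt$p.value
```

If missingness depended on y, the two groups would tend to differ in their mean of y, and this comparison would be one place where that could show up.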