1. Introduction

In this practical, we explore computer-intensive methods that approximate sampling distributions
using only the observed data and resampling:

Bootstrap Methods

  • Resampling with replacement
  • Estimation of bias, standard error, confidence intervals
  • Bootstrapping in regression

Jackknife and Cross-Validation

  • Jackknife for variance estimation (with a sample survey flavour)
  • Jackknife for a ratio estimator
  • Cross-validation for tuning parameters in predictive models

We will use both simulated data and real data (built-in R datasets like mtcars).


2. Bootstrap Methods

2.1 Brief Theory

Suppose we observe i.i.d. data
\[ X_1, X_2, \ldots, X_n \sim F \]
from an unknown distribution \(F\).

We want the sampling distribution of a statistic
\[ T = T(X_1, X_2, \ldots, X_n) \]
(e.g., mean, median, regression coefficient).

Bootstrap Idea

  • Estimate \(F\) by the empirical distribution \(\hat{F}_n\), which puts mass \(1/n\) on each observation.
  • Draw bootstrap samples
    \[ X_1^*, X_2^*, \ldots, X_n^* \sim \hat{F}_n \quad (\text{with replacement}) \]
  • Recompute the statistic for each resample:
    \[ T^* = T(X_1^*, X_2^*, \ldots, X_n^*) \]
  • Repeat many times (say \(B = 1000\)) to approximate the sampling distribution of \(T\).

From the bootstrap replicates \((T_1^*, T_2^*, \ldots, T_B^*)\), we estimate:

Bootstrap Standard Error (SE): \[ \hat{SE}_{boot}(T) = \sqrt{\frac{1}{B-1} \sum_{b=1}^B (T_b^* - \bar{T}^*)^2} \]

Bootstrap Bias: \[ \hat{Bias}_{boot}(T) = \bar{T}^* - T_{obs} \]

Bootstrap Confidence Intervals (e.g., Percentile CI):
Take the empirical \(\alpha/2\) and \(1 - \alpha/2\) quantiles of the \(T_b^*\).
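
These quantities are easy to compute once the replicates are in hand. As a minimal generic sketch in R (the helper name boot_summary is ours, not from any package):

boot_summary <- function(x, stat, B = 1000, alpha = 0.05) {
  t_obs  <- stat(x)
  t_star <- replicate(B, stat(sample(x, replace = TRUE)))  # resample with replacement
  list(
    se   = sd(t_star),                                 # bootstrap SE
    bias = mean(t_star) - t_obs,                       # bootstrap bias
    ci   = quantile(t_star, c(alpha/2, 1 - alpha/2))   # percentile CI
  )
}

The explicit loops in the next subsection carry out the same steps one at a time.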


2.2 Simulation Example: Bootstrap for Mean and Median

We simulate from a skewed distribution (Exponential). The median is a non-linear statistic,
so standard formulae for its standard error are less straightforward than for the mean.

# Simulate data from an exponential distribution

n <- 100
lambda <- 1  # rate
x <- rexp(n, rate = lambda)

mean_obs   <- mean(x)
median_obs <- median(x)

mean_obs
## [1] 1.045719
median_obs
## [1] 0.847754

2.2.1 Nonparametric Bootstrap for Mean and Median

B <- 2000

boot_mean <- numeric(B)
boot_median <- numeric(B)

for (b in 1:B) {
  x_star <- sample(x, size = n, replace = TRUE)
  boot_mean[b]   <- mean(x_star)
  boot_median[b] <- median(x_star)
}

# Bootstrap estimates

se_mean_boot   <- sd(boot_mean)
se_median_boot <- sd(boot_median)

bias_mean_boot   <- mean(boot_mean)   - mean_obs
bias_median_boot <- mean(boot_median) - median_obs

list(
  mean_obs         = mean_obs,
  se_mean_boot     = se_mean_boot,
  bias_mean_boot   = bias_mean_boot,
  median_obs       = median_obs,
  se_median_boot   = se_median_boot,
  bias_median_boot = bias_median_boot
)
## $mean_obs
## [1] 1.045719
## 
## $se_mean_boot
## [1] 0.102229
## 
## $bias_mean_boot
## [1] 0.00116082
## 
## $median_obs
## [1] 0.847754
## 
## $se_median_boot
## [1] 0.1403908
## 
## $bias_median_boot
## [1] 0.007361685

2.2.2 Bootstrap Confidence Intervals (Percentile Method)

alpha <- 0.05

ci_mean   <- quantile(boot_mean,   probs = c(alpha/2, 1 - alpha/2))
ci_median <- quantile(boot_median, probs = c(alpha/2, 1 - alpha/2))

ci_mean
##      2.5%     97.5% 
## 0.8668418 1.2564491
ci_median
##      2.5%     97.5% 
## 0.5826475 1.0741226
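
The percentile interval is the simplest bootstrap interval. Other standard intervals can be built from the same replicates; a sketch for the mean, computed by hand here rather than via the boot package:

# Normal-approximation interval: T_obs +/- z * SE_boot
z <- qnorm(1 - alpha/2)
ci_mean_normal <- mean_obs + c(-1, 1) * z * se_mean_boot

# Basic (reflected percentile) interval: (2*T_obs - upper, 2*T_obs - lower)
ci_mean_basic <- 2 * mean_obs - rev(unname(ci_mean))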

2.2.3 Visualizing Bootstrap Distributions

par(mfrow = c(1,2))

hist(boot_mean, breaks = 30, main = "Bootstrap Mean",
     xlab = "Mean*", probability = TRUE)
abline(v = mean_obs, col = "red", lwd = 2)

hist(boot_median, breaks = 30, main = "Bootstrap Median",
     xlab = "Median*", probability = TRUE)
abline(v = median_obs, col = "red", lwd = 2)

par(mfrow = c(1,1))

2.3 Real Data Example: Bootstrapping Regression Coefficients (mtcars)

We now use the classic mtcars dataset.
We model:

\[ mpg = \beta_0 + \beta_1 \cdot wt + \beta_2 \cdot hp + \varepsilon \]

where:

  • \(mpg\) = miles per gallon (response variable)
  • \(wt\) = weight of the car (in 1000 lbs)
  • \(hp\) = gross horsepower
  • \(\varepsilon\) = error term

Bootstrap Approach

  • Resample the rows of the regression dataset with replacement.
  • Fit the regression model for each bootstrap sample.
  • Collect the bootstrap replicates of the coefficients \((\beta_0^*, \beta_1^*, \beta_2^*)\).
  • Use these replicates to approximate the sampling distribution of the regression coefficients.

This allows us to estimate:

  • Bootstrap SEs for each coefficient
  • Bias of the estimates
  • Confidence intervals (e.g., percentile intervals) for \(\beta_0, \beta_1, \beta_2\)

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
coef_obs <- coef(fit_lm)
coef_obs
## (Intercept)          wt          hp 
## 37.22727012 -3.87783074 -0.03177295

2.3.1 Bootstrap the Regression (Case Resampling)

B <- 2000
n <- nrow(mtcars)

boot_coefs <- matrix(NA, nrow = B, ncol = length(coef_obs))
colnames(boot_coefs) <- names(coef_obs)

for (b in 1:B) {
  idx <- sample(1:n, size = n, replace = TRUE)
  dat_star <- mtcars[idx, ]
  fit_star <- lm(mpg ~ wt + hp, data = dat_star)
  boot_coefs[b, ] <- coef(fit_star)
}

boot_se <- apply(boot_coefs, 2, sd)
boot_bias <- colMeans(boot_coefs) - coef_obs

boot_se
## (Intercept)          wt          hp 
## 2.133899525 0.712481184 0.007534983
boot_bias
##   (Intercept)            wt            hp 
##  0.0507673972 -0.0127911937 -0.0008579728

2.3.2 Bootstrap Confidence Intervals for Regression Coefficients

alpha <- 0.05
boot_ci <- apply(boot_coefs, 2, quantile, probs = c(alpha/2, 1 - alpha/2))

boot_ci
##       (Intercept)        wt          hp
## 2.5%     33.04878 -5.333170 -0.04910424
## 97.5%    41.56600 -2.488106 -0.01991454

We can compare these with the usual t-based confidence intervals from the fitted model.
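
The t-based intervals are available directly via confint():

confint(fit_lm, level = 0.95)

Unlike the t-based intervals, the case-resampling bootstrap does not rely on the homoscedastic normal-error model, so moderate disagreement between the two is not surprising.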

3. Jackknife and Cross-Validation

3.1 Jackknife: Brief Theory

The jackknife is another resampling method, based on systematic deletion of observations.

Given a statistic: \[ T = T(X_1, X_2, \ldots, X_n) \]

For each \(i = 1, \ldots, n\), remove \(X_i\) and recompute the statistic: \[ T^{(i)} = T(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n) \]

Define the jackknife average: \[ \bar{T}^{(\cdot)} = \frac{1}{n} \sum_{i=1}^n T^{(i)} \]

Jackknife variance estimate: \[ \widehat{\text{Var}}_{JK}(T) = \frac{n-1}{n} \sum_{i=1}^n \left( T^{(i)} - \bar{T}^{(\cdot)} \right)^2 \]

Notes

  • In sample surveys, jackknife is often applied to complex estimators (e.g., ratios, weighted means, totals).
  • Each PSU (Primary Sampling Unit) or cluster can be treated as a jackknife unit.
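
A compact delete-one sketch in R (the helper name jackknife_se is ours, not from a package); the worked examples below carry out the same computation with explicit loops:

jackknife_se <- function(x, stat) {
  n      <- length(x)
  t_jack <- sapply(1:n, function(i) stat(x[-i]))  # leave-one-out replicates
  sqrt((n - 1) / n * sum((t_jack - mean(t_jack))^2))
}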

3.2 Simulation: Jackknife Variance of a Sample Mean (Finite Population View)

  • Construct a finite population (e.g., 500 units).
  • Draw a Simple Random Sample (SRS).
  • Estimate the population mean.
  • Apply jackknife to estimate the variance of the mean.

# Finite population of size N

N <- 500
pop_y <- rnorm(N, mean = 50, sd = 10)  # e.g. income-like variable

# True population mean

mu_pop <- mean(pop_y)
mu_pop
## [1] 50.19899
# Draw a simple random sample without replacement (survey style)

n <- 50
s_idx <- sample(1:N, size = n, replace = FALSE)
y_s <- pop_y[s_idx]

# Sample mean (estimator of population mean)

t_hat <- mean(y_s)
t_hat
## [1] 48.21283

3.2.1 Jackknife for the Mean

# Jackknife leave-one-out estimates

t_jack <- numeric(n)

for (i in 1:n) {
  t_jack[i] <- mean(y_s[-i])  # leave-one-out mean
}

t_bar_dot <- mean(t_jack)

# Jackknife variance

var_jack <- (n - 1) / n * sum((t_jack - t_bar_dot)^2)
se_jack  <- sqrt(var_jack)

list(
  sample_mean  = t_hat,
  jackknife_se = se_jack
)
## $sample_mean
## [1] 48.21283
## 
## $jackknife_se
## [1] 1.457039

We can compare with the usual SRS variance estimator. For the sample mean, the jackknife variance reduces algebraically to \(s^2/n\), so the two estimates differ only by the finite population correction:

# Classical SE of mean for SRSWOR (finite population correction)

s2 <- var(y_s)
fpc <- (N - n) / (N - 1)
se_srs <- sqrt(s2 / n * fpc)

se_srs
## [1] 1.383653

3.3 Jackknife for a Ratio Estimator (Survey Flavour)

Consider a finite population with two variables:

  • \(Y\) (e.g., total expenditure)
  • \(X\) (e.g., household size)

We want to estimate the population ratio: \[ R = \frac{\sum Y}{\sum X} \]

Jackknife Procedure

  1. Draw a sample from the finite population.
  2. Compute the ratio estimator: \[ \hat{R} = \frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n X_i} \]
  3. For each \(i = 1, \ldots, n\), delete the \(i\)-th unit and recompute the ratio: \[ \hat{R}^{(i)} = \frac{\sum_{j \neq i} Y_j}{\sum_{j \neq i} X_j} \]
  4. Compute the jackknife average: \[ \bar{R}^{(\cdot)} = \frac{1}{n} \sum_{i=1}^n \hat{R}^{(i)} \]
  5. Estimate the jackknife variance: \[ \widehat{\text{Var}}_{JK}(\hat{R}) = \frac{n-1}{n} \sum_{i=1}^n \left( \hat{R}^{(i)} - \bar{R}^{(\cdot)} \right)^2 \]

Notes

  • This approach is widely used in survey sampling, especially for complex estimators like ratios, weighted means, or totals.
  • Each sampled unit (or PSU/cluster in complex surveys) can serve as a jackknife unit.

# Finite population with X and Y

set.seed(123)
N <- 500
X_pop <- rpois(N, lambda = 5) + 1          # e.g. household size
Y_pop <- 100 + 10 * X_pop + rnorm(N, 0, 20)  # expenditure

R_pop <- sum(Y_pop) / sum(X_pop)
R_pop
## [1] 26.72235

Take an SRSWOR sample and compute the ratio estimator:

n <- 60
s_idx <- sample(1:N, n, replace = FALSE)
X_s <- X_pop[s_idx]
Y_s <- Y_pop[s_idx]

R_hat <- sum(Y_s) / sum(X_s)
R_hat
## [1] 26.32323

3.3.1 Jackknife Variance for the Ratio Estimator

R_jack <- numeric(n)

for (i in 1:n) {
  X_minus_i <- X_s[-i]
  Y_minus_i <- Y_s[-i]
  R_jack[i] <- sum(Y_minus_i) / sum(X_minus_i)
}

R_bar_dot <- mean(R_jack)

var_R_jack <- (n - 1) / n * sum((R_jack - R_bar_dot)^2)
se_R_jack  <- sqrt(var_R_jack)

list(
  R_hat          = R_hat,
  jackknife_se_R = se_R_jack
)
## $R_hat
## [1] 26.32323
## 
## $jackknife_se_R
## [1] 0.8731723

This mimics the use of jackknife in sample survey analysis for complex, nonlinear estimators.

3.4 Cross-Validation: Brief Theory

Cross-validation (CV) is used primarily for model selection and tuning hyperparameters.
The most common approach is K-fold cross-validation.

Procedure

  1. Split the data into \(K\) roughly equal folds.
  2. For each fold \(k\):
    • Fit the model on the remaining \(K-1\) folds (training set).
    • Evaluate prediction error on fold \(k\) (validation set).
  3. Average the validation errors over all \(K\) folds.
  4. Use the average CV error to choose hyperparameters (e.g., polynomial degree, regularization parameter, number of neighbours).

This provides an estimate of out-of-sample prediction error and helps avoid selecting over-fitted models.
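
A minimal sketch of this loop for a linear model (the helper name kfold_mse and its interface are ours):

kfold_mse <- function(formula, dat, K = 10) {
  n     <- nrow(dat)
  folds <- sample(rep(1:K, length.out = n))           # random fold labels
  y     <- model.response(model.frame(formula, dat))  # response values
  mse_k <- sapply(1:K, function(k) {
    fit  <- lm(formula, data = dat[folds != k, ])      # train on K-1 folds
    pred <- predict(fit, newdata = dat[folds == k, ])  # predict held-out fold
    mean((y[folds == k] - pred)^2)                     # validation MSE
  })
  mean(mse_k)
}

For example, kfold_mse(mpg ~ wt + hp, mtcars, K = 5) estimates the CV error of the first model compared in Section 3.6.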


3.5 Simulation: Choosing Polynomial Degree via Cross-Validation

We simulate data from a quadratic model, but pretend we do not know the true degree.

Steps

  • Generate data from a quadratic function with added noise.
  • Fit polynomial regression models of varying degrees (here, degrees 1 to 5).
  • Use K-fold cross-validation to compute prediction error for each degree.
  • Select the degree with the lowest CV error as the best model.

Notes

  • This illustrates how CV can guide model complexity selection.
  • Even though the true model is quadratic, CV allows us to discover this empirically.

set.seed(2024)

n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 1)

plot(x, y, main = "Simulated Data (Quadratic Relationship)")

We compare polynomial degrees 1 to 5 using 10-fold CV.

K <- 10
folds <- sample(rep(1:K, length.out = n))

degrees <- 1:5
cv_errors <- numeric(length(degrees))

for (d in degrees) {
  fold_mse <- numeric(K)

  for (k in 1:K) {
    test_idx  <- which(folds == k)
    train_idx <- setdiff(1:n, test_idx)

    x_train <- x[train_idx]
    y_train <- y[train_idx]
    x_test  <- x[test_idx]
    y_test  <- y[test_idx]

    # Fit polynomial of degree d
    fit_d <- lm(y_train ~ poly(x_train, degree = d, raw = TRUE))

    # Predict on the held-out fold (the newdata column must reuse the name x_train)
    y_pred <- predict(fit_d, newdata = data.frame(x_train = x_test))
    fold_mse[k] <- mean((y_test - y_pred)^2)
  }

  cv_errors[d] <- mean(fold_mse)
}

data.frame(
  degree = degrees,
  cv_mse = cv_errors
)
##   degree     cv_mse
## 1      1 14.7734847
## 2      2  0.9378318
## 3      3  0.8953893
## 4      4  0.8965147
## 5      5  0.9192455
plot(degrees, cv_errors, type = "b", xlab = "Polynomial Degree",
     ylab = "10-fold CV MSE", main = "Choosing Degree via Cross-Validation")

The CV error drops sharply from degree 1 to degree 2 and is essentially flat thereafter (degree 3 is nominally lowest here, but the differences among degrees 2 to 5 are negligible), consistent with the true quadratic model.

3.6 Real Data: Cross-Validation for Regression Tuning (mtcars)

We illustrate K-fold cross-validation (CV) for choosing whether to include a quadratic term in regression.

Models

  • Model 1: \[ mpg \sim wt + hp \]

  • Model 2: \[ mpg \sim wt + hp + I(wt^2) \]

Procedure

  1. Split the mtcars dataset into \(K\) folds (e.g., \(K = 5\) or \(10\)).
  2. For each fold:
    • Fit the regression model on the training folds.
    • Compute prediction error on the validation fold.
  3. Average the prediction errors across folds for each model.
  4. Compare the average CV errors of Model 1 and Model 2.
  5. Select the model with the lower CV error as the better specification.

Notes

  • This example demonstrates how CV can guide model specification decisions.
  • Including a quadratic term (\(wt^2\)) allows us to test whether a nonlinear relationship between weight and mpg improves predictive performance.

data(mtcars)

y <- mtcars$mpg
X1 <- mtcars[, c("wt", "hp")]
X2 <- data.frame(wt  = mtcars$wt,
                 hp  = mtcars$hp,
                 wt2 = mtcars$wt^2)

n <- nrow(mtcars)
K <- 5
set.seed(42)
folds <- sample(rep(1:K, length.out = n))

cv_err_m1 <- numeric(K)
cv_err_m2 <- numeric(K)

for (k in 1:K) {
  test_idx  <- which(folds == k)
  train_idx <- setdiff(1:n, test_idx)

  # Model 1
  fit1 <- lm(y ~ ., data = data.frame(y = y[train_idx], X1[train_idx, ]))
  pred1 <- predict(fit1, newdata = X1[test_idx, ])
  cv_err_m1[k] <- mean((y[test_idx] - pred1)^2)

  # Model 2
  fit2 <- lm(y ~ ., data = data.frame(y = y[train_idx], X2[train_idx, ]))
  pred2 <- predict(fit2, newdata = X2[test_idx, ])
  cv_err_m2[k] <- mean((y[test_idx] - pred2)^2)
}

mean_cv_m1 <- mean(cv_err_m1)
mean_cv_m2 <- mean(cv_err_m2)

c(
  mean_cv_m1 = mean_cv_m1,
  mean_cv_m2 = mean_cv_m2
)
## mean_cv_m1 mean_cv_m2 
##   7.887135   5.320388

Here mean_cv_m2 ≈ 5.32 is smaller than mean_cv_m1 ≈ 7.89, so the model with the quadratic term has better predictive performance according to cross-validation.

4. Summary

Bootstrap

  • Uses resampling with replacement from the observed data to approximate the sampling distribution.
  • Provides estimates of bias, standard error, and confidence intervals.
  • Works naturally with regression and complex statistics.

Jackknife

  • Based on systematic deletion (often delete-one or delete-PSU) to estimate variance and bias.
  • Widely used in sample survey contexts for nonlinear estimators.

Cross-Validation

  • Splits data into training and validation sets to estimate prediction error.
  • Used to tune hyperparameters and compare models (e.g., polynomial degree, model complexity).

Overall

These methods together form a powerful practical toolkit for modern data analysis,
especially when analytic formulae are difficult or unreliable.