1 Set Up

# Clear the workspace
  rm(list = ls())  # Clear environment
  gc()             # Clear unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 525472 28.1    1166702 62.4         NA   669291 35.8
## Vcells 966904  7.4    8388608 64.0      32768  1840401 14.1
  cat("\f")        # Clear the console
  if(!is.null(dev.list())) dev.off() # Clear all plots
## null device 
##           1
# Install and load the 'boot' package if not already installed
# install.packages("boot")
library(boot)
library(magrittr)
library(ggplot2)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following object is masked from 'package:boot':
## 
##     logit

2 Bootstrapping

Bootstrapping is a resampling technique used to estimate the sampling distribution of a statistic by repeatedly resampling with replacement from the observed data.

Applied to the mean, the procedure estimates the sampling distribution of the sample mean. The goal is to infer properties of the population mean, such as its standard error and a confidence interval, without assuming a specific distribution for the data.

Here’s a step-by-step explanation of bootstrapping for the mean:

  1. Original Data:

    • Start with a dataset of observed values (e.g., a sample of data).
  2. Resampling:

    • Randomly draw samples with replacement from the original dataset. Each sample is of the same size as the original dataset.
  3. Statistic Calculation:

    • Calculate the mean for each of the bootstrap samples.
  4. Repeat:

    • Repeat steps 2 and 3 a large number of times (e.g., thousands of times) to create a distribution of sample means.
  5. Analysis:

    • Analyze the distribution of sample means to estimate properties such as the mean, standard deviation, and confidence intervals.
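
The five steps above can be sketched directly in base R, without any package. This is only a minimal illustration: the simulated sample `x`, the replicate count `n_boot`, and the other variable names are assumptions chosen for the example, not part of the original analysis.

```r
# Step 1: original data (a hypothetical simulated sample)
set.seed(123)
x <- rnorm(100, mean = 10, sd = 2)

n_boot <- 2000                  # number of bootstrap replicates (arbitrary choice)
boot_means <- numeric(n_boot)   # storage for the resampled means

# Step 4: repeat steps 2 and 3 many times
for (b in seq_len(n_boot)) {
  # Step 2: draw length(x) values with replacement from the observed sample
  resample <- sample(x, size = length(x), replace = TRUE)
  # Step 3: compute the statistic on the resample
  boot_means[b] <- mean(resample)
}

# Step 5: summarise the bootstrap distribution
boot_se <- sd(boot_means)                                  # bootstrap standard error
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))   # 95% percentile interval

boot_se
boot_ci
```

The explicit loop is slower than a packaged implementation, but it makes each step visible.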

Bootstrapping is particularly useful when the underlying distribution of the data is unknown or complex, as it does not rely on specific distributional assumptions. It provides a way to empirically estimate the variability of a statistic and construct confidence intervals without making strong parametric assumptions.

Here’s a simple example.

2.1 Original Data

# Sample Data

set.seed(123)

my_data <- rnorm(100, 
                 mean = 10, 
                 sd = 2
                 )

describe(my_data)
##    vars   n  mean   sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 100 10.18 1.83  10.12   10.16 1.78 5.38 14.37  8.99 0.06    -0.22 0.18
ggplot(mapping = aes(x = my_data)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.2 Resampling, Statistic Calculation and Repeat

Let’s first see how the boot() function works:

# Define the statistic of interest (mean)
    statistic_function <- function(x, indices) {
      sample_mean <- mean(x[indices])
      return(sample_mean)
    }
  
# Perform bootstrapping
    results <- boot(data = my_data, 
                    statistic = statistic_function, 
                    R = 7
                    )
  • statistic_function: This is a user-defined function that calculates the statistic of interest (in this case, the mean) for each bootstrap sample. The indices parameter is crucial here.

Inside statistic_function:

  • x: This represents the dataset (in our case, my_data).

  • indices: This parameter is automatically provided by the boot package. It contains the indices of the elements that were sampled with replacement for the current bootstrap iteration. In other words, x[indices] corresponds to a resampled version of the original dataset.

By using indices, we ensure that each time the statistic (mean) is calculated, it is computed on a resample of the original data: a vector of the same length as the original in which some observations appear more than once and others not at all. This is what drawing with replacement means, and it is the key aspect of bootstrapping. The boot function takes care of calling statistic_function with a fresh set of indices for each replicate.

So, in summary, indices helps in creating bootstrap samples by specifying which elements from the original dataset should be included in the current resampled dataset for the calculation of the mean.
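
To make the role of indices concrete, the sketch below inspects the index vectors that boot() generated. It uses `boot.array()` from the boot package with `indices = TRUE`. The data, the seed, and the names `mean_stat`, `res`, and `idx` are assumptions made for this illustration.

```r
library(boot)

# Hypothetical sample, mirroring the shape of the data used in this document
set.seed(123)
x <- rnorm(100, mean = 10, sd = 2)

# Statistic: mean of the resampled observations
mean_stat <- function(x, indices) mean(x[indices])

res <- boot(data = x, statistic = mean_stat, R = 7)

# Recover the index vectors used for each replicate:
# one row per replicate, one column per draw
idx <- boot.array(res, indices = TRUE)
dim(idx)

# Recomputing the statistic from the stored indices should reproduce res$t
recomputed <- apply(idx, 1, function(i) mean(x[i]))
all.equal(recomputed, as.vector(res$t))

# Because draws are made with replacement, a resample contains duplicates:
# the number of distinct observations in the first resample is below 100
length(unique(idx[1, ]))
```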

2.3 Implement

# Bootstrap Function for the Mean
bootstrap_mean_ci <- function(data, num_iterations = 1000) {
  
    # Define the statistic of interest (mean)
    statistic_function <- function(x, indices) {
      sample_mean <- mean(x[indices])
      return(sample_mean)
    }
  
    # Perform bootstrapping
    results <- boot(data = data, 
                    statistic = statistic_function, 
                    R = num_iterations
                    )
  
    # Calculate the 95% percentile confidence interval from the bootstrap replicates
    ci <- quantile(x = results$t, 
                   probs = c(0.025, 0.975)
                   )
  
    # Return a list containing the mean of the bootstrap replicates and the confidence interval
    return(list(mean = mean(results$t), 
                ci = ci)
           )
}

2.4 Analysis

# Test the function
result <- bootstrap_mean_ci(my_data)

cat("Bootstrap Estimated Mean:", result$mean, "\n")
## Bootstrap Estimated Mean: 10.1851
cat("Bootstrap 95% Confidence Interval:", result$ci[1], "to", result$ci[2], "\n")
## Bootstrap 95% Confidence Interval: 9.839857 to 10.53614
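
As a cross-check on the quantile-based interval, the boot package also provides `boot.ci()`, which can compute several interval types from the same boot object. The sketch below is an illustration under assumed settings (simulated data, `R = 2000`, the names `mean_stat`, `res`, and `ci`). It also separates the statistic on the original sample (`res$t0`) from the mean of the bootstrap replicates, whose difference is the bootstrap estimate of bias.

```r
library(boot)

# Hypothetical sample, mirroring the shape of the data used in this document
set.seed(123)
x <- rnorm(100, mean = 10, sd = 2)

mean_stat <- function(x, indices) mean(x[indices])
res <- boot(data = x, statistic = mean_stat, R = 2000)

res$t0                  # statistic computed on the original sample
mean(res$t) - res$t0    # bootstrap estimate of bias
sd(res$t)               # bootstrap standard error

# Percentile and bias-corrected accelerated (BCa) intervals
ci <- boot.ci(res, conf = 0.95, type = c("perc", "bca"))

# The last two entries of each component hold the lower and upper limits
ci$percent[4:5]
ci$bca[4:5]
```

The percentile limits from `boot.ci()` can differ slightly from those of `quantile()`, because the two functions interpolate between order statistics differently.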

3 Bootstrapping and the Central Limit Theorem (CLT)

Bootstrapping and the Central Limit Theorem (CLT) are distinct statistical concepts, but they are related in that they both deal with the distribution of sample statistics. Let’s explore the key differences between bootstrapping and the CLT:

  1. Bootstrapping:

    • Nature: Bootstrapping is a resampling technique.

    • Purpose: The primary goal of bootstrapping is to estimate the sampling distribution of a statistic (such as the mean, variance, etc.) by repeatedly resampling with replacement from the observed data.

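    • Assumptions: Bootstrapping does not assume any specific distribution for the population. It is a non-parametric method, but it does rely on the observed data being an independent, representative sample of that population.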

    • Application: Bootstrapping is particularly useful when the underlying distribution of the data is unknown or complex. It provides empirical estimates of standard errors, confidence intervals, and other statistical properties.

  2. Central Limit Theorem (CLT):

    • Nature: The CLT is a theoretical concept.

    • Purpose: The CLT describes the behavior of the distribution of sample means from a population. It states that as the sample size increases, the distribution of sample means approaches a normal distribution, regardless of the shape of the original population distribution.

    • Assumptions: The classical CLT assumes independent, identically distributed observations from a population with a finite mean and variance. The normal approximation it provides is reliable only when the sample size is sufficiently large.

    • Application: The CLT is often invoked when making inferences about the mean of a population. It explains why the normal distribution is frequently used in statistical inference.

3.1 Key Differences

  • Methodology: Bootstrapping involves resampling from the observed data to create new datasets, while the CLT is a theoretical framework that describes the behavior of sample means from a population.

  • Distribution: Bootstrapping provides an empirical estimate of the distribution of a statistic, whereas the CLT describes the asymptotic distribution of sample means.

  • Assumptions: Bootstrapping is non-parametric and makes minimal assumptions about the underlying distribution, while the CLT assumes a finite mean and variance in the population.

In summary, bootstrapping is a practical method for estimating the distribution of a statistic from observed data, without assuming a specific distribution. The CLT, on the other hand, is a theoretical framework explaining the behavior of sample means, especially as sample size increases. While they address related aspects of statistical inference, they serve different purposes and operate in different ways.
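
The two approaches can be compared on a single skewed sample. The sketch below computes a CLT-based standard error and normal-approximation interval, then a bootstrap standard error and percentile interval from the same data. The sample size, the exponential rate, the replicate count, and all variable names are assumptions chosen for this illustration.

```r
library(boot)

# One hypothetical sample of 30 observations from a skewed (exponential) population
set.seed(123)
n <- 30
x <- rexp(n, rate = 0.1)

# CLT-based approach: standard error from the formula sd / sqrt(n),
# interval from the normal approximation
clt_se <- sd(x) / sqrt(n)
clt_ci <- mean(x) + c(-1, 1) * qnorm(0.975) * clt_se

# Bootstrap approach: standard error and interval from the resampled means
mean_stat <- function(x, indices) mean(x[indices])
res <- boot(data = x, statistic = mean_stat, R = 2000)
boot_se <- sd(res$t)
boot_ci <- quantile(res$t, probs = c(0.025, 0.975))

c(clt = clt_se, bootstrap = boot_se)
rbind(clt = clt_ci, bootstrap = boot_ci)
```

For the mean, the two standard errors are expected to be close. The intervals can differ more, because the percentile interval is not forced to be symmetric around the sample mean.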

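# CLT example with random exponential data
set.seed(123)
n_samples <- 1000          # number of independent samples
n_obs_per_sample <- 30     # observations in each sample
exponential_data <- rexp(n_samples * n_obs_per_sample, rate = 0.1)  # population mean = 1 / rate = 10
# Each row of the matrix is one sample; rowMeans() returns one mean per sample
sample_means <- matrix(exponential_data, ncol = n_obs_per_sample) %>% rowMeans()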

# Plot the Distribution of Sample Means
hist(x = sample_means, 
     breaks = 30, 
     prob = TRUE, 
     main = "CLT Example", 
     xlab = "Sample Means"
     )

curve(expr = dnorm(x = x, 
                   mean = mean(sample_means), 
                   sd   = sd(sample_means)
                   ), 
      add = TRUE, 
      col = "blue", 
      lwd = 2
      )

summary(sample_means)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.570   8.714   9.921  10.026  11.233  16.840
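
The simulated sample means can also be checked against what the CLT predicts. For an exponential population with rate 0.1, the mean and standard deviation are both 1 / rate = 10, so the means of samples of size 30 should be centred near 10 with a standard deviation near 10 / sqrt(30). The sketch below is a self-contained illustration; the names `draws`, `theory_mean`, and `theory_se` are assumptions introduced here.

```r
# Regenerate the sample means under the same settings as the CLT example
set.seed(123)
rate <- 0.1
n_samples <- 1000
n_obs <- 30
draws <- matrix(rexp(n_samples * n_obs, rate = rate), ncol = n_obs)
sample_means <- rowMeans(draws)   # one mean per row (sample)

# Values predicted by theory for an exponential population
theory_mean <- 1 / rate                  # population mean
theory_se <- (1 / rate) / sqrt(n_obs)    # population sd divided by sqrt(n)

# Empirical versus theoretical centre and spread of the sample means
c(empirical = mean(sample_means), theory = theory_mean)
c(empirical = sd(sample_means), theory = theory_se)
```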