Erick Anangwe 07-13-2024
Let’s complete the following tasks:
# Load the packages used throughout (assumed here: tidyverse for data
# wrangling and plotting, tidymodels for modeling and resampling)
library(tidyverse)
library(tidymodels)
# Set a random seed value so we can obtain the same "random" results
set.seed(2023)
# Create a data frame/tibble named sim_dat
sim_dat <- tibble(
# Generate 20 random numbers between -5 and 5 for variable x1
x1 = runif(20, -5, 5),
# Generate 20 random numbers between 0 and 100 for variable x2
x2 = runif(20, 0, 100),
# Generate 20 random binary values (0 or 1) with a probability of 0.5 for variable x3
x3 = rbinom(20, 1, 0.5)
)
# Define the true coefficients for the linear model
b0 <- 2 # Intercept
b1 <- 0.25 # Coefficient for x1
b2 <- -0.5 # Coefficient for x2
b3 <- 1 # Coefficient for x3
sigma <- 1.5 # Standard deviation of the errors
# Generate random errors with mean 0 and standard deviation 1.5
errors <- rnorm(20, 0, sigma)
# Compute the response variable y using the true model and add it to the dataset
sim_dat <- sim_dat %>%
mutate(
y = b0 + b1*x1 + b2*x2 + b3*x3 + errors, # Calculate y based on the linear model and errors
x3 = case_when(
x3 == 0 ~ "No", # Convert 0 values of x3 to "No"
TRUE ~ "Yes" # Convert 1 values of x3 to "Yes"
)
)
# Display the first few rows of the dataset
head(sim_dat)
## # A tibble: 6 × 4
## x1 x2 x3 y
## <dbl> <dbl> <chr> <dbl>
## 1 -0.334 62.8 No -29.9
## 2 -1.65 43.8 Yes -17.4
## 3 -3.37 34.0 No -17.1
## 4 -1.04 93.3 No -45.0
## 5 -4.70 71.3 Yes -34.8
## 6 -3.79 63.1 Yes -30.1
The true (population-level) model is the baseline model without the added noise (errors). Based on the given code, the true model is a linear combination of the variables x1, x2, and x3 with their respective coefficients, plus an intercept term:
\[ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \]
Substituting the coefficients from the code:
\[ y = 2 + 0.25 x_1 - 0.5 x_2 + 1 x_3 \]
where:
\(b_0 = 2\) (intercept)
\(b_1 = 0.25\) (coefficient for x1)
\(b_2 = -0.5\) (coefficient for x2)
\(b_3 = 1\) (coefficient for x3)
The variable x3 is binary, represented as 0 (No) or 1 (Yes). The noise (errors) added in the simulation represents random variability around this true model.
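To make the true/fitted distinction concrete, here is a small sketch (not part of the original output; true_mean is a hypothetical helper) computing the noise-free mean response implied by these coefficients, using the objects defined above:
# The noise-free (true) mean response for each simulated observation.
# Note x3 has been recoded to "No"/"Yes", so convert it back to 0/1 here.
true_mean <- with(sim_dat, b0 + b1 * x1 + b2 * x2 + b3 * (x3 == "Yes"))
# y minus true_mean recovers the simulated errors
head(true_mean)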
Let’s create graphical visualizations of the relationships between all variable pairs (\(y\) with each \(x\), and each \(x\) with the others) and provide a brief summary of what we see/notice. To create these, we can use the GGally package, which extends ggplot2 to create pair plots easily.
# Load the GGally package, which provides ggpairs()
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from
##   +.gg   ggplot2
# Create a pair plot to visualize relationships between all variables
pair_plot <- ggpairs(sim_dat, columns = 1:4, aes(color = x3))
# Display the pair plot
pair_plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There appears to be a positive linear relationship between y and x1. This aligns with the true model where x1 has a positive coefficient of 0.25.
There is a negative linear relationship between y and x2. This is consistent with the true model where x2 has a negative coefficient of -0.5.
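As a quick numerical check (a sketch, not part of the original output), we can compute the marginal correlations; since x2 drives most of the variation in y, the correlation between y and x1 may look weaker than its coefficient alone would suggest:
# Marginal correlations between y and each numeric predictor
sim_dat %>%
  summarize(
    cor_y_x1 = cor(y, x1),
    cor_y_x2 = cor(y, x2)
  )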
First, we will fit an estimated multiple linear regression (MLR) model to our simulated data.
# Fit the multiple linear regression model
mlr_fit <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm") %>%
fit(y ~ x1 + x2 + x3, data = sim_dat)
# Include the confidence intervals for the estimated slope parameters
mlr_results <- tidy(mlr_fit, conf.int = TRUE)
# Display the results
mlr_results
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.94 0.756 2.57 2.06e- 2 0.339 3.55
## 2 x1 0.335 0.0891 3.76 1.72e- 3 0.146 0.524
## 3 x2 -0.495 0.0108 -45.7 2.22e-18 -0.518 -0.472
## 4 x3Yes 1.81 0.509 3.56 2.59e- 3 0.735 2.89
Let’s answer the following question: how close are the estimated coefficients and their confidence intervals to the true population-level values?
Detailed Evaluation:
Intercept (b0):
True value: 2
Estimated value: 1.942
Confidence Interval: [0.339, 3.545]
Assessment: The estimate (1.942) is very close to the true value (2), and the true value lies within the confidence interval. This indicates that the model is accurately estimating the intercept.
x1 (b1):
True value: 0.25
Estimated value: 0.335
Confidence Interval: [0.146, 0.524]
Assessment: The estimate (0.335) is slightly higher than the true value (0.25), but the true value is within the confidence interval. This suggests that the model is fairly accurate for x1.
x2 (b2):
True value: -0.5
Estimated value: -0.495
Confidence Interval: [-0.518, -0.472]
Assessment: The estimate (-0.495) is very close to the true value (-0.5), and the true value falls within the confidence interval. This shows high accuracy for x2.
x3 (b3):
True value: 1
Estimated value: 1.814
Confidence Interval: [0.735, 2.894]
Assessment: The estimate (1.814) is higher than the true value (1), but the true value is within the confidence interval. This indicates some variability but still reasonable accuracy for x3.
Conclusion:
Overall, the MLR model provides estimates that are quite close to the true population-level coefficients. Each true coefficient lies within the corresponding confidence interval, indicating that the model captures the underlying relationships despite the added noise. The standard errors and p-values also support the statistical significance of the estimates. This evaluation suggests the model is estimating the true coefficients reliably.
Now, bootstrapping treats our sample of data as a pseudo-population that we will create multiple new samples from. Each sample will be of the same size and same number of variables as the original (\(n = 20; p = 3\)). However, once a row has been sampled, it will be put back into the “population” (i.e., sampling with replacement). With computing power being relatively cheap now, we can do this a large number of times to build up \(B\) bootstrap samples. For example, we might draw \(B = 2,000\) bootstrap samples, each of \(n = 20\).
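As a conceptual sketch (not part of the analysis below; one_boot is a hypothetical name), a single bootstrap resample can be drawn by hand with base R before we let {rsample} automate the process:
# Draw one bootstrap resample by sampling row indices with replacement;
# some original rows will appear multiple times and others not at all
one_boot <- sim_dat[sample(nrow(sim_dat), replace = TRUE), ]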
# Set a random seed value so we can obtain the same "random" results
set.seed(631)
# Generate the 2000 bootstrap samples
boot_samps <- sim_dat %>%
bootstraps(times = 2000)
# View the structure of the bootstrap samples
boot_samps
## # Bootstrap sampling
## # A tibble: 2,000 × 2
## splits id
## <list> <chr>
## 1 <split [20/8]> Bootstrap0001
## 2 <split [20/6]> Bootstrap0002
## 3 <split [20/6]> Bootstrap0003
## 4 <split [20/6]> Bootstrap0004
## 5 <split [20/10]> Bootstrap0005
## 6 <split [20/10]> Bootstrap0006
## 7 <split [20/7]> Bootstrap0007
## 8 <split [20/6]> Bootstrap0008
## 9 <split [20/8]> Bootstrap0009
## 10 <split [20/6]> Bootstrap0010
## # ℹ 1,990 more rows
Now we need to fit a linear model to each bootstrap sample. We will use some of the handy iteration functions from {purrr}.
# Create a function that fits a fixed MLR model to one split dataset
fit_mlr_boots <- function(split) {
lm(y ~ x1 + x2 + x3, data = analysis(split))
}
# Fit the model to each split and store the information
# Also, obtain the tidy model information
boot_models <- boot_samps %>%
mutate(
model = map(splits, fit_mlr_boots),
coef_info = map(model, tidy)
)
# Unnest the coefficient information from the models
boots_coefs <- boot_models %>%
unnest(coef_info)
# Display the first few rows of the bootstrap coefficients
head(boots_coefs)
## # A tibble: 6 × 8
## splits id model term estimate std.error statistic p.value
## <list> <chr> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 <split [20/8]> Bootstrap0001 <lm> (Int… 1.78 0.615 2.90 1.05e- 2
## 2 <split [20/8]> Bootstrap0001 <lm> x1 0.199 0.0771 2.58 2.01e- 2
## 3 <split [20/8]> Bootstrap0001 <lm> x2 -0.489 0.00943 -51.8 2.98e-19
## 4 <split [20/8]> Bootstrap0001 <lm> x3Yes 1.14 0.487 2.35 3.20e- 2
## 5 <split [20/6]> Bootstrap0002 <lm> (Int… 3.05 0.780 3.92 1.23e- 3
## 6 <split [20/6]> Bootstrap0002 <lm> x1 0.234 0.102 2.29 3.60e- 2
We can then calculate the bootstrap intervals by obtaining the \(2.5^{th}\) and \(97.5^{th}\) percentiles. This is similar to a 95% confidence interval, as we are finding the values that contain the middle 95% of the bootstrap estimates. Note that we provide the level of significance (1 minus the confidence level): \(\alpha = 1 - 0.95 = 0.05\).
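For intuition, here is a sketch (not in the original output) that computes the same percentile interval for a single term by hand; int_pctl() below does this for every term at once:
# Percentile interval for the x1 coefficient, computed directly from
# the unnested bootstrap coefficients
boots_coefs %>%
  filter(term == "x1") %>%
  summarize(
    .lower = quantile(estimate, 0.025),
    .upper = quantile(estimate, 0.975)
  )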
# Calculate the 95% bootstrap confidence intervals
boot_int <- int_pctl(boot_models, statistics = coef_info, alpha = 0.05)
# Display the bootstrap confidence intervals
boot_int
## # A tibble: 4 × 6
## term .lower .estimate .upper .alpha .method
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 0.176 1.93 3.60 0.05 percentile
## 2 x1 0.190 0.340 0.506 0.05 percentile
## 3 x2 -0.516 -0.495 -0.470 0.05 percentile
## 4 x3Yes 0.892 1.83 2.83 0.05 percentile
I also like to visualize this information to get a sense of the variability of my estimates.
# Visualization of bootstrap estimates
ggplot(boots_coefs, aes(x = estimate)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
facet_wrap(~ term, scales = "free") +
geom_vline(data = boot_int, aes(xintercept = .lower), col = "blue", linetype = "dashed") +
geom_vline(data = boot_int, aes(xintercept = .upper), col = "blue", linetype = "dashed") +
labs(title = "Distribution of Bootstrap Estimates with 95% CI",
x = "Estimate",
y = "Frequency") +
theme_minimal()
This code creates a histogram of each term’s estimates from the bootstrap samples and overlays the 95% bootstrap interval bounds. The facet_wrap() function creates a separate panel for each term, and geom_vline() adds vertical lines at the lower and upper bounds of the bootstrap confidence intervals.
Let’s answer the following question: do the bootstrap intervals capture the true population-level values?
Comparison and Evaluation
Intercept:
Population: 2
Bootstrap CI: [0.18, 3.60]
Evaluation: The population value (2) falls within the bootstrap confidence interval, suggesting that the bootstrap interval provides a reasonable estimate for the intercept.
x1 Coefficient:
Population: 0.25
Bootstrap CI: [0.19, 0.51]
Evaluation: The population value (0.25) falls within the bootstrap confidence interval. The estimate from the bootstrap is close to the population value, indicating good accuracy.
x2 Coefficient:
Population: -0.5
Bootstrap CI: [-0.52, -0.47]
Evaluation: The population value (-0.5) falls within the bootstrap confidence interval, and the estimate from the bootstrap is very close to the population value, indicating high accuracy.
x3Yes Coefficient:
Population: 1
Bootstrap CI: [0.89, 2.83]
Evaluation: The population value (1) falls within the bootstrap confidence interval, but the interval is quite wide. The point estimate is higher than the population value, suggesting that there might be some variability, but it still captures the true effect.
Conclusion:
The bootstrap intervals show that the estimated coefficients for x1 and x2 are highly accurate, as the true population values are very close to the bootstrap point estimates and lie within the confidence intervals. The intercept and x3 coefficient also capture the true population values within their intervals, but with more variability, especially for x3. Overall, the bootstrap method provides reasonable and accurate estimates of the population-level coefficients, aligning well with the true model.
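To quantify that variability, here is a quick sketch (not in the original output) comparing the widths of the bootstrap intervals by term; the intercept and x3Yes have noticeably wider intervals than x1 and x2:
# Width of each bootstrap percentile interval
boot_int %>%
  mutate(width = .upper - .lower) %>%
  select(term, .estimate, width)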
Finally, let’s add the traditional 95% confidence intervals and the population slope values to the previous plot:
# Define the traditional 95% confidence intervals (values taken from mlr_results above)
traditional_ci <- tibble(
term = c("(Intercept)", "x1", "x2", "x3Yes"),
lower = c(0.3391323, 0.1458750, -0.5183913, 0.7347769),
upper = c(3.5452396, 0.5237101, -0.4724033, 2.8936610)
)
# Define population slope values
population_values <- tibble(
term = c("(Intercept)", "x1", "x2", "x3Yes"),
value = c(2, 0.25, -0.5, 1)
)
# Plotting the bootstrap coefficients with additional elements
ggplot(boots_coefs, aes(x = estimate)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
facet_wrap(~ term, scales = "free") +
geom_vline(data = boot_int, aes(xintercept = .lower), col = "blue", linetype = "dashed") +
geom_vline(data = boot_int, aes(xintercept = .upper), col = "blue", linetype = "dashed") +
geom_vline(data = traditional_ci, aes(xintercept = lower), col = "orange", linetype = "solid") +
geom_vline(data = traditional_ci, aes(xintercept = upper), col = "orange", linetype = "solid") +
geom_vline(data = population_values, aes(xintercept = value), col = "red", linetype = "dotted", linewidth = 1.2) +
labs(title = "Bootstrap Estimates with Traditional and Population Values",
x = "Estimate",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
strip.text = element_text(size = 14),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10))