Erick Anangwe 07-13-2024
Let’s complete the following tasks:
# Load the packages used throughout (assumed here: tidyverse for data
# wrangling and plotting, tidymodels for modeling and resampling)
library(tidyverse)
library(tidymodels)
# Set a random seed value so we can obtain the same "random" results
set.seed(2023)
# Create a data frame/tibble named sim_dat
sim_dat <- tibble(
# Generate 20 random numbers between -5 and 5 for variable x1
x1 = runif(20, -5, 5),
# Generate 20 random numbers between 0 and 100 for variable x2
x2 = runif(20, 0, 100),
# Generate 20 random binary values (0 or 1) with a probability of 0.5 for variable x3
x3 = rbinom(20, 1, 0.5)
)
# Define the true coefficients for the linear model
b0 <- 2 # Intercept
b1 <- 0.25 # Coefficient for x1
b2 <- -0.5 # Coefficient for x2
b3 <- 1 # Coefficient for x3
sigma <- 1.5 # Standard deviation of the errors
# Generate random errors with mean 0 and standard deviation 1.5
errors <- rnorm(20, 0, sigma)
# Compute the response variable y using the true model and add it to the dataset
sim_dat <- sim_dat %>%
mutate(
y = b0 + b1*x1 + b2*x2 + b3*x3 + errors, # Calculate y based on the linear model and errors
x3 = case_when(
x3 == 0 ~ "No", # Convert 0 values of x3 to "No"
TRUE ~ "Yes" # Convert 1 values of x3 to "Yes"
)
)
# Display the first few rows of the dataset
head(sim_dat)
## # A tibble: 6 × 4
## x1 x2 x3 y
## <dbl> <dbl> <chr> <dbl>
## 1 -0.334 62.8 No -29.9
## 2 -1.65 43.8 Yes -17.4
## 3 -3.37 34.0 No -17.1
## 4 -1.04 93.3 No -45.0
## 5 -4.70 71.3 Yes -34.8
## 6 -3.79 63.1 Yes -30.1
The true (population-level) model is the baseline model without the added noise (errors). Based on the given code, the true model is a linear combination of the variables x1, x2, and x3 with their respective coefficients, plus an intercept term:
\[ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \]
Substituting the coefficients from the code:
\[ y = 2 + 0.25 x_1 - 0.5 x_2 + 1 x_3 \]
where:
\(b_0 = 2\) (intercept)
\(b_1 = 0.25\) (coefficient for x1)
\(b_2 = -0.5\) (coefficient for x2)
\(b_3 = 1\) (coefficient for x3)
The variable x3 is binary, represented as 0 (No) or 1 (Yes). The noise (errors) added in the simulation represents random variability around this true model.
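To make the true/fitted distinction concrete, here is a small sketch (not part of the original output; true_mean is a hypothetical helper) computing the noise-free mean response implied by these coefficients, using the objects defined above:
# The noise-free (true) mean response for each simulated observation.
# Note x3 has been recoded to "No"/"Yes", so convert it back to 0/1 here.
true_mean <- with(sim_dat, b0 + b1 * x1 + b2 * x2 + b3 * (x3 == "Yes"))
# y minus true_mean recovers the simulated errors
head(true_mean)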
Let’s create graphical visualizations of the relationships between all variable pairs (\(y\) with each \(x\), and each \(x\) with the others) and provide a brief summary of what we see/notice. To create these, we can use the GGally package, which extends ggplot2 to create pair plots easily.
# Load the GGally package, which provides ggpairs()
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from
##   +.gg   ggplot2
# Create a pair plot to visualize relationships between all variables
pair_plot <- ggpairs(sim_dat, columns = 1:4, aes(color = x3))
# Display the pair plot
pair_plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There appears to be a positive linear relationship between y and x1. This aligns with the true model where x1 has a positive coefficient of 0.25.
There is a negative linear relationship between y and x2. This is consistent with the true model where x2 has a negative coefficient of -0.5.
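As a quick numerical check (a sketch, not part of the original output), we can compute the marginal correlations; since x2 drives most of the variation in y, the correlation between y and x1 may look weaker than its coefficient alone would suggest:
# Marginal correlations between y and each numeric predictor
sim_dat %>%
  summarize(
    cor_y_x1 = cor(y, x1),
    cor_y_x2 = cor(y, x2)
  )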
First, we will fit an estimated multiple linear regression (MLR) model to our simulated data.
# Fit the multiple linear regression model
mlr_fit <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm") %>%
fit(y ~ x1 + x2 + x3, data = sim_dat)
# Include the confidence intervals for the estimated slope parameters
mlr_results <- tidy(mlr_fit, conf.int = TRUE)
# Display the results
mlr_results
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.94 0.756 2.57 2.06e- 2 0.339 3.55
## 2 x1 0.335 0.0891 3.76 1.72e- 3 0.146 0.524
## 3 x2 -0.495 0.0108 -45.7 2.22e-18 -0.518 -0.472
## 4 x3Yes 1.81 0.509 3.56 2.59e- 3 0.735 2.89
Let’s answer the following question: how close are the estimated coefficients and their confidence intervals to the true population-level values?
Detailed Evaluation:
Intercept (b0):
True value: 2
Estimated value: 1.942
Confidence Interval: [0.339, 3.545]
Assessment: The estimate (1.942) is very close to the true value (2), and the true value lies within the confidence interval. This indicates that the model is accurately estimating the intercept.
x1 (b1):
True value: 0.25
Estimated value: 0.335
Confidence Interval: [0.146, 0.524]
Assessment: The estimate (0.335) is slightly higher than the true value (0.25), but the true value is within the confidence interval. This suggests that the model is fairly accurate for x1.
x2 (b2):
True value: -0.5
Estimated value: -0.495
Confidence Interval: [-0.518, -0.472]
Assessment: The estimate (-0.495) is very close to the true value (-0.5), and the true value falls within the confidence interval. This shows high accuracy for x2.
x3 (b3):
True value: 1
Estimated value: 1.814
Confidence Interval: [0.735, 2.894]
Assessment: The estimate (1.814) is higher than the true value (1), but the true value is within the confidence interval. This indicates some variability but still reasonable accuracy for x3.
Conclusion:
Overall, the MLR model provides estimates that are quite close to the true population-level coefficients. Each true coefficient lies within the corresponding confidence interval, indicating that the model captures the underlying relationships despite the added noise. The standard errors and p-values also support the statistical significance of the estimates. This evaluation suggests the model is estimating the true coefficients reliably.
Now, bootstrapping treats our sample of data as a pseudo-population that we will create multiple new samples from. Each sample will be of the same size and same number of variables as the original (\(n = 20; p = 3\)). However, once a row has been sampled, it will be put back into the “population” (i.e., sampling with replacement). With computing power being relatively cheap now, we can do this a large number of times to build up \(B\) bootstrap samples. For example, we might draw \(B = 2,000\) bootstrap samples, each of \(n = 20\).
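As a conceptual sketch (not part of the analysis below; one_boot is a hypothetical name), a single bootstrap resample can be drawn by hand with base R before we let {rsample} automate the process:
# Draw one bootstrap resample by sampling row indices with replacement;
# some original rows will appear multiple times and others not at all
one_boot <- sim_dat[sample(nrow(sim_dat), replace = TRUE), ]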
# Set a random seed value so we can obtain the same "random" results
set.seed(631)
# Generate the 2000 bootstrap samples
boot_samps <- sim_dat %>%
bootstraps(times = 2000)
# View the structure of the bootstrap samples
boot_samps
## # Bootstrap sampling
## # A tibble: 2,000 × 2
## splits id
## <list> <chr>
## 1 <split [20/8]> Bootstrap0001
## 2 <split [20/6]> Bootstrap0002
## 3 <split [20/6]> Bootstrap0003
## 4 <split [20/6]> Bootstrap0004
## 5 <split [20/10]> Bootstrap0005
## 6 <split [20/10]> Bootstrap0006
## 7 <split [20/7]> Bootstrap0007
## 8 <split [20/6]> Bootstrap0008
## 9 <split [20/8]> Bootstrap0009
## 10 <split [20/6]> Bootstrap0010
## # ℹ 1,990 more rows
Now we need to fit a linear model to each bootstrap sample. We will use some of the handy iteration functions from {purrr}.
# Create a function that fits a fixed MLR model to one split dataset
fit_mlr_boots <- function(split) {
lm(y ~ x1 + x2 + x3, data = analysis(split))
}
# Fit the model to each split and store the information
# Also, obtain the tidy model information
boot_models <- boot_samps %>%
mutate(
model = map(splits, fit_mlr_boots),
coef_info = map(model, tidy)
)
# Unnest the coefficient information from the models
boots_coefs <- boot_models %>%
unnest(coef_info)
# Display the first few rows of the bootstrap coefficients
head(boots_coefs)
## # A tibble: 6 × 8
## splits id model term estimate std.error statistic p.value
## <list> <chr> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 <split [20/8]> Bootstrap0001 <lm> (Int… 1.78 0.615 2.90 1.05e- 2
## 2 <split [20/8]> Bootstrap0001 <lm> x1 0.199 0.0771 2.58 2.01e- 2
## 3 <split [20/8]> Bootstrap0001 <lm> x2 -0.489 0.00943 -51.8 2.98e-19
## 4 <split [20/8]> Bootstrap0001 <lm> x3Yes 1.14 0.487 2.35 3.20e- 2
## 5 <split [20/6]> Bootstrap0002 <lm> (Int… 3.05 0.780 3.92 1.23e- 3
## 6 <split [20/6]> Bootstrap0002 <lm> x1 0.234 0.102 2.29 3.60e- 2
We can then calculate the bootstrap intervals by obtaining the \(2.5^{th}\) and \(97.5^{th}\) percentiles. This is similar to a 95% confidence interval, as we are finding the values that contain the middle 95% of the bootstrap estimates. Note that we provide the level of significance (1 minus the confidence level): \(\alpha = 1 - 0.95 = 0.05\).
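For intuition, here is a sketch (not in the original output) that computes the same percentile interval for a single term by hand; int_pctl() below does this for every term at once:
# Percentile interval for the x1 coefficient, computed directly from
# the unnested bootstrap coefficients
boots_coefs %>%
  filter(term == "x1") %>%
  summarize(
    .lower = quantile(estimate, 0.025),
    .upper = quantile(estimate, 0.975)
  )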
# Calculate the 95% bootstrap confidence intervals
boot_int <- int_pctl(boot_models, statistics = coef_info, alpha = 0.05)
# Display the bootstrap confidence intervals
boot_int
## # A tibble: 4 × 6
## term .lower .estimate .upper .alpha .method
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 0.176 1.93 3.60 0.05 percentile
## 2 x1 0.190 0.340 0.506 0.05 percentile
## 3 x2 -0.516 -0.495 -0.470 0.05 percentile
## 4 x3Yes 0.892 1.83 2.83 0.05 percentile
I also like to visualize this information to get a sense of the variability of my estimates.
# Visualization of bootstrap estimates
ggplot(boots_coefs, aes(x = estimate)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
facet_wrap(~ term, scales = "free") +
geom_vline(data = boot_int, aes(xintercept = .lower), col = "blue", linetype = "dashed") +
geom_vline(data = boot_int, aes(xintercept = .upper), col = "blue", linetype = "dashed") +
labs(title = "Distribution of Bootstrap Estimates with 95% CI",
x = "Estimate",
y = "Frequency") +
theme_minimal()
This code creates a histogram of each term’s estimates from the bootstrap samples and overlays the 95% bootstrap interval bounds. The facet_wrap() function creates a separate panel for each term, and geom_vline() adds vertical lines at the lower and upper bounds of the bootstrap confidence intervals.
Let’s answer the following question: do the bootstrap intervals capture the true population-level values?
Comparison and Evaluation
Intercept:
Population: 2
Bootstrap CI: [0.18, 3.60]
Evaluation: The population value (2) falls within the bootstrap confidence interval, suggesting that the bootstrap interval provides a reasonable estimate for the intercept.
x1 Coefficient:
Population: 0.25
Bootstrap CI: [0.19, 0.51]
Evaluation: The population value (0.25) falls within the bootstrap confidence interval. The estimate from the bootstrap is close to the population value, indicating good accuracy.
x2 Coefficient:
Population: -0.5
Bootstrap CI: [-0.52, -0.47]
Evaluation: The population value (-0.5) falls within the bootstrap confidence interval, and the estimate from the bootstrap is very close to the population value, indicating high accuracy.
x3Yes Coefficient:
Population: 1
Bootstrap CI: [0.89, 2.83]
Evaluation: The population value (1) falls within the bootstrap confidence interval, but the interval is quite wide. The point estimate is higher than the population value, suggesting that there might be some variability, but it still captures the true effect.
Conclusion:
The bootstrap intervals show that the estimated coefficients for x1 and x2 are highly accurate, as the true population values are very close to the bootstrap point estimates and lie within the confidence intervals. The intercept and x3 coefficient also capture the true population values within their intervals, but with more variability, especially for x3. Overall, the bootstrap method provides reasonable and accurate estimates of the population-level coefficients, aligning well with the true model.
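To quantify that variability, here is a quick sketch (not in the original output) comparing the widths of the bootstrap intervals by term; the intercept and x3Yes have noticeably wider intervals than x1 and x2:
# Width of each bootstrap percentile interval
boot_int %>%
  mutate(width = .upper - .lower) %>%
  select(term, .estimate, width)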
Finally, let’s add the traditional 95% confidence intervals and the population slope values to the previous plot:
# Define the traditional 95% confidence intervals (values taken from mlr_results above)
traditional_ci <- tibble(
term = c("(Intercept)", "x1", "x2", "x3Yes"),
lower = c(0.3391323, 0.1458750, -0.5183913, 0.7347769),
upper = c(3.5452396, 0.5237101, -0.4724033, 2.8936610)
)
# Define population slope values
population_values <- tibble(
term = c("(Intercept)", "x1", "x2", "x3Yes"),
value = c(2, 0.25, -0.5, 1)
)
# Plotting the bootstrap coefficients with additional elements
ggplot(boots_coefs, aes(x = estimate)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
facet_wrap(~ term, scales = "free") +
geom_vline(data = boot_int, aes(xintercept = .lower), col = "blue", linetype = "dashed") +
geom_vline(data = boot_int, aes(xintercept = .upper), col = "blue", linetype = "dashed") +
geom_vline(data = traditional_ci, aes(xintercept = lower), col = "orange", linetype = "solid") +
geom_vline(data = traditional_ci, aes(xintercept = upper), col = "orange", linetype = "solid") +
geom_vline(data = population_values, aes(xintercept = value), col = "red", linetype = "dotted", linewidth = 1.2) +
labs(title = "Bootstrap Estimates with Traditional and Population Values",
x = "Estimate",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
strip.text = element_text(size = 14),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10))