Discussion 1: Bootstrap intervals

The power of the bootstrap is to provide good estimates of the standard errors for statistics that are not as easy to work with. In turn, they can be used to provide confidence intervals (CIs) where no theoretical formula exists otherwise.

From the videos, Dr. Turner provided a general framework for providing CI’s using the bootstrap procedure. For example to provide a bootstrap interval for the median, there are two main steps. The first is to write a function, such as the following, that will produce the statistic of interest of any given bootstrap sample:

med.fun<-function(x,i){median(x[i])}

ANSWER: a. med.boot <- boot(data, med.fun, R = 2000) # Example for median

ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = data, statistic = med.fun, R = 2000) Bootstrap Statistics : original bias std. error t1* 8.435969 0.05062841 0.5347024

This creates a sampling distribution of the statistic, mimicking repeated sampling from the population.

  1. boot.ci(med.boot, conf = 0.95, type = c(“perc”, “basic”, “bca”)) Intervals : Level Basic Percentile BCa
    95% ( 7.474, 9.325 ) ( 7.547, 9.398 ) ( 7.489, 9.379 )
    Calculations and Intervals on Original Scale

This provides different types of intervals (percentile, basic, BCa) with varying properties and assumptions.

A. Compute the following lines of code using the generic data set provided. Briefly explain what is the role of \(i\) in the previous function is.

generic.data<-c(18,10,16,14)
median(generic.data)
index<-c(4,3,1,3)
generic.data[index]
median(generic.data[index])
med.fun(generic.data,index)

ANSWER:

Breakdown of the code and the role of i: Code Output:

median(generic.data): 15 generic.data[index]: 14 16 18 16 median(generic.data[index]): 16 med.fun(generic.data, index): 16

Role of i: Resampling Index: i acts as an index vector, determining which observations from the original dataset are included in a resampled bootstrap sample. Function Argument: It’s passed as an argument to the med.fun function, enabling the function to calculate the median of a specific bootstrap sample created using the indices in i. Bootstrapping Mechanism: The boot function internally generates these index vectors to create multiple bootstrap samples, each with a potentially different arrangement of observations.

Example: index = c(4, 3, 1, 3) specifies a bootstrap sample containing the 4th, 3rd, 1st, and 3rd (repeated) observations from generic.data. med.fun(generic.data, index) calculates the median of this specific sample, resulting in 16.

Key Point:i facilitates the resampling process, which is a core concept in bootstrapping to estimate variability and create confidence intervals.

The second step is to compute the bootstrap sampling distribution using your statistic and compute the interval. The following example applies the bootstrap to a sample that is skewed right:

library(boot)
data<-rgamma(100,2,scale=5)
hist(data)

#mean(data)
#median(data)

#Performing Bootstap
med.boot<-boot(data,med.fun,R=2000)
#sd(med.boot$t) #SE from bootstrap

#Obtain CI's : Basic is "Empirical"
boot.ci(med.boot,conf=0.95,type=c("perc","basic","bca"))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = med.boot, conf = 0.95, type = c("perc", "basic", 
##     "bca"))
## 
## Intervals : 
## Level      Basic              Percentile            BCa          
## 95%   ( 6.213, 10.971 )   ( 6.238, 10.997 )   ( 6.202, 10.933 )  
## Calculations and Intervals on Original Scale

B. Suppose our main interest was the variability of the data set. Modify the previous code to make a confidence interval for the standard deviation and the interquartile range (IQR) of the data set. The functions you can use are sd and IQR

ANSWER: # Confidence interval for standard deviation sd.boot <- boot(data, function(x, i) sd(x[i]), R = 2000) boot.ci(sd.boot, conf = 0.95, type = c(“perc”, “basic”, “bca”))
# Generate the gamma-distributed data data <- rgamma(100, 2, scale = 5) # Confidence interval for IQR iqr.boot <- boot(data, function(x, i) IQR(x[i]), R = 2000) boot.ci(iqr.boot, conf = 0.95, type = c(“perc”, “basic”, “bca”))

Custom function to calculate standard deviation

sd_fun <- function(data, indices) { sample_data <- data[indices] sd(sample_data) }

Calculate the standard deviation using bootstrap

sd.boot <- boot(data, sd_fun, R = 2000)

Obtain CI’s for standard deviation

sd.ci <- boot.ci(sd.boot, conf = 0.95, type = c(“perc”, “basic”, “bca”))

Plot the log-transformed data

hist(log_y, main=“Histogram of Log-Transformed Response Variable”, xlab=“Log(Y)”)

The histogram of the log-transformed response variable indicates that the transformation did not correct the issue of non-constant variance. The data still shows skewness, which suggests that other methods may be needed to stabilize the variance.

  1. The following code chunk provides three CIs for the slope and intercept. Compare the intervals to each other, and provide an explanation of which one you would prefer.
library(lmboot)
## Warning: package 'lmboot' was built under R version 4.3.2
#PAIRED BOOTSTRAP
boot.p<-paired.boot(y~x,B=1000,seed=1234)
t(apply(boot.p$bootEstParam,2,quantile,probs=c(.025,.975)))
##                 2.5%    97.5%
## (Intercept) 4.263864 26.64593
## x           3.883905  5.38886
#RESITUAL BOOTSTRAP
boot.res<-residual.boot(y~x,B=1000,seed=1234)
t(apply(boot.res$bootEstParam,2,quantile,probs=c(.025,.975)))
##                 2.5%     97.5%
## (Intercept) 6.910136 24.648401
## x           3.897229  5.375196
#Traditional MLR
confint(lm(y~x))
##                2.5 %    97.5 %
## (Intercept) 6.605142 24.661904
## x           3.886191  5.355665

ANSWER:

The confidence intervals for the intercept and slope from the paired bootstrap, residual bootstrap, and traditional multiple linear regression (MLR) are as follows:

Paired Bootstrap Confidence Intervals: Intercept: 4.26 to 26.65 Slope: 3.88 to 5.39

Residual Bootstrap Confidence Intervals: Intercept: 6.91 to 24.65 Slope: 3.90 to 5.38

Traditional MLR Confidence Intervals: Intercept: 6.61 to 24.66 Slope: 3.89 to 5.36

The intervals for the slope are quite similar across all three methods, which suggests that regardless of the method, our estimates for the slope are consistent. The intercept, however, shows a wider range in the paired bootstrap method. This might be due to the paired bootstrap method taking into account the variability in both x and y through resampling pairs, which could reflect a greater uncertainty when the variance is not constant.

If I were to choose which confidence interval method to prefer, it would depend on the specifics of my data and the assumptions I’m willing to make. Given the similarity in slope estimates, I would prefer the method that aligns best with the underlying distribution of my data. If the assumptions of the traditional MLR are met, I might lean towards it for its simplicity and the interpretability of its results. However, if I had concerns about the assumptions, particularly about the normality and independence of errors, I might prefer the bootstrap methods, which are more robust to such violations.

Discussion 3: Boostrap Regression When Incorrectly Modeling the Trend

It is important to note that confidence intervals for regression coefficients are valid only when the assumptions for the errors are met. However, another important key to confidence intervals validity is that the trend being modeled is correctly specified.

The following code produces a data set that follows a quadratic trend with constant variance.

set.seed(1234)
x<-runif(50,5,20)
y<-c()
for (i in 1:length(x)){
y[i]<-810-20*x[i]+4*(x[i]-8)^2+rnorm(1,mean=0,sd=40)
}

plot(x,y,main="Original Sample")

Like the previous exercise, we can compare the paired bootstrap interval to the traditional intervals when we correctly model the quadratic trend. Note that the intervals for all coefficients are pretty much in alignment.

I want you to investigate what happens if you incorrectly specify the model as only linear versus quadratic. Make note of any difference in the intervals between the bootstrap and classic intervals for intercept and slope. Which CI method would you prefer in this case if your goal was to provide an estimate of the linear slope?

library(lmboot)

#PAIRED BOOTSTRAP
boot.p<-paired.boot(y~poly(x,2),B=1000,seed=1234)
t(apply(boot.p$bootEstParam,2,quantile,probs=c(.025,.975)))
##                 2.5%    97.5%
## (Intercept) 675.6235 698.6993
## poly(x, 2)1 349.9385 516.0560
## poly(x, 2)2 289.2160 435.2738
#Traditional MLR
confint(lm(y~poly(x,2)))
##                2.5 %   97.5 %
## (Intercept) 674.3551 699.3721
## poly(x, 2)1 342.9952 519.8923
## poly(x, 2)2 274.9657 451.8629

ANSWER: In the R output, the confidence intervals for the intercept and the quadratic terms are given from both paired bootstrap and traditional multiple linear regression (MLR) methods. Here’s a brief comparison:

Paired Bootstrap Confidence Intervals: Intercept: 675.62 to 698.70 poly(x, 2)1 (Linear Term): 349.94 to 516.06 poly(x, 2)2 (Quadratic Term): 289.22 to 435.27

Traditional MLR Confidence Intervals: Intercept: 674.36 to 699.37 poly(x, 2)1 (Linear Term): 342.99 to 519.89 poly(x, 2)2 (Quadratic Term): 274.97 to 451.86

The intervals are quite close, indicating that both methods give similar estimates for the coefficients of a quadratic model.

The plot confirms a quadratic relationship between x and y. The paired bootstrap and traditional MLR show similar confidence intervals for the coefficients, which means both methods work well for the correctly specified quadratic model. If I used a linear model for this quadratic trend, the confidence intervals for the slope would be incorrect, as they wouldn’t account for the curvature in the data. For estimating a linear slope in the presence of a quadratic relationship, it’s best to communicate that the linear estimate is conditional on the quadratic term. In this scenario, even though both methods provide similar intervals, I’d communicate the limitations of using a linear model for a quadratic trend.