The power of the bootstrap is that it provides good estimates of the standard errors of statistics that are not easy to work with analytically. These estimates can in turn be used to construct confidence intervals (CIs) where no theoretical formula exists.
In the videos, Dr. Turner provided a general framework for constructing CIs using the bootstrap procedure. For example, to produce a bootstrap interval for the median, there are two main steps. The first is to write a function, such as the following, that produces the statistic of interest for any given bootstrap sample:
med.fun<-function(x,i){median(x[i])}
ANSWER:
med.boot<-boot(data,med.fun,R=2000) # Example for the median

## ORDINARY NONPARAMETRIC BOOTSTRAP
##
## Call:
## boot(data = data, statistic = med.fun, R = 2000)
##
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 8.435969 0.05062841   0.5347024
This creates a sampling distribution of the statistic, mimicking repeated sampling from the population. From it, the boot.ci function can produce different types of intervals (percentile, basic, BCa), each with its own properties and assumptions.
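As a sanity check on where these intervals come from, the short sketch below (assuming the med.boot object from above exists) plots the bootstrap distribution and computes a 95% percentile interval by hand; it should roughly match the "perc" interval that boot.ci reports.

#Visualize the bootstrap sampling distribution of the median
hist(med.boot$t,main="Bootstrap distribution of the median")
#95% percentile interval computed directly from the bootstrap replicates
quantile(med.boot$t,probs=c(0.025,0.975))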
A. Compute the following lines of code using the generic data set provided. Briefly explain what the role of \(i\) in the previous function is.
generic.data<-c(18,10,16,14)
median(generic.data)
index<-c(4,3,1,3)
generic.data[index]
median(generic.data[index])
med.fun(generic.data,index)
ANSWER:
Breakdown of the code and the role of i:

Code output:

- median(generic.data): 15
- generic.data[index]: 14 16 18 16
- median(generic.data[index]): 16
- med.fun(generic.data, index): 16

Role of i:

- Resampling index: i acts as an index vector, determining which observations from the original dataset are included in a resampled bootstrap sample.
- Function argument: it is passed to med.fun so the function can calculate the median of the specific bootstrap sample built from the indices in i.
- Bootstrapping mechanism: the boot function internally generates these index vectors to create multiple bootstrap samples, each with a potentially different arrangement of observations; a sketch of this mechanism follows the example below.
Example: index = c(4, 3, 1, 3) specifies a bootstrap sample containing the 4th, 3rd, 1st, and 3rd (repeated) observations from generic.data. med.fun(generic.data, index) calculates the median of this specific sample, resulting in 16.
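To mimic what boot does internally, one such index vector can be drawn by hand using sampling with replacement (the seed here is arbitrary):

set.seed(1)
i<-sample(seq_along(generic.data),replace=TRUE) #one random bootstrap index vector
i
med.fun(generic.data,i) #median of the corresponding bootstrap sample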
The second step is to compute the bootstrap sampling distribution using your statistic and compute the interval. The following example applies the bootstrap to a sample that is skewed right:
library(boot)
data<-rgamma(100,2,scale=5)
hist(data)
#mean(data)
#median(data)
#Performing Bootstrap
med.boot<-boot(data,med.fun,R=2000)
#sd(med.boot$t) #SE from bootstrap
#Obtain CIs: Basic is "Empirical"
boot.ci(med.boot,conf=0.95,type=c("perc","basic","bca"))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = med.boot, conf = 0.95, type = c("perc", "basic",
## "bca"))
##
## Intervals :
## Level Basic Percentile BCa
## 95% ( 6.213, 10.971 ) ( 6.238, 10.997 ) ( 6.202, 10.933 )
## Calculations and Intervals on Original Scale
B. Suppose our main interest was the variability of the data set. Modify the previous code to make a confidence interval for the standard deviation and the interquartile range (IQR) of the data set. The functions you can use are sd and IQR.
ANSWER:

#Confidence interval for the standard deviation
sd.fun<-function(x,i){sd(x[i])}
sd.boot<-boot(data,sd.fun,R=2000)
boot.ci(sd.boot,conf=0.95,type=c("perc","basic","bca"))

#Confidence interval for the IQR
iqr.fun<-function(x,i){IQR(x[i])}
iqr.boot<-boot(data,iqr.fun,R=2000)
boot.ci(iqr.boot,conf=0.95,type=c("perc","basic","bca"))

Note that data is not regenerated here; the intervals are computed from the same sample used above, so they can be compared with the classic interval in part C.
C. The next code provides the classic confidence interval for a population standard deviation assuming the data is normally distributed. Run the code, and compare the interval to your bootstrap interval. Which one is wider?
n<-length(data)
Lower<-sqrt((n-1)*var(data)/qchisq(.05/2,n-1,lower.tail=F))
Upper<-sqrt((n-1)*var(data)/qchisq(.05/2,n-1,lower.tail=T))
c(Lower,Upper)
## [1] 6.304388 8.341224
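For reference, the code above implements the classic chi-square interval for a population standard deviation: with sample variance \(s^2\) and \(n-1\) degrees of freedom,

\[
\left(\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}}},\ \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}}\right),
\]

where \(\chi^2_{\alpha/2,\,n-1}\) is the upper \(\alpha/2\) quantile of the chi-square distribution.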
ANSWER: The classic confidence interval for the population standard deviation, assuming normality, is approximately (6.304, 8.341).

Comparing this interval with the bootstrap interval obtained earlier for the standard deviation:

- Bootstrap interval ("basic" type): approximately (5.917, 8.682)
- Classic confidence interval: approximately (6.304, 8.341)

The bootstrap interval is wider than the classic interval. This is expected: the classic interval rests on the assumption that the data are normally distributed, and the chi-square interval for a standard deviation is very sensitive to that assumption. Since the data come from a skewed gamma distribution, the normal-theory interval likely understates the uncertainty, whereas the bootstrap interval resamples directly from the data without distributional assumptions and is therefore the more trustworthy of the two here.
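A quick width check using the values reported above (these come from one particular run and will vary with the random seed):

#Interval widths: bootstrap (basic) vs. classic
8.682-5.917 #bootstrap width, about 2.77
8.341-6.304 #classic width, about 2.04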
The videos discussed two approaches for the bootstrap in a linear regression setting. The first approach is to resample pairs, and the second approach is to resample residuals.
The following data set exhibits a situation where constant variance is not met.
set.seed(1234)
x<-runif(100,5,20)
sds<- (x-25)^2/15
y<-c()
for (i in 1:length(x)){
y[i]<-10+5*x[i]+rnorm(1,mean=0,sd=sds[i])
}
plot(x,y,main="Original Sample")
log_y<-log(y)
hist(log_y,main="Histogram of Log-Transformed Response Variable",xlab="Log(Y)")
The histogram of the log-transformed response variable indicates that the transformation did not fix the problem: the data are still skewed, which suggests that other methods are needed to deal with the non-constant variance.
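A more direct diagnostic for the non-constant variance is a residual plot from the linear fit on the original scale; a minimal sketch, assuming x and y as generated above:

#Residuals vs. fitted values for the simple linear fit
fit<-lm(y~x)
plot(fitted(fit),resid(fit),xlab="Fitted values",ylab="Residuals",main="Residuals vs Fitted")
abline(h=0,lty=2) #a funnel shape indicates non-constant variance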
library(lmboot)
## Warning: package 'lmboot' was built under R version 4.3.2
#PAIRED BOOTSTRAP
boot.p<-paired.boot(y~x,B=1000,seed=1234)
t(apply(boot.p$bootEstParam,2,quantile,probs=c(.025,.975)))
## 2.5% 97.5%
## (Intercept) 4.263864 26.64593
## x 3.883905 5.38886
#RESIDUAL BOOTSTRAP
boot.res<-residual.boot(y~x,B=1000,seed=1234)
t(apply(boot.res$bootEstParam,2,quantile,probs=c(.025,.975)))
## 2.5% 97.5%
## (Intercept) 6.910136 24.648401
## x 3.897229 5.375196
#Traditional MLR
confint(lm(y~x))
## 2.5 % 97.5 %
## (Intercept) 6.605142 24.661904
## x 3.886191 5.355665
ANSWER:
The confidence intervals for the intercept and slope from the paired bootstrap, residual bootstrap, and traditional multiple linear regression (MLR) are as follows:

- Paired bootstrap: intercept 4.26 to 26.65; slope 3.88 to 5.39
- Residual bootstrap: intercept 6.91 to 24.65; slope 3.90 to 5.38
- Traditional MLR: intercept 6.61 to 24.66; slope 3.89 to 5.36
The intervals for the slope are quite similar across all three methods, suggesting the slope estimate is stable regardless of approach. The intercept, however, has a noticeably wider interval under the paired bootstrap. This makes sense: by resampling (x, y) pairs, the paired bootstrap captures the variability in both x and y, and so reflects the extra uncertainty introduced by the non-constant variance.
For this data set I would prefer the paired bootstrap interval. Constant variance is clearly violated here, and both the traditional intervals and the residual bootstrap depend on that assumption (the residual bootstrap reshuffles residuals as if they were exchangeable, which they are not when the error variance changes with x). The paired bootstrap makes no such assumption, so its intervals remain valid in this setting. If the traditional assumptions were met, I might lean toward the classic intervals for their simplicity and interpretability.
It is important to note that confidence intervals for regression coefficients are valid only when the assumptions about the errors are met. Another key requirement for their validity is that the trend being modeled is correctly specified.
The following code produces a data set that follows a quadratic trend with constant variance.
set.seed(1234)
x<-runif(50,5,20)
y<-c()
for (i in 1:length(x)){
y[i]<-810-20*x[i]+4*(x[i]-8)^2+rnorm(1,mean=0,sd=40)
}
plot(x,y,main="Original Sample")
Like the previous exercise, we can compare the paired bootstrap interval to the traditional intervals when we correctly model the quadratic trend. Note that the intervals for all coefficients are pretty much in alignment.
I want you to investigate what happens if you incorrectly specify the model as only linear versus quadratic. Make note of any difference in the intervals between the bootstrap and classic intervals for intercept and slope. Which CI method would you prefer in this case if your goal was to provide an estimate of the linear slope?
library(lmboot)
#PAIRED BOOTSTRAP
boot.p<-paired.boot(y~poly(x,2),B=1000,seed=1234)
t(apply(boot.p$bootEstParam,2,quantile,probs=c(.025,.975)))
## 2.5% 97.5%
## (Intercept) 675.6235 698.6993
## poly(x, 2)1 349.9385 516.0560
## poly(x, 2)2 289.2160 435.2738
#Traditional MLR
confint(lm(y~poly(x,2)))
## 2.5 % 97.5 %
## (Intercept) 674.3551 699.3721
## poly(x, 2)1 342.9952 519.8923
## poly(x, 2)2 274.9657 451.8629
ANSWER: The R output gives the confidence intervals for the intercept, linear, and quadratic coefficients from both the paired bootstrap and traditional multiple linear regression (MLR). A brief comparison:

- Paired bootstrap: intercept 675.62 to 698.70; poly(x, 2)1 (linear term) 349.94 to 516.06; poly(x, 2)2 (quadratic term) 289.22 to 435.27
- Traditional MLR: intercept 674.36 to 699.37; poly(x, 2)1 (linear term) 342.99 to 519.89; poly(x, 2)2 (quadratic term) 274.97 to 451.86
The intervals are quite close, indicating that both methods give similar estimates for the coefficients of a quadratic model.
The plot confirms a quadratic relationship between x and y. When the quadratic model is correctly specified, the paired bootstrap and traditional MLR give very similar intervals, so both methods work well. If the model is misspecified as linear, however, the two methods can diverge: the classic intervals (and the residual bootstrap) treat the leftover curvature as if it were independent random error, so their intervals for the intercept and slope are not trustworthy. The paired bootstrap, by resampling (x, y) pairs, still gives a valid interval for the slope of the best-fitting line through the data. So if my goal were to provide an estimate of the linear slope, I would prefer the paired bootstrap interval, while communicating that a linear summary of a quadratic trend is itself a limitation.
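The misspecified comparison the question asks for could be run as follows; this is a sketch assuming the same x, y, and lmboot library as above (the resulting intervals will depend on the seed):

#Misspecified model: linear only
boot.p.lin<-paired.boot(y~x,B=1000,seed=1234)
t(apply(boot.p.lin$bootEstParam,2,quantile,probs=c(.025,.975))) #paired bootstrap intervals
confint(lm(y~x)) #classic intervals under the (incorrect) linear model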