Bootstrap t

Z confidence intervals

if underlying population is normal
and we know \( \sigma \)
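
As a minimal sketch of a Z interval, assuming a simulated normal sample and a made-up known \( \sigma \) (neither comes from the Bangladesh data):

```r
# Hypothetical example: simulated normal data with a known sigma
set.seed(1)
sigma <- 10                      # assumed known population standard deviation
x <- rnorm(50, mean = 100, sd = sigma)

n <- length(x)
z <- qnorm(0.975)                # upper 2.5% point of the standard normal
z_ci <- mean(x) + c(-1, 1) * z * sigma / sqrt(n)
z_ci
```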

t confidence intervals

if the underlying population is normal,
or at least symmetric and we have a large sample size
we use the sample standard deviation \( S \) instead of \( \sigma \), so the standard error is estimated by \( S/\sqrt{n} \)
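
The t interval can also be computed by hand, which shows where the sample standard deviation enters. A sketch on a small simulated sample (made up for illustration):

```r
# Hypothetical example: a small simulated sample
set.seed(1)
x <- rnorm(25, mean = 100, sd = 10)

n <- length(x)
se <- sd(x) / sqrt(n)            # estimated standard error of the mean
q <- qt(0.975, df = n - 1)       # t quantile with n - 1 degrees of freedom
t_ci <- mean(x) + c(-1, 1) * q * se
t_ci

# t.test() reports the same interval
t.test(x)$conf.int
```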

not normal? not symmetric?

we need a sampling distribution to use instead of the t-distribution
bootstrap
previously we saw a bootstrap percentile confidence interval
now we'll study a version called the bootstrap t

Data example

Load the Bangladesh data set
make a vector of the chlorine levels
make sure to omit the NA values
Look at and comment on the sample data

library(resampledata)
chlor <- na.omit(Bangladesh$Chlorine)
hist(chlor)

[Figure: histogram of the chlorine levels]

The distribution is skewed strongly to the right.

Because the data are skewed, a \( t \) confidence interval is not appropriate.
Nonetheless, for comparison purposes, find a 95% \( t \) confidence interval for the mean chlorine level

Let's do a new comparison as we go.
Go back and time the t-test by adding the time commands:

start_time <- Sys.time()
t.test(chlor)


    One Sample t-test

data:  chlor
t = 6.0979, df = 268, p-value = 3.736e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  52.87263 103.29539
sample estimates:
mean of x 
 78.08401

end_time <- Sys.time()
end_time - start_time

Time difference of 0.007549047 secs
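
A more compact way to time a single expression is base R's system.time(). A sketch on simulated stand-in data (not the Bangladesh sample):

```r
# Stand-in data: simulated right-skewed sample (not the Bangladesh data)
set.seed(1)
x <- rexp(269, rate = 1 / 78)

# system.time() evaluates the expression and returns user, system and elapsed times
timing <- system.time(t.test(x))
timing["elapsed"]
```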

bootstrap percentile confidence interval

Find a 95% bootstrap confidence interval
This is what we did in Chapter 5.
Remember the idea:
- do a bootstrap resample
- compute its mean
- store that mean in a vector
- repeat 10⁵ times
- find the middle 95% of the bootstrap distribution
time the bootstrap loop

Bootstrap loop

start_time <- Sys.time()
N <- 10^5
percboot <- numeric(N)
for (i in 1:N) {
  samp <- sample(chlor, length(chlor), replace = TRUE)  # one bootstrap resample
  percboot[i] <- mean(samp)                             # store its mean
}
end_time <- Sys.time()
end_time - start_time

Time difference of 5.656472 secs

95% bootstrap confidence interval

hist(percboot)

[Figure: histogram of the bootstrap means]

quantile(percboot,c(.025, .975))

     2.5%     97.5% 
 54.98024 104.60336
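
The same loop can be written more compactly with replicate(). This is only a sketch: it uses simulated skewed data as a stand-in for the chlorine levels and fewer resamples than above to keep it quick.

```r
# Stand-in data: simulated right-skewed sample (not the Bangladesh data)
set.seed(1)
x <- rexp(269, rate = 1 / 78)

N <- 10^4                        # fewer resamples than in the text, for speed
# Each replicate draws one bootstrap resample and returns its mean
percboot <- replicate(N, mean(sample(x, length(x), replace = TRUE)))

# Middle 95% of the bootstrap distribution
perc_ci <- quantile(percboot, c(0.025, 0.975))
perc_ci
```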

Ready for t-bootstrap confidence interval

but first

let's think a little more about how t confidence intervals work
an estimate of the true mean, \( \mu \), is given by the confidence interval: \[ \begin{aligned} \overline{X}-q\frac{S}{\sqrt{n}} &\leq \mu \leq \overline{X}+q\frac{S}{\sqrt{n}}\\ -q\frac{S}{\sqrt{n}} &\leq \mu-\overline{X} \leq +q\frac{S}{\sqrt{n}}\\ -q &\leq \frac{\mu-\overline{X}}{\frac{S}{\sqrt{n}}} \leq +q \end{aligned} \]

disclaimer

The statistic

\[ \frac{\mu-\overline{X}}{\frac{S}{\sqrt{n}}} \] is usually written

\[ \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}} \]

My way avoids a small complication with the inequalities: solving for \( \mu \) never requires reversing them
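
To see the complication, here is a sketch of the algebra with the usual form of the statistic, written for general lower and upper quantiles \( q_1 \) and \( q_2 \). Isolating \( \mu \) means multiplying through by \( -1 \), which reverses the inequalities, so the upper quantile ends up in the lower endpoint:

\[ \begin{aligned} q_1 &\leq \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}} \leq q_2\\ q_1\frac{S}{\sqrt{n}}-\overline{X} &\leq -\mu \leq q_2\frac{S}{\sqrt{n}}-\overline{X}\\ \overline{X}-q_2\frac{S}{\sqrt{n}} &\leq \mu \leq \overline{X}-q_1\frac{S}{\sqrt{n}} \end{aligned} \]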

Strategy

The idea is to mimic what we do with a \( t \)-distribution
We will replace \( \mu \) with the sample mean \( \overline{x} \)
We will use a bootstrap mean for \( \overline{X} \): \( \overline{X^*} \)
We will use the standard deviation of the bootstrap resample: \( S^* \)
We will generate a distribution for \[ \frac{\overline{x}-\overline{X^*}}{\frac{S^*}{\sqrt{n}}} \] by running the resample loop many times
We will find bootstrap quantiles, \( q_1 \) and \( q_2 \), such that \[ P\left(q_1 \leq \frac{\overline{x}-\overline{X^*}}{\frac{S^*}{\sqrt{n}}} \leq q_2\right)=1-\alpha \] where \( 1-\alpha \) is the confidence level (e.g., 0.95)
Unlike quantiles from a normal or \( t \) distribution, \( q_1 \) and \( q_2 \) will generally not be symmetric about 0

Step One

Let \( \overline{X^*} \) be the mean of a bootstrap resample
Let \( S^* \) be the standard deviation of the bootstrap resample
\( \overline{x} \) is the mean of the sample
\( n \) is the size of the sample
Compute \( \frac{\overline{x}-\overline{X^*}}{\frac{S^*}{\sqrt{n}}} \) for many (\( 10^5 \)) bootstrap resamples

Example (Exercise 7.29)

library(resampledata)
chlor <- na.omit(Bangladesh$Chlorine)

xbar <- mean(chlor)
n <- length(chlor)

start_time <- Sys.time()
N <- 10^5
bootT <- numeric(N)
for (i in 1:N) {
  Xstar <- sample(chlor, n, replace = TRUE)            # one bootstrap resample
  Sstar <- sd(Xstar)                                   # its standard deviation
  bootT[i] <- (xbar - mean(Xstar)) / (Sstar / sqrt(n)) # bootstrap t statistic
}
end_time <- Sys.time()
end_time - start_time

Time difference of 13.89275 secs

Step Two

Compute quantiles, \( q_1 \) and \( q_2 \), from this bootstrap \( t \) distribution

quantile(bootT, c(0.025, 0.975))

     2.5%     97.5% 
-1.660158  2.658922

Step Three

compute the confidence interval
you'll need the margin of error (MOE):
- use the sample standard deviation
- use \( \overline{x} \) from the sample

samplesd <- sd(chlor)
xbar+quantile(bootT, 0.025)*samplesd/sqrt(length(chlor))

    2.5% 
56.82553

xbar+quantile(bootT, 0.975)*samplesd/sqrt(length(chlor))

   97.5% 
112.1318

xbar+quantile(bootT, c(0.025, 0.975))*samplesd/sqrt(length(chlor))

     2.5%     97.5% 
 56.82553 112.13177
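
Steps one to three can be wrapped into one function. This is a sketch, not a packaged routine: boot_t_ci is a hypothetical helper name, the data are a simulated stand-in for the chlorine levels, and the default N is smaller than in the example to keep it quick.

```r
# Sketch of a reusable bootstrap t interval for the mean.
# boot_t_ci is a hypothetical helper name, not a function from any package.
boot_t_ci <- function(x, conf = 0.95, N = 10^4) {
  n <- length(x)
  xbar <- mean(x)
  # Step one: bootstrap distribution of (xbar - Xbar*) / (S* / sqrt(n))
  bootT <- replicate(N, {
    xstar <- sample(x, n, replace = TRUE)
    (xbar - mean(xstar)) / (sd(xstar) / sqrt(n))
  })
  # Step two: bootstrap quantiles q1 and q2
  alpha <- 1 - conf
  q <- quantile(bootT, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  # Step three: xbar + q * S / sqrt(n), using the sample standard deviation
  xbar + q * sd(x) / sqrt(n)
}

# Stand-in data: simulated right-skewed sample (not the Bangladesh data)
set.seed(1)
x <- rexp(269, rate = 1 / 78)
ci <- boot_t_ci(x)
ci
```

With the real data, boot_t_ci(chlor, N = 10^5) should match the interval above up to simulation noise.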

Conclusions

The \( t \) confidence interval is almost certainly unreliable here: the sample is strongly right-skewed, so the normality assumption fails
However, the \( t \)-test confidence interval is fast (under 0.01 seconds)
The bootstrap-\( t \) confidence interval is the most computationally expensive: about 14 seconds, versus about 6 seconds for the bootstrap percentile interval
Is this critical? I honestly don't know. It took about 14 seconds to run for 269 data points. In the “real world” data sets might be much larger.
Which is best? The bootstrap-\( t \) is best.
Why? Running simulations from known populations shows this to be the case. (See section 7.5.3 for this discussion.)
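
A simulation of that kind might look like the sketch below. Everything specific in it is my assumption, not taken from the text: an Exp(1) population (true mean 1), samples of size 30, and deliberately small simulation sizes so it runs in a few seconds. With so few repetitions the coverage estimates are noisy, so treat the output as a rough indication only.

```r
# Sketch of a coverage check on a known, skewed population.
# Assumptions (not from the text): Exp(1) population with true mean 1,
# samples of size n = 30, and small simulation sizes to keep it quick.
set.seed(1)
mu <- 1
n <- 30
n_sims <- 200                    # number of simulated samples
N <- 1000                        # bootstrap resamples per sample

# TRUE if the interval ci contains the true mean
covers <- function(ci) ci[1] <= mu && mu <= ci[2]

one_sim <- function() {
  x <- rexp(n, rate = 1)
  xbar <- mean(x)
  se <- sd(x) / sqrt(n)
  # One set of resamples feeds both bootstrap intervals:
  # row 1 holds bootstrap means, row 2 holds bootstrap t statistics
  boot <- replicate(N, {
    xstar <- sample(x, n, replace = TRUE)
    c(mean(xstar), (xbar - mean(xstar)) / (sd(xstar) / sqrt(n)))
  })
  q_perc <- quantile(boot[1, ], c(0.025, 0.975), names = FALSE)
  q_t <- quantile(boot[2, ], c(0.025, 0.975), names = FALSE)
  c(t = covers(as.numeric(t.test(x)$conf.int)),
    percentile = covers(q_perc),
    boot_t = covers(xbar + q_t * se))
}

results <- replicate(n_sims, one_sim())
coverage <- rowMeans(results)    # proportion of intervals containing mu
coverage
```

An interval method is doing its job if its coverage is close to the nominal 0.95.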