Bootstrap t

Z confidence intervals

  • if underlying population is normal
  • and we know \( \sigma \)
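
As a minimal sketch of a Z interval, assuming simulated normal data and a population standard deviation that is treated as known (the seed, mean, `sigma`, and sample size below are made-up illustration values, not part of the Bangladesh example):

```r
set.seed(1)                               # illustrative seed (assumption)
sigma <- 10                               # assumed known population sd
x <- rnorm(50, mean = 100, sd = sigma)    # simulated normal sample

n    <- length(x)
xbar <- mean(x)
q    <- qnorm(0.975)                      # standard normal quantile for 95%

# Z interval: xbar -/+ q * sigma / sqrt(n)
z_ci <- c(lower = xbar - q * sigma / sqrt(n),
          upper = xbar + q * sigma / sqrt(n))
z_ci
```

The only sample quantity used is the mean; the spread comes entirely from the assumed `sigma`.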

t confidence intervals

  • if the underlying population is normal,
  • or at least symmetric and we have a large sample size
  • we use the sample standard deviation \( S \) instead of \( \sigma \), so the standard error is estimated by \( S/\sqrt{n} \)
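
A minimal sketch of the t interval computed by hand and checked against `t.test()`, assuming a simulated normal sample (the seed and sample settings are illustration values only):

```r
set.seed(1)                             # illustrative seed (assumption)
x <- rnorm(30, mean = 100, sd = 10)     # simulated sample; sigma treated as unknown

n    <- length(x)
xbar <- mean(x)
s    <- sd(x)                           # sample standard deviation
q    <- qt(0.975, df = n - 1)           # t quantile with n - 1 degrees of freedom

# t interval: xbar -/+ q * s / sqrt(n)
t_ci <- c(xbar - q * s / sqrt(n), xbar + q * s / sqrt(n))
t_ci
t.test(x)$conf.int                      # should agree with t_ci
```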

not normal? not symmetric?

  • we need a sampling distribution to use instead of the t-distribution
  • bootstrap
  • previously we saw a bootstrap percentile confidence interval
  • now we'll study a version called the bootstrap t

Data example

  • Load the Bangladesh data set
  • make a vector of the chlorine levels
  • make sure to omit the NA values
  • Look at and comment on the sample data
library(resampledata)
chlor <- na.omit(Bangladesh$Chlorine)
hist(chlor)

[Plot: histogram of chlor]

The distribution is skewed strongly to the right.

  • Because the data are skewed, a \( t \) confidence interval is not appropriate.
  • Nonetheless, for comparison purposes, find a 95% \( t \) confidence interval for the mean chlorine level.
  • We will also compare running times as we go.
  • Go back and time the t-test by adding the timing commands:
start_time <- Sys.time()
t.test(chlor)

    One Sample t-test

data:  chlor
t = 6.0979, df = 268, p-value = 3.736e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  52.87263 103.29539
sample estimates:
mean of x 
 78.08401 
end_time <- Sys.time()
end_time - start_time
Time difference of 0.007549047 secs
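
As an aside, base R's `system.time()` can time a single expression without the start/end bookkeeping. A minimal sketch, using a simulated stand-in for `chlor` (the vector `x` below is an assumption for illustration):

```r
set.seed(1)                    # illustrative seed (assumption)
x <- rexp(269)                 # simulated stand-in for chlor

# system.time() evaluates the expression and reports user, system,
# and elapsed time in seconds
timing <- system.time(t.test(x))
timing["elapsed"]
```

Timings vary from run to run and from machine to machine, so only large differences are meaningful.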

bootstrap percentile confidence interval

  • Find a 95% bootstrap confidence interval
  • This is what we did in Chapter 5.
  • Remember the idea:
    • do a bootstrap resample
    • compute its mean
    • store that mean in a vector
    • repeat \( 10^5 \) times
    • find the middle 95% of the bootstrap distribution
  • time the bootstrap loop

Bootstrap loop

start_time <- Sys.time()
N <- 10^5
percboot <- numeric(N)
for (i in 1:N) {
  samp <- sample(chlor, length(chlor), replace = TRUE)  # resample with replacement
  percboot[i] <- mean(samp)                             # store the resample mean
}
end_time <- Sys.time()
end_time - start_time
Time difference of 5.656472 secs

95% bootstrap confidence interval

hist(percboot)

[Plot: histogram of percboot]

quantile(percboot,c(.025, .975))
     2.5%     97.5% 
 54.98024 104.60336 

Ready for t-bootstrap confidence interval

but first

  • let's think a little more about how t confidence intervals work
  • a confidence interval for the true mean, \( \mu \), is given by the first line below, where \( q \) is the appropriate quantile of the \( t \) distribution; the next two lines rearrange it: \[ \begin{aligned} \overline{X}-q\frac{S}{\sqrt{n}} &\leq \mu \leq \overline{X}+q\frac{S}{\sqrt{n}}\\ -q\frac{S}{\sqrt{n}} &\leq \mu-\overline{X} \leq +q\frac{S}{\sqrt{n}}\\ -q &\leq \frac{\mu-\overline{X}}{\frac{S}{\sqrt{n}}} \leq +q \end{aligned} \]

disclaimer

  • The statistic

\[ \frac{\mu-\overline{X}}{\frac{S}{\sqrt{n}}} \] is usually written

\[ \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}} \]

  • My way will avoid a small complication with the inequalities
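
As a sketch of what that complication presumably is, here is the same rearrangement done with the usual form of the statistic, assuming lower and upper quantiles \( q_1 \) and \( q_2 \) that need not be symmetric:

\[ \begin{aligned} q_1 &\leq \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}} \leq q_2\\ q_1\frac{S}{\sqrt{n}} &\leq \overline{X}-\mu \leq q_2\frac{S}{\sqrt{n}}\\ \overline{X}-q_2\frac{S}{\sqrt{n}} &\leq \mu \leq \overline{X}-q_1\frac{S}{\sqrt{n}} \end{aligned} \]

Solving for \( \mu \) reverses the inequalities, so the upper quantile lands in the lower endpoint and the lower quantile in the upper endpoint. With \( \mu-\overline{X} \) in the numerator, \( q_1 \) stays with the lower endpoint and \( q_2 \) with the upper one.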

Strategy

  • The idea is to mimic what we do with a \( t \)-distribution
  • We will replace \( \mu \) with the sample mean \( \overline{x} \)
  • We will use a bootstrap mean for \( \overline{X} \): \( \overline{X^*} \)
  • We will use the standard deviation of the bootstrap resample: \( S^* \)
  • We will generate a distribution for \[ \frac{\overline{x}-\overline{X^*}}{\frac{S^*}{\sqrt{n}}} \] by running the resample loop many times
  • We will find bootstrap quantiles, \( q_1 \) and \( q_2 \), such that \[ P\left(q_1 \leq \frac{\overline{x}-\overline{X^*}}{\frac{S^*}{\sqrt{n}}} \leq q_2\right)=1-\alpha \] where \( 1-\alpha \) is the confidence level (e.g. 0.95)
  • Unlike quantiles from a normal or \( t \) distribution, \( q_1 \) and \( q_2 \) will generally not be symmetric about zero

Step One

  • Let \( \overline{X^*} \) be the mean of a bootstrap resample
  • Let \( S^* \) be the standard deviation of the bootstrap resample
  • \( \overline{x} \) is the mean of the sample
  • \( n \) is the size of the sample
  • Compute \( \frac{\overline{x}-\overline{X^*}}{\frac{S^*}{\sqrt{n}}} \) for many (\( 10^5 \)) bootstrap resamples

Example (Exercise 7.29)

library(resampledata)
chlor <- na.omit(Bangladesh$Chlorine)

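xbar <- mean(chlor)   # sample mean
n <- length(chlor)    # sample size

start_time <- Sys.time()
N <- 10^5
bootT <- numeric(N)
for (i in 1:N) {
  Xstar <- sample(chlor, n, replace = TRUE)             # bootstrap resample
  Sstar <- sd(Xstar)                                    # its standard deviation
  bootT[i] <- (xbar - mean(Xstar)) / (Sstar / sqrt(n))  # bootstrap t statistic
}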
end_time <- Sys.time()
end_time - start_time
Time difference of 13.89275 secs

Step Two

  • Compute quantiles, \( q_1 \) and \( q_2 \), from this bootstrap \( t \) distribution
quantile(bootT, c(0.025, 0.975))
     2.5%     97.5% 
-1.660158  2.658922 

Step Three

  • compute the confidence interval
  • you'll need the margin of error (MOE):
    • use the sample standard deviation
    • use \( \overline{x} \) from the sample
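
Written out, the interval follows by using the bootstrap quantiles in place of the \( t \) quantiles. Here \( s \) denotes the standard deviation of the original sample, a notation introduced only for this sketch:

\[ q_1 \leq \frac{\mu-\overline{x}}{\frac{s}{\sqrt{n}}} \leq q_2 \quad\Longrightarrow\quad \overline{x}+q_1\frac{s}{\sqrt{n}} \leq \mu \leq \overline{x}+q_2\frac{s}{\sqrt{n}} \]

Both endpoints add a quantile times the standard error; the lower endpoint falls below \( \overline{x} \) because \( q_1 \) is negative.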

samplesd <- sd(chlor)
xbar+quantile(bootT, 0.025)*samplesd/sqrt(length(chlor))
    2.5% 
56.82553 
xbar+quantile(bootT, 0.975)*samplesd/sqrt(length(chlor))
   97.5% 
112.1318 
xbar+quantile(bootT, c(0.025, 0.975))*samplesd/sqrt(length(chlor))
     2.5%     97.5% 
 56.82553 112.13177 
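
The three steps can be collected into one function. This is a minimal sketch, not part of the original example: the name `boot_t_ci` and its arguments are hypothetical, and it is run here on simulated right-skewed data rather than on `chlor`.

```r
# Hypothetical helper (not from the source): bootstrap t interval for a mean
boot_t_ci <- function(x, N = 10^4, conf = 0.95) {
  x    <- as.numeric(na.omit(x))          # drop missing values
  n    <- length(x)
  xbar <- mean(x)
  bootT <- numeric(N)
  for (i in 1:N) {
    xstar    <- sample(x, n, replace = TRUE)                   # Step One: resample
    bootT[i] <- (xbar - mean(xstar)) / (sd(xstar) / sqrt(n))   # bootstrap t statistic
  }
  alpha <- 1 - conf
  q <- quantile(bootT, c(alpha / 2, 1 - alpha / 2), names = FALSE)  # Step Two
  xbar + q * sd(x) / sqrt(n)                                        # Step Three
}

set.seed(1)                        # illustrative seed (assumption)
x  <- rexp(100, rate = 1 / 50)     # simulated right-skewed stand-in data
ci <- boot_t_ci(x, N = 2000)
ci
```

A smaller `N` keeps the sketch quick; for a reported interval, something like the \( 10^5 \) resamples used above gives more stable quantiles.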

Conclusions

  • The \( t \) confidence interval is almost certainly unreliable here because the strongly skewed sample indicates that the population is not normal
  • However, the \( t \)-test confidence interval is fast (under 0.01 seconds here)
  • The bootstrap-\( t \) confidence interval is by far the most computationally expensive
  • Is this critical? I honestly don't know. It took about 14 seconds to run for 269 data points. In the “real world” data sets might be much larger.
  • Which is best? The bootstrap-\( t \) is best.
  • Why? Running simulations from known populations shows this to be the case. (See section 7.5.3 for this discussion.)
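
A minimal sketch of what such a simulation might look like, assuming an exponential population with true mean 1 as the known, right-skewed population. The population, sample size, and repetition counts are illustrative choices, not the settings used in section 7.5.3.

```r
# Hypothetical coverage check; all settings are illustrative assumptions
set.seed(1)
true_mean <- 1
n    <- 30      # sample size per simulated data set
sims <- 200     # number of simulated data sets
B    <- 500     # bootstrap resamples per data set

cover_t    <- logical(sims)
cover_boot <- logical(sims)
for (j in 1:sims) {
  x    <- rexp(n, rate = 1 / true_mean)
  xbar <- mean(x)
  se   <- sd(x) / sqrt(n)

  # ordinary t interval
  ci_t <- xbar + qt(c(0.025, 0.975), df = n - 1) * se
  cover_t[j] <- ci_t[1] <= true_mean && true_mean <= ci_t[2]

  # bootstrap t interval
  bootT <- replicate(B, {
    xstar <- sample(x, n, replace = TRUE)
    (xbar - mean(xstar)) / (sd(xstar) / sqrt(n))
  })
  ci_b <- xbar + quantile(bootT, c(0.025, 0.975), names = FALSE) * se
  cover_boot[j] <- ci_b[1] <= true_mean && true_mean <= ci_b[2]
}

# proportion of intervals that captured the true mean
coverage <- c(t = mean(cover_t), bootstrap_t = mean(cover_boot))
coverage
```

An interval method is doing its job when its coverage is close to the nominal 95%. With only 200 simulated data sets the estimates are rough, so a run this small is a way to explore the idea rather than settle the question.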