Class 4A - Bootstrapping

What if you have non-normal data?

Typically, when making inference for a mean we want to have:

Independence
Sampling from a random/representative mechanism
Data has a approximately normal distributions AND/OR there is a large enough sample size

However, in reality, we may have non-normal data and a small sample size.

Example 1: How much oil?

How much oil will ultimately be produced by wells in a given field is key information in deciding whether to drill more wells. We have data for the estimated total amounts of oil recovered from 64 wells in the Devonian Richmond Dolomite area of the Michigan basin, in thousands of barrels. Take these wells to be an SRS of wells in the area.

Sample Statistics:

\(n=64\)
\(\bar{x}=48.25\)
\(s=40.24\)

Part A: Give a 95% t-confidence interval for the mean amount of oil recovered from all wells in this area.

oil<-read.csv("oil.csv", 
              header=TRUE)

head(oil)

##   barrels
## 1    21.7
## 2    43.4
## 3    79.5
## 4    82.2
## 5    56.4
## 6    36.6

# Calculate components of test statistic
n<-dim(oil)[1]
xbar<-mean(oil$barrels)
xbar

## [1] 48.24844

sd<-sd(oil$barrels)
sd

## [1] 40.23979

se<-sd/sqrt(n)

# Critical Value
qt(0.975, df=63)

## [1] 1.998341

# confidence interval
xbar+c(-1,1)*qt(0.975, df=63)*sd/sqrt(n)

## [1] 38.19684 58.30004

We can also use the built in t-test function.

# one-sample t-test
t.test(oil)

## 
##  One Sample t-test
## 
## data:  oil
## t = 9.5922, df = 63, p-value = 6.198e-14
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  38.19684 58.30004
## sample estimates:
## mean of x 
##  48.24844

Part B: Make a graph to describe the data.

# histogram of barrels
hist(oil$barrels)

Part C: Comparing Methods

A computer intensive methods that gives accurate confidence intervals without assuming any specific shape of the distribution gives a 95% confidence interval of 40.28 to 60.32.

How does the t-interval compare with this?

The t-interval is larger has a smaller lower bound and smaller upper bound

Is it ok to use t-procedures with these data?

Is seems ok to use t-procedures because there is a large sample size.

Bootstrapping!

Bootstrapping is a highly flexible re-sampling procedure that can be used to estimate the sampling distribution of any statistic!

Here is the bootstrap function that I wrote to estimate the sampling distribution of the mean, which will then be used to create a confidence interval for a mean.

# Function for bootstrap confidence interval 
# for one sample mean
bootStrapCI1<-function(data, nsim){
  n<-length(data)
  bootCI<-c()
  
  for(i in 1:nsim){
    bootSamp<-sample(1:n, n, replace=TRUE)
    thisXbar<-mean(data[bootSamp])
    bootCI<-c(bootCI, thisXbar)
  }
  return(bootCI)
}

This is a function that take in a data set and the number of simulations that we want to run (this should be at least 1000).

There are a couple different ways we can calculate the confidence interval:

oilBootCI<-bootStrapCI1(oil$barrels, nsim=1000)

# Quantile Method 
quantile(oilBootCI, c(0.025, 0.975))

##     2.5%    97.5% 
## 39.26730 58.13324

# 2.5%      97.5% 
# 39.45738  59.21430 

# Hybrid Method (T-critical with Bootstrap SE)
se<-sd(oilBootCI)
mean(oil$barrels)+c(-1,1)*qt(0.975, df=n-1)*se

## [1] 38.63612 57.86075

# 38.22158 58.27530

When comparing these methods, we found that the confidence intervals weren’t very different. This is because we do have a large enough sample size, despite the lack of normality.

Let’s try a different example where we have a smaller sample size and a non-normal distribution.

Example 2: E. coli in swimming areas

To investigate water quality, the Columbus Dispatch took water samples at 16 Ohio State Park swimming areas in central Ohio. Those samples were taken to laboratories and tested for E.coli, which are bacteria that can cause serious gastrointestinal problems. If a 100-milliliter sample (about 3.3 ounces) of water contains more than 130 E. coli bacteria, it is considered unsafe. Here are summary statistics for the E. coli levels found by the laboratories:

\(n=16\)
\(\bar{x}=56.28\)
\(s=77.29\)

Take these water samples to be an SRS of the water in all swimming areas in central Ohio.

PART A: Construct a 95% confidence interval for the mean E.coli levels.

## Ex 2: Ecoli 
ecoli<-c(291.0, 190.4, 47.0, 86.0, 44.0, 18.9, 1.0, 50.0,
         10.9, 45.7, 28.5, 8.6, 9.6, 16.0, 34.0, 18.9)

# By hand
xbar<-mean(ecoli) #56.28125
sd<-sd(ecoli) #77.28992
n<-length(ecoli) #16

se<-sd/sqrt(n) #19.32248
mu_0<-130 # One-sided upper alternative 

test_stat<-(xbar-mu_0)/se
test_stat #-3.81518

## [1] -3.81518

pt(test_stat, df=n-1, lower.tail=F)

## [1] 0.999155

# 0.999155
# fail to rejec the null, no evidence for alt

PART B: Make a graph of the data. Describe the shape.

hist(ecoli)

PART C: Due to the strong skew and small sample size, create a bootstrap confidence interval for the mean.

# Bootstrap confidence interval for the mean
ecoliBootCI<-bootStrapCI1(ecoli, nsim=10000)

# Approximated sampling distribution
hist(ecoliBootCI)

# Quantile Method 
quantile(ecoliBootCI, c(0.025, 0.975))

##     2.5%    97.5% 
## 25.48078 96.81906

# 2.5%      97.5% 
# 25.22484  97.02016  

# Hybrid Method (T-critical with Bootstrap SE)
se<-sd(ecoliBootCI)
mean(ecoli)+c(-1,1)*qt(0.975, df=n-1)*se

## [1] 16.28289 96.27961

# 18.9828 93.5797

PART D: Compare the three different confidence intervals. Which would you believe?

PART E: Are these data good evidence that on average the E. coli levels in these swimming areas were unsafe?

First, let’s consider a typical t-test approach.

# Using t.test
t.test(ecoli, mu=mu_0, alternative="greater")

## 
##  One Sample t-test
## 
## data:  ecoli
## t = -3.8152, df = 15, p-value = 0.9992
## alternative hypothesis: true mean is greater than 130
## 95 percent confidence interval:
##  22.40797      Inf
## sample estimates:
## mean of x 
##  56.28125

Bootstrap Hypothesis Tests

Here is the bootstrap function that I wrote to perform a bootstrap hypothesis test for the mean:

# Function for bootstrap hypothesis test
# for one sample mean
bootStrapHyp1<-function(data, mu0, nsim, alternative){
  testStat<-mean(data)
  nullDat<-data-mean(data)+mu0
  n<-length(data)
  
  bootNull<-c()
  
  for(i in 1:nsim){
    bootSamp<-sample(1:n, n, replace=TRUE)
    thisXbar<-mean(nullDat[bootSamp])
    bootNull<-c(bootNull, thisXbar)
  }
  
  if(alternative=="greater"){
    pval<-mean(bootNull>=testStat)
  }
  if(alternative=="less"){
    pval<-mean(bootNull<=testStat)
  }
  if(alternative=="not equal"){
    pval<-min(mean(bootNull>=testStat), mean(bootNull<=testStat))*2
  }
  return(pval)
}

PART F: Perform a bootstrap hypothesis test to assess whether there is data good evidence that on average the E. coli levels in these swimming areas were unsafe.

ecoliBootHyp<-bootStrapHyp1(ecoli, mu0=130, 
                            nsim=10000, alternative = "greater")
ecoliBootHyp

## [1] 1

The Bootstrap method of highly flexible!!!

What if we want to make inference about the median:

## Bootstrap for median 
nsim<-1000
bootMed<-c()
for(i in 1:nsim){
  bootSamp<-sample(1:n, n, replace=TRUE)
  thisMed<-quantile(ecoli[bootSamp], 0.5)
  bootMed<-c(bootMed, thisMed)
}

hist(bootMed)

quantile(bootMed, c(0.025, 0.975))

##  2.5% 97.5% 
##    16    47

What if we want to make inference about the standard deviation:

## Bootstrap for standard deviation 
bootSd<-c()
for(i in 1:nsim){
  bootSamp<-sample(1:n, n, replace=TRUE)
  thisSd<-sd(ecoli[bootSamp])
  bootSd<-c(bootSd, thisSd)
}

hist(bootSd)

quantile(bootSd, c(0.025, 0.975))

##     2.5%    97.5% 
##  19.0060 110.4278

Two Sample Randomization Methods

Example 3: Tropical Flowers

Different varieties of the tropical flower Heliconia are fertilized by different species of hummingbirds. Over time, the lengths of the flowers and the forms of the hummingbirds’ beaks have evolved to match each other. Data on the lengths in millimeters of two color varieties of the same species of flower on the island of Dominica can be found in the R script.

# H. caribaea RED
red<-c(42.90, 42.01, 41.93, 43.09, 41.47, 41.69, 39.78, 
       39.63, 42.18, 40.66, 37.87, 39.16, 37.40, 38.20,
       38.10, 37.97, 38.79, 38.23, 38.87, 37.78, 38.01)

# H. caribaea YELLOW
yellow<-c(36.78, 37.02, 36.52, 36.11, 36.03, 35.45, 38.13,
          37.1, 35.17, 36.82, 36.66, 35.68, 36.03, 34.57, 34.63)

Let’s check the normality assumption

qqnorm(red)
qqline(red)

hist(red)

Traditionally, we might use a two sample t-test to compare the two means of the red and yellow Heliconia flowers:

t.test(red, yellow)

## 
##  Welch Two Sample t-test
## 
## data:  red and yellow
## t = 7.4254, df = 31.306, p-value = 2.171e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.623335 4.609046
## sample estimates:
## mean of x mean of y 
##  39.79619  36.18000

How does this compare to inference from a randomization method?

Two Sample Bootstrap Confidence Interval for Difference in Means

Here is a function to construct the bootstrap distribution of the differences in means, which we will later use to calculate the confidence intervals.

# Function for bootstrap confidence interval
# for difference in two sample means
bootStrapCI2<-function(data1, data2, nsim){
  
  n1<-length(data1)
  n2<-length(data2)
  
  bootCI2<-c()
  
  for(i in 1:nsim){
    bootSamp1<-sample(1:n1, n1, replace=TRUE)
    bootSamp2<-sample(1:n2, n2, replace=TRUE)
    thisXbar<-mean(data1[bootSamp1])-mean(data2[bootSamp2])
    bootCI2<-c(bootCI2, thisXbar)
  }
  
  return(bootCI2)
}

Let’s compute the confidence intervals using the quantile and hybrid methods:

flowerBootCI<-bootStrapCI2(red, yellow, nsim=10000)

# Quantile Method 
quantile(flowerBootCI, c(0.025, 0.975))

##     2.5%    97.5% 
## 2.713321 4.561629

# 2.5%    97.5% 
# 2.701988 4.555755 

# Hybrid Method: Parametric critical value and bootstrap SE
bootSE<-sd(flowerBootCI)
(mean(red)-mean(yellow))+c(-1,1)*qt(0.975, df=14)*bootSE

## [1] 2.599046 4.633335

# 2.595753 4.636628

Two Sample Boostrap Hypothesis Test

Here is some code to carry out a hypothesis test and produce the desired p-value.

Note: Think carefully about the way the null hypothesis is imposed to construct the null distribution.

# Function for bootstrap hypothesis test 
# for difference in two sample means
bootStrapHyp2<-function(data1, data2, nsim, alternative){
  
  testStat<-mean(data1)-mean(data2)
  n1<-length(data1)
  n2<-length(data2)
  allDat<-c(data1, data2)
  
  bootNull<-c()
  
  for(i in 1:nsim){
    bootSamp<-sample(1:(n1+n2), (n1+n2), replace=TRUE)
    boot1<-bootSamp[1:n1]
    boot2<-bootSamp[(n1+1):(n1+n2)]
    thisXbar<-mean(allDat[boot1])-mean(allDat[boot2])
    bootNull<-c(bootNull, thisXbar)
  }
  
  if(alternative=="greater"){
    pval<-mean(bootNull>=testStat)
  }
  if(alternative=="less"){
    pval<-mean(bootNull<=testStat)
  }
  if(alternative=="not equal"){
    pval<-min(mean(bootNull>=testStat), mean(bootNull<=testStat))*2
  }
  return(pval)
}

Applying this to the Heliconia example we find:

# Applying this function to the red and yellow flowers
flowerBoot<-bootStrapHyp2(red, yellow, nsim=10000, alternative="not equal")
flowerBoot

## [1] 0

Permutation Test for Difference in Means

Here is code to carrying out a permutation test.

Note the difference between the bootstrap methods and the permutation methods.

# Function for permutation test for
# for difference in two sample means
permHyp2<-function(data1, data2, nsim, alternative){
  testStat<-mean(data1)-mean(data2)
  n1<-length(data1)
  n2<-length(data2)
  allDat<-c(data1, data2)
  
  permNull<-c()
  
  for(i in 1:nsim){
    permSamp<-sample(1:(n1+n2), n1, replace=FALSE)
    thisXbar<-mean(allDat[permSamp])-mean(allDat[-permSamp])
    permNull<-c(permNull, thisXbar)
  }
  
  if(alternative=="greater"){
    pval<-mean(permNull>=testStat)
  }
  if(alternative=="less"){
    pval<-mean(permNull<=testStat)
  }
  if(alternative=="not equal"){
    pval<-min(mean(permNull>=testStat), mean(permNull<=testStat))*2
  }
  return(pval)
}

If we were to do all permutations for the Heliconia example that would be alot!

# Number of combinations needed for exact test
choose(36,21) #5567902560

## [1] 5567902560

So instead we will carry out the permutation many many times:

flowerPerm<-permHyp2(red, yellow, nsim=100000, alternative="not equal")
flowerPerm #0

## [1] 0