Class 3B - One Sample Bootstrap

What if you have non-normal data?

Typically, when making inference for a mean we want to have:

Independence
Sampling from a random/representative mechanism
Data has a approximately normal distributions AND/OR there is a large enough sample size

However, in reality, we may have non-normal data and a small sample size.

Example 1: How much oil?

How much oil will ultimately be produced by wells in a given field is key information in deciding whether to drill more wells. We have data for the estimated total amounts of oil recovered from 64 wells in the Devonian Richmond Dolomite area of the Michigan basin, in thousands of barrels. Take these wells to be an SRS of wells in the area.

Sample Statistics:

\(n=64\)
\(\bar{x}=48.25\)
\(s=40.24\)

Part A: Give a 95% t-confidence interval for the mean amount of oil recovered from all wells in this area.

oil<-read.csv("oil.csv", 
              header=TRUE)

head(oil)

##   barrels
## 1    21.7
## 2    43.4
## 3    79.5
## 4    82.2
## 5    56.4
## 6    36.6

# Calculate components of test statistic
n<-dim(oil)[1]
xbar<-mean(oil$barrels)
xbar

## [1] 48.24844

sd<-sd(oil$barrels)
sd

## [1] 40.23979

se<-sd/sqrt(n)

# Critical Value
qt(0.975, df=63)

## [1] 1.998341

# confidence interval
xbar+c(-1,1)*qt(0.975, df=63)*sd/sqrt(n)

## [1] 38.19684 58.30004

We can also use the built in t-test function.

# one-sample t-test
t.test(oil)

## 
##  One Sample t-test
## 
## data:  oil
## t = 9.5922, df = 63, p-value = 6.198e-14
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  38.19684 58.30004
## sample estimates:
## mean of x 
##  48.24844

Part B: Make a graph to describe the data.

# histogram of barrels
hist(oil$barrels)

Part C: Comparing Methods

A computer intensive methods that gives accurate confidence intervals without assuming any specific shape of the distribution gives a 95% confidence interval of 40.28 to 60.32.

How does the t-interval compare with this?

The t-interval is larger has a smaller lower bound and smaller upper bound

Is it ok to use t-procedures with these data?

Is seems ok to use t-procedures because there is a large sample size.

Bootstrapping!

Bootstrapping is a highly flexible re-sampling procedure that can be used to estimate the sampling distribution of any statistic!

Here is the bootstrap function that I wrote to estimate the sampling distribution of the mean, which will then be used to create a confidence interval.

# Function for bootstrap confidence interval 
# for one sample mean
bootStrapCI1<-function(data, nsim){
  n<-length(data)
  bootCI<-c()
  
  for(i in 1:nsim){
    bootSamp<-sample(1:n, n, replace=TRUE)
    thisXbar<-mean(data[bootSamp])
    bootCI<-c(bootCI, thisXbar)
  }
  return(bootCI)
}

This is a function that take in a data set and the number of simulations that we want to run (this should be at least 1000).

There are a couple different ways we can calculate the confidence interval:

oilBootCI<-bootStrapCI1(oil$barrels, nsim=1000)

# Quantile Method 
quantile(oilBootCI, c(0.025, 0.975))

##     2.5%    97.5% 
## 39.45105 58.52301

# 2.5%      97.5% 
# 39.45738  59.21430 

# Hybrid Method (T-critical with Bootstrap SE)
se<-sd(oilBootCI)
mean(oil$barrels)+c(-1,1)*qt(0.975, df=n-1)*se

## [1] 38.36360 58.13327

# 38.22158 58.27530

When comparing these methods, we found that the confidence intervals weren’t very different. This is because we do have a large enough sample size, despite the lack of normality.

Let’s try a different example where we have a smaller sample size and a non-normal distribution.

Example 2: E. coli in swimming areas

To investigate water quality, the Columbus Dispatch took water samples at 16 Ohio State Park swimming areas in central Ohio. Those samples were taken to laboratories and tested for E.coli, which are bacteria that can cause serious gastrointestinal problems. If a 100-milliliter sample (about 3.3 ounces) of water contains more than 130 E. coli bacteria, it is considered unsafe. Here are summary statistics for the E. coli levels found by the laboratories:

\(n=16\)
\(\bar{x}=56.28\)
\(s=77.29\)

Take these water samples to be an SRS of the water in all swimming areas in central Ohio.

PART A: Construct a 95% confidence interval for the mean E.coli levels.

## Ex 2: Ecoli 
ecoli<-c(291.0, 190.4, 47.0, 86.0, 44.0, 18.9, 1.0, 50.0,
         10.9, 45.7, 28.5, 8.6, 9.6, 16.0, 34.0, 18.9)

# By hand
xbar<-mean(ecoli) #56.28125
sd<-sd(ecoli) #77.28992
n<-length(ecoli) #16

se<-sd/sqrt(n) #19.32248
mu_0<-130 # One-sided upper alternative 

test_stat<-(xbar-mu_0)/se
test_stat #-3.81518

## [1] -3.81518

pt(test_stat, df=n-1, lower.tail=F)

## [1] 0.999155

# 0.999155
# fail to rejec the null, no evidence for alt

PART B: Make a graph of the data. Describe the shape.

hist(ecoli)

PART C: Due to the strong skew and small sample size, create a bootstrap confidence interval for the mean.

# Bootstrap confidence interval for the mean
ecoliBootCI<-bootStrapCI1(ecoli, nsim=10000)

# Approximated sampling distribution
hist(ecoliBootCI)

# Quantile Method 
quantile(ecoliBootCI, c(0.025, 0.975))

##     2.5%    97.5% 
## 25.80609 96.72609

# 2.5%      97.5% 
# 25.22484  97.02016  

# Hybrid Method (T-critical with Bootstrap SE)
se<-sd(ecoliBootCI)
mean(ecoli)+c(-1,1)*qt(0.975, df=n-1)*se

## [1] 16.58188 95.98062

# 18.9828 93.5797

PART D: Compare the three different confidence intervals. Which would you believe?

PART E: Are these data good evidence that on average the E. coli levels in these swimming areas were unsafe?

First, let’s consider a typical t-test approach.

# Using t.test
t.test(ecoli, mu=mu_0, alternative="greater")

## 
##  One Sample t-test
## 
## data:  ecoli
## t = -3.8152, df = 15, p-value = 0.9992
## alternative hypothesis: true mean is greater than 130
## 95 percent confidence interval:
##  22.40797      Inf
## sample estimates:
## mean of x 
##  56.28125

Bootstrap Hypothesis Tests

Here is the bootstrap function that I wrote to perform a bootstrap hypothesis test for the mean:

# Function for bootstrap hypothesis test
# for one sample mean
bootStrapHyp1<-function(data, mu0, nsim, alternative){
  testStat<-mean(data)
  nullDat<-data-mean(data)+mu0
  n<-length(data)
  
  bootNull<-c()
  
  for(i in 1:nsim){
    bootSamp<-sample(1:n, n, replace=TRUE)
    thisXbar<-mean(nullDat[bootSamp])
    bootNull<-c(bootNull, thisXbar)
  }
  
  if(alternative=="greater"){
    pval<-mean(bootNull>=testStat)
  }
  if(alternative=="less"){
    pval<-mean(bootNull<=testStat)
  }
  if(alternative=="not equal"){
    pval<-min(mean(bootNull>=testStat), mean(bootNull<=testStat))*2
  }
  return(pval)
}

PART F: Perform a bootstrap hypothesis test to assess whether there is data good evidence that on average the E. coli levels in these swimming areas were unsafe.

ecoliBootHyp<-bootStrapHyp1(ecoli, mu0=130, 
                            nsim=10000, alternative = "greater")
ecoliBootHyp

## [1] 1

The Bootstrap method of highly flexible!!!

What if we want to make inference about the median:

## Bootstrap for median 
nsim<-1000
bootMed<-c()
for(i in 1:nsim){
  bootSamp<-sample(1:n, n, replace=TRUE)
  thisMed<-quantile(ecoli[bootSamp], 0.5)
  bootMed<-c(bootMed, thisMed)
}

hist(bootMed)

quantile(bootMed, c(0.025, 0.975))

##    2.5%   97.5% 
## 15.9725 47.0000

What if we want to make inference about the standard deviation:

## Bootstrap for standard deviation 
bootSd<-c()
for(i in 1:nsim){
  bootSamp<-sample(1:n, n, replace=TRUE)
  thisSd<-sd(ecoli[bootSamp])
  bootSd<-c(bootSd, thisSd)
}

hist(bootSd)

quantile(bootSd, c(0.025, 0.975))

##      2.5%     97.5% 
##  17.01372 112.03663