Typically, when making inference for a mean we want to have:
However, in reality, we may have non-normal data and a small sample size.
How much oil will ultimately be produced by wells in a given field is key information in deciding whether to drill more wells. We have data for the estimated total amounts of oil recovered from 64 wells in the Devonian Richmond Dolomite area of the Michigan basin, in thousands of barrels. Take these wells to be an SRS of wells in the area.
Sample Statistics:
oil<-read.csv("oil.csv",
header=TRUE)
head(oil)
## barrels
## 1 21.7
## 2 43.4
## 3 79.5
## 4 82.2
## 5 56.4
## 6 36.6
# Calculate components of test statistic
n<-dim(oil)[1]
xbar<-mean(oil$barrels)
xbar
## [1] 48.24844
sd<-sd(oil$barrels)
sd
## [1] 40.23979
se<-sd/sqrt(n)
# Critical Value
qt(0.975, df=63)
## [1] 1.998341
# confidence interval
xbar+c(-1,1)*qt(0.975, df=63)*sd/sqrt(n)
## [1] 38.19684 58.30004
We can also use the built in t-test function.
# one-sample t-test
t.test(oil)
##
## One Sample t-test
##
## data: oil
## t = 9.5922, df = 63, p-value = 6.198e-14
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 38.19684 58.30004
## sample estimates:
## mean of x
## 48.24844
# histogram of barrels
hist(oil$barrels)
A computer intensive methods that gives accurate confidence intervals without assuming any specific shape of the distribution gives a 95% confidence interval of 40.28 to 60.32.
How does the t-interval compare with this?
Is it ok to use t-procedures with these data?
Bootstrapping is a highly flexible re-sampling procedure that can be used to estimate the sampling distribution of any statistic!
Here is the bootstrap function that I wrote to estimate the sampling distribution of the mean, which will then be used to create a confidence interval.
# Function for bootstrap confidence interval
# for one sample mean
bootStrapCI1<-function(data, nsim){
n<-length(data)
bootCI<-c()
for(i in 1:nsim){
bootSamp<-sample(1:n, n, replace=TRUE)
thisXbar<-mean(data[bootSamp])
bootCI<-c(bootCI, thisXbar)
}
return(bootCI)
}
This is a function that take in a data set and the number of simulations that we want to run (this should be at least 1000).
There are a couple different ways we can calculate the confidence interval:
oilBootCI<-bootStrapCI1(oil$barrels, nsim=1000)
# Quantile Method
quantile(oilBootCI, c(0.025, 0.975))
## 2.5% 97.5%
## 39.45105 58.52301
# 2.5% 97.5%
# 39.45738 59.21430
# Hybrid Method (T-critical with Bootstrap SE)
se<-sd(oilBootCI)
mean(oil$barrels)+c(-1,1)*qt(0.975, df=n-1)*se
## [1] 38.36360 58.13327
# 38.22158 58.27530
When comparing these methods, we found that the confidence intervals weren’t very different. This is because we do have a large enough sample size, despite the lack of normality.
Let’s try a different example where we have a smaller sample size and a non-normal distribution.
To investigate water quality, the Columbus Dispatch took water samples at 16 Ohio State Park swimming areas in central Ohio. Those samples were taken to laboratories and tested for E.coli, which are bacteria that can cause serious gastrointestinal problems. If a 100-milliliter sample (about 3.3 ounces) of water contains more than 130 E. coli bacteria, it is considered unsafe. Here are summary statistics for the E. coli levels found by the laboratories:
Take these water samples to be an SRS of the water in all swimming areas in central Ohio.
## Ex 2: Ecoli
ecoli<-c(291.0, 190.4, 47.0, 86.0, 44.0, 18.9, 1.0, 50.0,
10.9, 45.7, 28.5, 8.6, 9.6, 16.0, 34.0, 18.9)
# By hand
xbar<-mean(ecoli) #56.28125
sd<-sd(ecoli) #77.28992
n<-length(ecoli) #16
se<-sd/sqrt(n) #19.32248
mu_0<-130 # One-sided upper alternative
test_stat<-(xbar-mu_0)/se
test_stat #-3.81518
## [1] -3.81518
pt(test_stat, df=n-1, lower.tail=F)
## [1] 0.999155
# 0.999155
# fail to rejec the null, no evidence for alt
hist(ecoli)
# Bootstrap confidence interval for the mean
ecoliBootCI<-bootStrapCI1(ecoli, nsim=10000)
# Approximated sampling distribution
hist(ecoliBootCI)
# Quantile Method
quantile(ecoliBootCI, c(0.025, 0.975))
## 2.5% 97.5%
## 25.80609 96.72609
# 2.5% 97.5%
# 25.22484 97.02016
# Hybrid Method (T-critical with Bootstrap SE)
se<-sd(ecoliBootCI)
mean(ecoli)+c(-1,1)*qt(0.975, df=n-1)*se
## [1] 16.58188 95.98062
# 18.9828 93.5797
First, let’s consider a typical t-test approach.
# Using t.test
t.test(ecoli, mu=mu_0, alternative="greater")
##
## One Sample t-test
##
## data: ecoli
## t = -3.8152, df = 15, p-value = 0.9992
## alternative hypothesis: true mean is greater than 130
## 95 percent confidence interval:
## 22.40797 Inf
## sample estimates:
## mean of x
## 56.28125
Here is the bootstrap function that I wrote to perform a bootstrap hypothesis test for the mean:
# Function for bootstrap hypothesis test
# for one sample mean
bootStrapHyp1<-function(data, mu0, nsim, alternative){
testStat<-mean(data)
nullDat<-data-mean(data)+mu0
n<-length(data)
bootNull<-c()
for(i in 1:nsim){
bootSamp<-sample(1:n, n, replace=TRUE)
thisXbar<-mean(nullDat[bootSamp])
bootNull<-c(bootNull, thisXbar)
}
if(alternative=="greater"){
pval<-mean(bootNull>=testStat)
}
if(alternative=="less"){
pval<-mean(bootNull<=testStat)
}
if(alternative=="not equal"){
pval<-min(mean(bootNull>=testStat), mean(bootNull<=testStat))*2
}
return(pval)
}
ecoliBootHyp<-bootStrapHyp1(ecoli, mu0=130,
nsim=10000, alternative = "greater")
ecoliBootHyp
## [1] 1
What if we want to make inference about the median:
## Bootstrap for median
nsim<-1000
bootMed<-c()
for(i in 1:nsim){
bootSamp<-sample(1:n, n, replace=TRUE)
thisMed<-quantile(ecoli[bootSamp], 0.5)
bootMed<-c(bootMed, thisMed)
}
hist(bootMed)
quantile(bootMed, c(0.025, 0.975))
## 2.5% 97.5%
## 15.9725 47.0000
What if we want to make inference about the standard deviation:
## Bootstrap for standard deviation
bootSd<-c()
for(i in 1:nsim){
bootSamp<-sample(1:n, n, replace=TRUE)
thisSd<-sd(ecoli[bootSamp])
bootSd<-c(bootSd, thisSd)
}
hist(bootSd)
quantile(bootSd, c(0.025, 0.975))
## 2.5% 97.5%
## 17.01372 112.03663