We have a population that is normally distributed with mean 36 and standard deviation 8.
The sampling distribution of Xbar for samples of size 200 will have what mean, standard error, and shape?
First, since the population is normal with mean 36, the sample mean will also be normally distributed with mean 36. The standard error of the sample mean is the population standard deviation divided by the square root of the sample size: 8/sqrt(200), or about 0.5656854.
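We can double-check that value in R:
8 / sqrt(200)  # population SD divided by sqrt(n)
## [1] 0.5656854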
To draw a random sample of size 200 from our population, we use the rnorm command. Set the seed for reproducibility.
set.seed(1234)
mysamp <- rnorm(200, mean = 36, sd = 8)
EDA could include a histogram, normal quantile plots, and summary statistics.
hist(mysamp)
The histogram shows the sample is approximately normal (fairly symmetric and approximately bell-shaped) and centered pretty close to 36, which is the population mean and the mean of the sampling distribution for the sample mean.
qqnorm(mysamp)
qqline(mysamp)
The normal quantile (QQ) plot also shows the data are approximately normal: the dots follow the line pretty closely, meaning the quantiles from the sample line up well with the quantiles from a normal distribution.
mean(mysamp)
## [1] 35.53793
sd(mysamp)
## [1] 8.165329
Finally, we have some summary statistics. The mean of the sample is 35.5379258 and the standard deviation of the sample is 8.1653294. These are pretty close to the true mean (36) and the population standard deviation (8).
It’s time to bootstrap.
m <- 10^4               # number of bootstrap resamples
xbars <- rep(-1234, m)  # placeholder vector to hold the bootstrap means
n <- 200                # each resample is the same size as the original sample
set.seed(1234)
for(i in 1:m){
  resamp <- sample(mysamp, size = n, replace = TRUE)  # resample WITH replacement
  xbars[i] <- mean(resamp)                            # store that resample's mean
}
Then the bootstrap mean and standard error are:
mean(xbars)
## [1] 35.54354
sd(xbars)
## [1] 0.5793607
The mean of the bootstrap distribution is 35.5435393, which is pretty close to the original sample’s mean of 35.5379258. (Remember that the bootstrap distribution is centered around the mean of the original sample!)
The standard error of the bootstrap distribution, 0.5793607, is pretty close to the standard error of the sampling distribution, which is 0.5656854. We could improve this estimate by upping our number of resamples (m).
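If you want to experiment with a larger m, one option (just a sketch; boot_mean is a helper name I’m making up here, not something from the text) is to wrap the loop in a small function:
# Bootstrap the mean of a vector x using m resamples
boot_mean <- function(x, m = 10^4) {
  replicate(m, mean(sample(x, size = length(x), replace = TRUE)))
}
xbars_big <- boot_mean(mysamp, m = 10^5)  # more resamples
sd(xbars_big)  # less Monte Carlo noise in the SE estimate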
First, I set up the skeleton of the table.
mytab <- matrix(data = NA, nrow = 4, ncol = 2)
colnames(mytab) <- c("Mean", "Standard Deviation")
rownames(mytab) <- c("Population", "Sampling Distribution of Xbar", "Sample", "Bootstrap Distribution")
mytab
## Mean Standard Deviation
## Population NA NA
## Sampling Distribution of Xbar NA NA
## Sample NA NA
## Bootstrap Distribution NA NA
Next, I fill it in.
mytab[1,] <- c(36, 8)
mytab[2,] <- c(36, 8/sqrt(200))
mytab[3,1] <- mean(mysamp)
mytab[3,2] <- sd(mysamp)
mytab[4,1] <- mean(xbars)
mytab[4,2] <- sd(xbars)
Et voila:
mytab
## Mean Standard Deviation
## Population 36.00000 8.0000000
## Sampling Distribution of Xbar 36.00000 0.5656854
## Sample 35.53793 8.1653294
## Bootstrap Distribution 35.54354 0.5793607
In particular, note that the top two means are the same and the bottom two means are close to each other. Remember, the bootstrap distribution is centered at the same place as the sample.
You will see that as the sample size n decreases from 200 to 50 (and then to 10), the standard error will increase. The standard error of the mean will be approximately 1.1313708 with a sample size of 50 and approximately 2.5298221 with a sample size of 10.
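You can compute all three standard errors in one line:
8 / sqrt(c(200, 50, 10))  # sigma / sqrt(n) for n = 200, 50, 10
## [1] 0.5656854 1.1313708 2.5298221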
library(resampledata)
##
## Attaching package: 'resampledata'
## The following object is masked from 'package:datasets':
##
## Titanic
data(Bangladesh)
summary(Bangladesh$Chlorine)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 5.00 14.20 78.08 55.50 1550.00 2
We see we have two NAs, so let’s go ahead and create a data set without those two cases. There are a few ways to do this. To omit just the NAs within the Chlorine variable:
completeonly <- Bangladesh[-which(is.na(Bangladesh$Chlorine)),]
dim(completeonly)
## [1] 269 3
There are other commands that will remove rows with an NA in ANY of the variables (a couple are sketched below), but since we’re just interested in Chlorine, it is probably wiser to throw away only the observations with missing Chlorine measurements.
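Just for reference (the object names here are my own, not part of the original analysis), two such commands are na.omit() and complete.cases():
# These drop rows with an NA in ANY column, not just Chlorine
allcomplete1 <- na.omit(Bangladesh)
allcomplete2 <- Bangladesh[complete.cases(Bangladesh), ]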
Now do some EDA.
summary(completeonly$Chlorine)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 14.20 78.08 55.50 1550.00
sd(completeonly$Chlorine)
## [1] 210.0192
Also, look at the distribution with a histogram:
hist(completeonly$Chlorine)
We can see there are some INTENSE outliers. The most extreme one is so far away that we can’t see any detail in the rest of the data. If we want to check out a histogram without the Chlorine values over 500, we could subset the data first (one of several ways to do it):
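# Histogram of wells with Chlorine of 500 or less (drops the extreme wells)
hist(completeonly$Chlorine[completeonly$Chlorine <= 500])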
We may also be wondering how many wells have Chlorine levels over say 300. Let’s find out.
sum(completeonly$Chlorine > 300)
## [1] 18
Let’s bootstrap the mean.
m <- 10^4
xbars <- rep(-100, m)
n <- nrow(completeonly)
for(i in 1:m){
  resamp <- sample(completeonly$Chlorine, n, replace = TRUE)
  xbars[i] <- mean(resamp)
}
hist(xbars)
quantile(xbars, c(.025, .975))
## 2.5% 97.5%
## 54.76053 104.58017
A 95% confidence interval for the mean is from 54.7605297 to 104.5801673. In other words, we are 95% confident that the true mean chlorine level for wells in Bangladesh is between about 54.76 and 104.58.
The bootstrap estimate of the bias (defined on page 131 of the Chihara and Hesterberg text) is the mean of the bootstrap distribution minus the mean of the original sample:
mean(xbars) - mean(completeonly$Chlorine)
## [1] -0.314458
This value is negative, meaning that the bootstrap distribution for the sample mean tends to underestimate the sample mean.
Why does this happen? Remember we had that weird well with a huge Chlorine measurement? When we calculated the mean for the original sample, we included that well. When we create our bootstrap resamples, some of those resamples contain the highly-chlorinated well and others do not. Any bootstrap resample WITHOUT the highly-chlorinated well will tend to have a lower mean.
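We can even put a number on how often that happens: the chance that one particular well is left out of a single bootstrap resample of size n is (1 - 1/n)^n, which is roughly e^(-1), or about 0.37. So roughly 37% of the resamples never see the highly-chlorinated well at all.
(1 - 1/n)^n   # n is still nrow(completeonly) from the bootstrap above
exp(-1)       # the large-n approximation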
Next, what fraction of the bootstrap standard error does our bootstrap estimate of the bias represent? Let’s take our bootstrap estimate of the bias and divide it by our bootstrap standard error.
(mean(xbars) - mean(completeonly$Chlorine))/sd(xbars)
## [1] -0.02455708
About 2.5% of the bootstrap standard error. According to Chihara and Hesterberg, a ratio beyond plus or minus 2% is “large enough to potentially have a substantial effect on the accuracy of the confidence intervals.” If you land in this type of situation, you can create better bootstrap confidence intervals than these quantile-based CIs.
If you do more with statistics, you will probably hear about the bias-variance trade-off. To put it simply, a biased estimator is not great, but an estimator with high variance is not great either. Sometimes you can inject a little bias into your estimator and that helps the variance decrease by quite a bit. Therefore, a little bias can be a small price to pay sometimes for lower variance.
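To make that concrete with the chlorine data (purely an illustration, not part of the original analysis): a 10% trimmed mean, mean(x, trim = 0.1), ignores the most extreme 10% of observations on each end. That makes it a biased estimator of the population mean for skewed data like this, but its bootstrap standard error should come out much smaller than the plain mean’s.
# Compare bootstrap SEs: plain mean vs. 10% trimmed mean (illustration only)
set.seed(1234)
chl <- completeonly$Chlorine
boot_plain <- replicate(10^4, mean(sample(chl, length(chl), replace = TRUE)))
boot_trim  <- replicate(10^4, mean(sample(chl, length(chl), replace = TRUE), trim = 0.1))
sd(boot_plain)   # large, driven by the extreme wells
sd(boot_trim)    # much smaller: a little bias bought a lot less variance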
The boxplot below shows we have one intense outlier. In other words, one fish has a scary amount of mercury in it.
data("FishMercury")
hist(FishMercury$Mercury)
boxplot(FishMercury$Mercury)
Bootstrap the mean.
set.seed(1234)
m <- 10^4
xbars <- rep(-10, m)
n <- nrow(FishMercury)
for(i in 1:m){
  xbars[i] <- mean(sample(FishMercury$Mercury, n, replace = TRUE))
}
hist(xbars)
The bootstrap standard error is 0.0594361:
sd(xbars)
## [1] 0.05943611
The 95% bootstrap percentile interval for the mean mercury level (in parts per million) is:
quantile(xbars, c(.025,.975))
## 2.5% 97.5%
## 0.1122000 0.3096025
In other words, we are 95% confident that the mean mercury level in a fish from a Minnesota lake is between 0.1122 and 0.3096.
Remove the outlier and bootstrap the mean of the remaining data.
newdat <- subset(FishMercury, Mercury < max(FishMercury$Mercury))
dim(newdat)
## [1] 29 1
set.seed(1234)
m <- 10^4
xbars <- rep(-10, m)
n <- nrow(newdat)
for(i in 1:m){
  xbars[i] <- mean(sample(newdat$Mercury, n, replace = TRUE))
}
hist(xbars)
After omitting the outlier, the 95% bootstrap percentile interval for the mean mercury level (in parts per million) is:
quantile(xbars, c(.025,.975))
## 2.5% 97.5%
## 0.1082405 0.1387241
In other words, we are 95% confident that the mean mercury level in a fish from a Minnesota lake is between 0.1082 and 0.1387.
The bootstrap standard error is 0.0078384.
What happens when we remove the fish with the ridiculously high mercury level?
First, notice that the new bootstrap standard error (about 0.0078) is dramatically lower than the bootstrap standard error calculated with all 30 fish (about 0.0594), roughly 7.5 times smaller.
Also, check out the two bootstrap distributions. They have totally different shapes. The bootstrap distribution created without the outlier looks MUCH more normal. The bootstrap distribution created with the outlier is skewed (it has a tail on the right because the outlier fish pulls up the means of the resamples that include it).
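One way to compare them side by side (just a sketch: xbars_all and xbars_noout are names I’m introducing, and you would need to save the two sets of bootstrap means separately instead of overwriting xbars):
# Assumes the bootstrap means were stored as xbars_all (all 30 fish)
# and xbars_noout (outlier removed) rather than reusing xbars
par(mfrow = c(1, 2))
hist(xbars_all, main = "With outlier")
hist(xbars_noout, main = "Without outlier")
par(mfrow = c(1, 1))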