Given a population (e.g. the heights of all high school students in Taiwan), we want to know the statistics of this population (e.g. the mean and variance). To do so, we often take a sample from the population and calculate the statistics of that sample. By assuming a distribution for these sample statistics, we can make inferences about the population statistics. For example, if we find that the sample we took from the population follows a normal distribution, we can argue that the sample mean also follows a normal distribution centered on the population mean, and accordingly we can make inferences about the population mean.
However, the sample distribution is often hard to characterize (which is why we make assumptions), and so is the population distribution. We can instead use the available sample to estimate the empirical cumulative distribution function (CDF). From this empirical CDF, we resample many times to estimate the sampling distribution of any statistic of interest; drawing these repeated samples is called the resampling procedure. From each resample we calculate the statistic of interest, and the distribution constructed from these many resamples is the “sampling distribution of the statistic” of interest. According to the Central Limit Theorem, the sampling distribution of a statistic such as the mean should be approximately normal.
*Note that this resampling procedure does NOT give us additional information beyond the sample (data)! It is just another way to estimate the sampling distribution of the statistic of interest.
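As a small, self-contained illustration (the vector x here is just a placeholder, not part of the examples below): sampling with replacement from a data vector is the same as drawing from its empirical CDF, which is what the resampling procedure implicitly does.
x = rnorm(50, mean=20, sd=3)                        # any observed sample
Fn = ecdf(x)                                        # empirical CDF estimated from the data
plot(Fn, main="empirical CDF of x")
x.resamp = sample(x, size=length(x), replace=TRUE)  # one resample; equivalent to drawing from Fn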
In short, the resampling method has the following advantages:
1. Fewer assumptions.
2. Greater accuracy.
3. Generality.
General resampling procedure:
1. Create many samples with (seldom without) replacement.
2. Calculate the statistic of interest (e.g. mean, variance) on each sample.
3. The distribution of these statistics approximates the sampling distribution of the statistic of interest.
Let’s try to estimate the distributions of the sample mean and variance.
First we generate a fake population consisting of 1032 measurements. These measurements are randomly drawn from a normal distribution with mean 20 and variance 9 (standard deviation 3).
set.seed(2375)
pop = rnorm(1032, mean=20, sd=3)
The true population mean is 20.1247532 and the true population variance is 8.4360679.
Say we want to estimate the mean and variance of this population, so we sample 32 measurements from it. The sample() function can be used to do this.
set.seed(5687)
samp = sample(pop, size=32, replace=FALSE)
We can calculate the sample mean (20.3169353) and variance (6.3085484). However, we have no clue about the distribution of the mean and variance of this sample, because the sample does not follow a normal distribution (try the shapiro.test()). So we cannot use a t-test to examine whether this sample mean is significantly different from a certain value (e.g. the true population mean, 20).
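For example, a minimal check using the samp vector above (the exact p-value depends on the sample):
shapiro.test(samp)  # a small p-value (e.g. < 0.05) indicates a departure from normality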
Fortunately, we can use the resampling method to construct the distributions of the sample mean and variance based on the empirical sample distribution.
Again we use the sample() function to do the resampling.
resamps = replicate(9999, sample(samp, size=length(samp), replace=TRUE), simplify=FALSE)  # 9999 resamples with replacement
resamps[[10000]] = samp                             # include the original sample as the 10000th resample
resamp.mean = sapply(resamps, mean, simplify=TRUE)  # mean of each resample
resamp.var = sapply(resamps, var, simplify=TRUE)    # variance of each resample
bt.mean = mean(resamp.mean)                         # bootstrap mean
bt.var = mean(resamp.var)                           # bootstrap variance
The mean of the 10000 resampled means is 20.3172611 and the mean of the 10000 resampled variances is 6.1105372. This particular resampling method is called the non-parametric bootstrap, so we call these two quantities the bootstrap mean and the bootstrap variance.
We can see that the bootstrap mean and bootstrap variance are really close to the sample mean (20.3169353) and sample variance (6.3085484). We can also plot the histogram to see how close they are.
hist(samp, breaks = 20, prob = TRUE, col=rgb(0,1,0,0.3), ylim=c(0, 1), main="density plot")
abline(v=mean(samp), col="red", lwd=3)
hist(resamp.mean, breaks = 40, prob = TRUE, col=rgb(0,0,1,0.3), add=TRUE, ylim=c(0, 1))
lines(density(resamp.mean))
legend("topleft",
legend = c("probability density of empirical sample", "probability density of bootstrap mean"),
text.col=c("green", "blue"),
bty="n")
From this density plot we see that the sample mean (red vertical line) is really close to the peak of the distribution of the 10000 means we calculated from resampling.
hist(samp, breaks = 20, prob = TRUE, col=rgb(0,1,0,0.3), ylim=c(0, 0.4), main="density plot")
lines(density(samp), col="blue", lwd=2)
curve(dnorm(x, mean=bt.mean, sd=sqrt(bt.var)), min(samp), max(samp), add=TRUE, col="red", lwd=2)
legend("topleft",
legend = c("probability density of empirical sample",
"probability density of normal distribution with \n bootstrap mean and variance"),
col=c("blue", "red"),
text.col=c("blue", "red"),
lty=c("solid"),
bty="n")
From this density plot we see that the spread of empirical probability (blue) is also pretty similar to the spread of a normal distribution with the bootstrap mean and bootstrap variance. Note that I’m NOT saying that the two distributions are similar! In fact, they are not the same because the red one is a normal distribution and the empirical one is not.
Exercise 1
Plot the histograms of the 10000 resampled means and variances to reassure yourself that these means and variances are approximately normally distributed.
In addition, we can calculate standard errors from these resampled means and variances. Recall from the t-test session that I showed you the standard error can be calculated by resampling. The standard error of the bootstrap mean (or variance) is simply the standard deviation of all the resampled means (or variances).
bt.mean.SE = sd(resamp.mean)
bt.var.SE = sd(resamp.var)
So the standard error of the bootstrap mean is 0.4385845 and the standard error of the bootstrap variance is 1.8941935.
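As a sanity check (this comparison is an addition for illustration, not part of the original analysis), the bootstrap standard error of the mean should be close to the textbook formula sd/sqrt(n):
bt.mean.SE                   # bootstrap standard error of the mean
sd(samp)/sqrt(length(samp))  # classical standard error of the mean; the two should be similar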
With the distributions of these estimated statistics (in this case, mean and variance), we can make inferences (e.g. t-test, ANOVA) about the sample mean and variance, even though the sample itself is not normally distributed! (yes!)
Another important benefit of the resampling method is that it makes calculating a confidence interval (CI) convenient and intuitive. That is to say, resampling is a valid procedure for calculating a CI.
A confidence interval contains information about the precision of a point estimate, and we can use the CI to make inferences about that estimate. However, when making statements about a CI or interpreting one, two things deserve extra attention:
1. Which model parameter is the CI calculated for?
2. What procedure is used to calculate the CI?
First, a good model (and its parameters) is one that is similar to the true model that generated the data. Of course, we seldom have a clue about what the true model is, so we use various methods to find a “good” model. For example, maximum likelihood is one popular method for finding a good model.
Second, a valid procedure for calculating an X% CI is one where, if we have a good model, repeat the experiment many times, and calculate a CI each time, X% of those CIs cover the true parameter value. In other words, if the model is good, there is an X% chance that the CI covers the true parameter value. Note that this does not mean an X% CI always covers the true parameter value with X% probability; it does not if the model is wrong to begin with!
The interpretation of an X% CI should therefore be: if the model is good and the procedure used to calculate the CI is valid, the CI covers the true parameter value with X% probability.
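To make this coverage idea concrete, here is a minimal simulation sketch (the normal population and the t-based CI are assumptions chosen only for illustration): if the model is good and the procedure is valid, about 95% of the intervals should cover the true mean.
set.seed(1)
true.mean = 20
covered = replicate(1000, {
  x = rnorm(30, mean=true.mean, sd=3)      # repeat the "experiment"
  ci = t.test(x)$conf.int                  # a valid 95% CI procedure under a good (normal) model
  ci[1] <= true.mean & true.mean <= ci[2]  # does this CI cover the true mean?
})
mean(covered)                              # should be close to 0.95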
Let’s take a verbal example. We want to know the distribution of the heights of all Taiwanese high school students, so we do a survey and calculate the mean and variance of the survey data. We might find that the survey data follow a normal distribution, so we assume the heights of all Taiwanese high school students follow a normal distribution as well. Now, the model is a normal distribution (we want to use a normal distribution to describe the data) and the parameters to be estimated are the mean and variance. Then, based on the features of the normal distribution and the assumptions we made, we can use the standard errors of the model parameters to calculate their confidence intervals (see here if you forgot why). Using the standard error of the estimated parameter value is the procedure.
However, if the survey data do not follow a normal distribution, we can NOT use the standard error to calculate the confidence interval of the mean. We will have to either figure out what distribution the survey data follow (which is often pretty challenging), or use the resampling method! Note that, even when using resampling, we still have to justify whether the normal distribution is a good model for describing the heights.
After understanding the concept (it is very important!) of the confidence interval, let’s actually calculate it.
To calculate the 95% CI of the bootstrap mean, the simplest way is to order all 10000 resampled means and take the values at the 2.5% and 97.5% quantiles.
CI95.low = sort(resamp.mean)[length(resamp.mean)*0.025]
CI95.hi = sort(resamp.mean)[length(resamp.mean)*0.975]
The upper and lower bounds of the 95% confidence interval are 21.1262455 and 19.4161676.
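Equivalently (just a shortcut; the values may differ very slightly because quantile() interpolates), the quantile() function gives the same percentile interval:
quantile(resamp.mean, probs=c(0.025, 0.975))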
However, if the bootstrap were perfect, the sample mean of the data should sit at exactly the 50% quantile of all the resampled means. But is that the case here?
which(sort(resamp.mean) == mean(samp))/length(resamp.mean)
## [1] 0.4857
We see that it is really close, but not exactly 0.5. Since this is the case, we have to do some bias correction.
The basic concept of bias correction is to “move” the whole distribution of resampled means so that the data sample mean sits at the 50% quantile.
With the code below, we can calculate the bias-corrected CI.
BC.CI = function(bt, theta.hat){
  n = length(bt)                            # total number of bootstrap estimates
  G.theta = which(sort(bt) == theta.hat)/n  # quantile position of the sample estimate within all bootstrap estimates
  if(G.theta == 0.5){                       # sample estimate is exactly at the 50% quantile: no bias correction needed
    low = sort(bt)[n*0.025]
    high = sort(bt)[n*0.975]
  } else {
    z0 = qnorm(G.theta)                     # Z0: the Z score of the sample estimate's quantile under the standard normal (mean=0, sd=1)
    z.low = 2*z0 - 1.96                     # Z value for the lower limit
    z.hi = 2*z0 + 1.96                      # Z value for the upper limit
    low = sort(bt)[round(pnorm(z.low)*n)]   # map Z-lower back to the corresponding position in the bootstrap distribution
    high = sort(bt)[round(pnorm(z.hi)*n)]   # map Z-upper back to the corresponding position in the bootstrap distribution
  }
  return(rbind(low, high))                  # return CI: [1]=lower, [2]=upper
}
BC.CI(resamp.mean, mean(samp))
## [,1]
## low 19.38344
## high 21.10752
There are other bias correction methods, such as the accelerated bias correction (BCa). That is more advanced and will not be covered here, but again, you are welcome to explore it!
So far, we have used some hypothetical examples to demonstrate how to do resampling and how it can be used to calculate CIs and make inferences. Let’s take a look at two practical examples.
The first one is basically the t-test. With the resampling method, we don’t have to assume that the samples are normally distributed.
The idea is that, if the measurements in two groups come from the same population, the expected difference between the two group means should be zero. We therefore calculate the expected difference between the two group means and its confidence interval. If the confidence interval covers zero, then we cannot reject the hypothesis that the two groups of measurements come from the same population. This is also called a permutation test.
To construct the confidence interval of the expected difference, we could shuffle the measurements so that measurements originally in one group will not necessarily stay in the same group. For each shuffle, the difference between the two group means is calculated. This builds up a distribution, and thus a confidence interval, of the group mean differences.
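A minimal sketch of this shuffling idea, with self-contained toy data (the group sizes and values here are placeholders; the actual example with A and B follows below):
g1 = rnorm(10, mean=4)                 # toy group 1
g2 = rnorm(10, mean=4.5)               # toy group 2
pooled = c(g1, g2)
shuffle.dif = replicate(9999, {
  idx = sample(length(pooled))         # shuffle group membership
  mean(pooled[idx[1:10]]) - mean(pooled[idx[11:20]])
})
quantile(shuffle.dif, c(0.025, 0.975)) # spread of the group mean difference under shuffling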
If you still remember, I used this method in the t-test session. We now use the same hypothetical data to demonstrate how to use the resampling method to do a hypothesis test.
muA=4
muB=4.5
sdA=1
sdB=1
set.seed(32)
A = rnorm(30, mean=muA, sd=sdA)
B = rnorm(30, mean=muB, sd=sdB)
hist(A,breaks=30,col=rgb(1,0,0,0.5),xlim=c(0,10), ylim=c(0,10), xlab="Value", main="histogram of sample A and B")
hist(B,breaks=30,col=rgb(0,1,0,0.5),add=TRUE)
box()
Following the same syntax, I calculate 10000 group mean differences and plot the histogram of these group mean differences.
dif = replicate(9999, mean(sample(A, length(A), replace=TRUE) - sample(B, length(B), replace=TRUE)), simplify=TRUE)
dif[[10000]] = mean(A) - mean(B)
hist(dif, breaks=30, xlab="group mean difference", main="")
Exercise 2
With the same hypothetical data A and B, use a for loop to construct the distribution of the group mean differences.
What is the 95% confidence interval of the group mean differences?
dif.e = mean(dif)
l = sort(dif)[10000*0.025]
h = sort(dif)[10000*0.975]
The expected group difference is -0.11 and the 95% confidence interval of this expected group mean difference is -0.504 to 0.271. Because this interval covers zero, we cannot reject the hypothesis that A and B come from the same population.
Exercise 3
Calculate the bias-corrected 95% confidence interval of the group mean differences using the BC.CI function I wrote for you.
Challenge yourself
How to do resampling depends on the question and the structure of the data. Notice that I did not keep data A and B in pairs when doing the resampling, which is equivalent to the non-paired t-test. You can certainly keep the paired structure when resampling (a sketch of the paired version follows below). Whether you should keep the paired structure is up to you.
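For instance, a minimal sketch of the paired version (my addition; it keeps A[i] matched with B[i] by resampling pair indices instead of each group separately):
dif.paired = replicate(10000, {
  i = sample(length(A), replace=TRUE)  # resample pair indices, keeping the paired structure
  mean(A[i] - B[i])
})
quantile(dif.paired, c(0.025, 0.975))  # CI of the mean paired difference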
Similar to the idea of hypothesis testing, we can also use the resampling method to make inferences about a point estimate. For example, we can use resampling to see whether a regression coefficient differs significantly from zero.
One way to construct the distribution of the regression coefficient is to first shuffle the independent variable so that the relationship between the independent and dependent variables becomes random. With the shuffled data, the regression coefficient is then calculated. By repeating this random shuffling many times, the distribution of the regression coefficient under randomness can be constructed. Finally, by finding where the observed regression coefficient sits in this distribution, we know the probability of observing such a coefficient if the relationship between the independent and dependent variables were random (the p-value).
Let’s use the New York air quality data (airquality) we have seen before to demonstrate.
mod = lm(Temp~Wind, data=airquality)  # regression on the original data
coef = c()
air.f = airquality
for (i in 1:9999){
  id.f = sample(seq(from=1, to=nrow(airquality), by=1), nrow(airquality), replace=TRUE)  # resample row indices
  air.f[, "Wind"] = airquality[id.f, "Wind"]         # shuffle Wind so its relationship with Temp becomes random
  mod.f = lm(Temp~Wind, data=air.f)                  # refit the regression on the shuffled data
  coef = c(coef, summary(mod.f)$coefficients[2,1])   # store the slope of Wind
}
coef[10000] = summary(mod)$coefficients[2,1]         # include the observed slope as the 10000th value
Here I locate where the observed regression coefficient is in the distribution of the 10000 regression coefficients.
which(sort(coef) == summary(mod)$coefficients[2,1])
## [1] 1
This means the probability of observing a regression coefficient this extreme under random shuffling is at most 0.0001 (1 in 10000). We can also visualize it.
hist(coef, breaks=30, xlab="regression coefficient", main="")
abline(v=coef[10000], col="red")
legend("topright", legend = c("estimated \n regression \n coefficient"),
col=c("red"),
text.col=c("red"),
lty=c("solid"), bty="n")
Note that using the resampling method to calculate the confidence interval of a regression coefficient does NOT mean we don’t have to do model checking! We still have to be aware of (a quick check is sketched after this list):
- residual independence
- residual homoscedasticity
- residual normality
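A quick way to inspect these assumptions (standard base-R diagnostics for lm objects, not something specific to resampling):
par(mfrow=c(2,2))
plot(mod)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow=c(1,1))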