C. Donovan
We have conducted tests and created confidence intervals under distributional assumptions
When these are badly violated, our inference is unreliable
We can look to non-parametric variants (say, based on ranks); however, these do not exist for all problems (and they answer slightly different questions)
Here we look at very general purpose approaches to inference - effectively statistical simulations
In short, we will resample our sample for two purposes: to simulate sampling under a Null hypothesis (for hypothesis tests) and to simulate the sampling distribution of an estimator (for confidence intervals).
We could revert to non-parametric tests, which relax the distributional assumptions, but cheap computing makes possible a very general computationally intensive approach to most problems. Consider the simple case of a two independent sample \( t \)-test:
In short, under the Null Hypothesis our data have arbitrary group labels A or B - they are assigned by chance.
To simulate, we need only randomly assign labels to our observations - this simulates the sampling process under the Null hypothesis (meaning \( H_0 \) is true)
We can see whether the sample we observed is unusual if the Null hypothesis were true, but without assuming a specific distribution.
In this course we have considered 3 main applications of the \( t \)-test (ignoring its use in regression): two independent samples, a single sample, and paired samples.
We consider computer intensive variants of these in turn.
\( H_0 \): \( \mu_1=\mu_2 \), i.e. \( \mu_1-\mu_2=0 \), or in plainer language: the means of the two populations we have sampled are equal (alternatively, the samples are just drawn from the same population)
\( H_A \): \( \mu_1\ne\mu_2 \), i.e. \( \mu_1-\mu_2\ne 0 \), or in plainer language: the means of the two populations we have sampled are not equal (the samples are drawn from statistically distinct populations).
To devise a computer intensive approach we wish to simulate samples drawn assuming the Null hypothesis is true.
One way to approach this is:
set.seed(3245)
# simulate data where there is a difference between groups
GP1 <- data.frame(Group = "GP1", response = rnorm(100, 10, 3))
GP2 <- data.frame(Group = "GP2", response = rnorm(100, 20, 3))
# we'll need this later
estimatedDiff <- mean(GP1$response) - mean(GP2$response)
simData <- rbind(GP1, GP2)
# the parametric tests
t.test(response ~ Group, data = simData)
Welch Two Sample t-test
data: response by Group
t = -22.714, df = 189.32, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.827362 -9.097018
sample estimates:
mean in group GP1 mean in group GP2
10.12634 20.08853
set.seed(3245)
sampData <- simData
# something to store our results
simResults <- numeric(999)
for(i in 1:999){
  # randomly arrange the group labels WRT y
  sampData$Group <- sample(simData$Group, 200, replace = F)
  # calculate and store the difference (under H0)
  means <- by(sampData$response, sampData$Group, mean)
  simResults[i] <- means[1] - means[2]
}
hist(simResults, col = "slateblue4", xlim = c(-12, 12))
abline(v = estimatedDiff, lwd = 3)
You'll observe that what we saw in our data is extreme if \( H_0 \) is true
# place the observed difference among the simulated null differences
addEst <- c(estimatedDiff, simResults)
locEst <- c(1, rep(0, 999))
# sort, and find the rank of the observed value within the null distribution
locEst <- locEst[order(addEst)]
k <- which(locEst == 1)
# the proportion of the null distribution beyond the observed value
min(k/1000, 1-k/1000)
[1] 0.001
Consider the single sample \( t \)-test:
Simulate samples assuming the Null hypothesis is true. One way to approach this is:
NB Equally, you could sample about your estimate and see where the \( H_0 \) value lies (but our narrative is sampling under \( H_0 \))
set.seed(3245)
# simulated data where the mean is 2, plus a lot of noise
GP1 <- data.frame(Group = "GP1", response = rnorm(100, 2, 10))
# we'll need this later
estimatedMean <- mean(GP1$response)
# the parametric test of H0 = 0
t.test(GP1$response, mu = 0)
One Sample t-test
data: GP1$response
t = 2.6419, df = 99, p-value = 0.009582
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.6027571 4.2395348
sample estimates:
mean of x
2.421146
set.seed(3245)
# to store our simulations
simResults <- numeric(999)
for(i in 1:999){
  # bootstrap the sample mean (resample with replacement) to build its sampling distribution
  simResults[i] <- mean(sample(GP1$response, 100, replace = T))
}
# if H0 is true, the sampling distribution is centred on zero, so shift it accordingly
simResults <- simResults - estimatedMean
hist(simResults, col = "slateblue4")
abline(v = estimatedMean, lwd = 3)
So what we saw in our data is a little extreme if \( H_0 \) is true (so we reject \( H_0 \))
addEst <- c(estimatedMean, simResults)
locEst <- c(1, rep(0, 999))
locEst <- locEst[order(addEst)]
k <- which(locEst == 1)
min(k/1000, 1-k/1000)*2
[1] 0.01
This is just an extension of the previous examples: we work with the paired differences, and randomising the group labels within each pair amounts to flipping the sign of each difference at random - see the sketch below.
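A minimal sketch, assuming hypothetical before/after data (simulated here purely for illustration, not from the course examples):
set.seed(3245)
# hypothetical paired data: a before and after measurement per subject
before <- rnorm(50, 10, 3)
after <- before + rnorm(50, 1, 2)
diffs <- after - before
obsDiff <- mean(diffs)
# to store our results
simResults <- numeric(999)
for(i in 1:999){
  # under H0 each difference is equally likely to be + or -, so flip signs at random
  signs <- sample(c(-1, 1), length(diffs), replace = TRUE)
  simResults[i] <- mean(signs*diffs)
}
hist(simResults, col = "slateblue4")
abline(v = obsDiff, lwd = 3)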
Now consider a test for a slope parameter in a \( x\rightarrow y \) regression (or similarly a correlation).
We are seeking to test whether there is a linear association between \( x \) and \( y \) - how do we simulate \( H_0 \)?
Some simulated data
set.seed(3245)
# create data
x <- runif(100)
y <- 20 + 2*x + rnorm(100, 0, 2.5)
regData <- data.frame(x, y)
plot(x, y, pch = 20, col = 'purple', cex = 3)
# the parametric model
testLM <- lm(y ~ x, data = regData)
summary(testLM)
Call:
lm(formula = y ~ x, data = regData)
Residuals:
Min 1Q Median 3Q Max
-6.8665 -1.6590 -0.3159 1.7036 5.7652
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.1308 0.4977 40.446 <2e-16 ***
x 2.2300 0.8531 2.614 0.0104 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.553 on 98 degrees of freedom
Multiple R-squared: 0.06518, Adjusted R-squared: 0.05564
F-statistic: 6.833 on 1 and 98 DF, p-value: 0.01036
# we'll need this later
estimatedSlope <- coef(testLM)[2]
We simulate samples drawn assuming the \( H_0 \) is true.
set.seed(3245)
# to store results
simResults <- numeric(999)
simData <- regData
for(i in 1:999){
  # resample the y values WRT x, breaking any association (as under H0)
  simData$y <- sample(regData$y, 100, replace = T)
  # fit a model under H0
  simLM <- lm(y ~ x, data = simData)
  # store the slope
  simResults[i] <- coef(simLM)[2]
}
hist(simResults, col = "slateblue4")
abline(v = estimatedSlope, lwd = 3)
So what we observed in our data is a little extreme if \( H_0 \) were true
addEst <- c(estimatedSlope, simResults)
locEst <- c(1, rep(0, 999))
locEst <- locEst[order(addEst)]
k <- which(locEst == 1)
min(k/1000, 1-k/1000)*2
[1] 0.01
We simulate samples drawn assuming \( H_0 \) is true.
Proceed as for the regression, but store the simulated correlation coefficients instead of the slopes - a minimal sketch follows.
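A minimal sketch, reusing the simulated regData from the regression example above (shuffling y without replacement here; resampling with replacement, as in the regression loop, behaves similarly):
set.seed(3245)
# to store the simulated correlations
simResults <- numeric(999)
for(i in 1:999){
  # shuffle y WRT x, breaking any association (as under H0)
  shuffledY <- sample(regData$y, 100, replace = FALSE)
  simResults[i] <- cor(regData$x, shuffledY)
}
estimatedCor <- cor(regData$x, regData$y)
hist(simResults, col = "slateblue4")
abline(v = estimatedCor, lwd = 3)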
The bootstrap is frequently used for confidence intervals on parameters, resampling the data to avoid distributional assumptions
Let \( \theta \) be the parameter we are interested in. To make inferences about \( \theta \) from a sample to a population, we must estimate the sampling distribution of its estimator \( \hat{\theta} \). The traditional parametric approach to this task is to assume a distributional form for that sampling distribution (e.g. normal or \( t \)) and estimate its parameters from the sample.
Consider the distribution of the average height of random samples of 10 women.
If we draw a random sample of 10 women whose average height is 5'9'' with a standard deviation of 3'', our estimate of the sampling distribution of the sample mean of women's height (for \( n=10 \)) would be approximately normal, centred at 5'9'' with standard error \( 3/\sqrt{10}\approx 0.95 \) inches.
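As a quick check of that arithmetic (a sketch only, with heights converted to inches so 5'9'' is 69):
# parametric estimate of the sampling distribution of the mean height
xbar <- 69      # sample mean: 5'9'' in inches
s <- 3          # sample standard deviation in inches
n <- 10
se <- s/sqrt(n) # estimated standard error, about 0.95 inches
# approximate 95% CI using the t distribution with n - 1 df
xbar + qt(c(0.025, 0.975), df = n - 1)*se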
Two aspects of the familiar exercise are especially relevant to the discussion of the bootstrap.
(1) The parametric \( t \)-test of \( H_0: \mu=\mu_0 \) is based on an assumption about \( \bar{x} \)'s sampling distribution:
\[ \frac{\bar{x}-\mu_0}{\hat{\sigma}_{\bar{x}}}\sim t_{df=n-1}. \]
If this condition is not met, an inferential error could be made.
(2) Parametric inference also requires estimates of the parameters of the assumed sampling distribution, e.g. the mean and standard deviation in the case of a normal distribution. So we need to know the shape, and to have methods/formulae for calculating the parameter estimates.
The problem can be hard: some statistics (such as the difference between two sample medians) have no such formulae.
So we simulate the sampling distribution by simulating the sampling process, e.g. for the difference of medians as sketched below.
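For instance, a minimal sketch bootstrapping the difference of two sample medians, with two hypothetical samples simulated purely for illustration:
set.seed(3245)
# hypothetical two-sample data
A <- rexp(60, rate = 1/10)
B <- rexp(80, rate = 1/12)
obsDiff <- median(A) - median(B)
# to store the bootstrapped differences
bootResults <- numeric(1000)
for(i in 1:1000){
  # resample each group with replacement
  bootA <- sample(A, length(A), replace = TRUE)
  bootB <- sample(B, length(B), replace = TRUE)
  bootResults[i] <- median(bootA) - median(bootB)
}
# percentile-based 95% CI for the difference of medians
quantile(bootResults, probs = c(0.025, 0.975))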
Some simulated data and the parametric result (the regression example from above, regenerated)
set.seed(3245)
# create data
x <- runif(100)
y <- 20 + 2*x + rnorm(100, 0, 2.5)
regData <- data.frame(x, y)
plot(x, y, pch = 20, col = 'purple', cex = 3)
We simulate the sampling process
set.seed(3245)
# something to store lots of regression coefficients
bootResults <- array(dim = c(1000, 2))
for(i in 1:1000){
  # resample our data with replacement
  bootData <- regData[sample(1:100, 100, replace = T),]
  # fit the model under this alternative reality
  bootLM <- lm(y ~ x, data = bootData)
  # store the coefs
  bootResults[i, ] <- coef(bootLM)
}
The empirical (bootstrapped) sampling distributions for the parameters
hist(bootResults[,1], col = "slateblue4", main = 'intercept distribution')
hist(bootResults[,2], col = "slateblue4", main = 'slope distribution')
Estimate CIs from the quantiles (capture the central 95% of the bootstrapped sampling distribution for a 95% CI)
# our best guesses of parameters
c(mean(bootResults[,1]), mean(bootResults[,2]))
[1] 20.114824 2.271685
# the CIs for these
rbind(quantile(bootResults[,1], probs = c(0.025, 0.975)),
quantile(bootResults[,2], probs = c(0.025, 0.975)))
2.5% 97.5%
[1,] 19.036846 21.160962
[2,] 0.671194 3.907041
Compared to the parametric results:
coef(testLM)
(Intercept) x
20.130780 2.230037
confint(testLM)
2.5 % 97.5 %
(Intercept) 19.1430677 21.118492
x 0.5371175 3.922956
Advantages of bootstrap
Disadvantages of bootstrap
Further complications:
We've covered:
Upcoming: