MT5762 Lecture 19

C. Donovan

Computer intensive inference

We have conducted tests and created confidence intervals under distributional assumptions

When these are badly violated, then our inference is unreliable

We can look to non-parametric variants (say based on ranks), however, these do not exist for all things (and answer slightly different questions)

Here we look at very general purpose approaches to inference - effectively statistical simulations

Computer intensive approaches

In short we will resample our sample for two purposes:

  • Randomisation tests: generate parameter distributions assuming \( H_0 \) is true, to obtain \( p \)-values
  • Bootstrapping: generate approximate sampling distributions of parameters, to obtain confidence intervals

Calculations for tests assume \(H_0\) is true

  • We often use a theoretical 'Null' distribution to determine the probability of having observed data like ours (or more extreme)
  • This requires we assume the form of the parameter's sampling distribution, assuming \( H_0 \) to be true
  • This stems in part from the distribution of the population we're sampling, although the CLT helps for some things

We like to avoid assumptions

  • If this assumption is badly violated then our calculated probabilities will be quite wrong.
  • We check this assumption and may rely on the Central Limit Theorem.
  • If this isn't sufficient, what other methods are open to us?

We like to avoid assumptions

Maybe revert to non-parametric tests which relax the distributional assumption.

  • Many 'traditional' non-parametric tests match parametric methods we have seen, e.g. the Mann-Whitney U test, the Kruskal-Wallis test, the Sign test etc. (base-R calls are sketched after this list).
  • Usually these methods are based on the ranks of the observations and their assumptions are not terribly restrictive.
  • (The name non-parametric suggests 'without parameters' - in practice it usually means the parameters have no specific interpretation, or are too numerous to be of direct interest.)
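
For reference, base R provides rank-based versions of these tests directly. A minimal sketch on hypothetical data (the data frame npData below is invented for illustration, not part of the lecture's examples):

  # hypothetical data: two groups with skewed responses
  npData <- data.frame(Group = rep(c("A", "B"), each = 30),
                       response = c(rexp(30, rate = 1), rexp(30, rate = 0.5)))

  # Mann-Whitney U (Wilcoxon rank-sum) test - the two-sample t-test analogue
  wilcox.test(response ~ Group, data = npData)

  # Kruskal-Wallis test - the one-way ANOVA analogue
  kruskal.test(response ~ Group, data = npData)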

Approach the problem via simulation

Cheap computing makes a very general, computationally intensive approach to most problems possible. Consider the simple case of an independent two-sample \( t \)-test:

  • we have a Null Hypothesis that states the samples have actually been drawn from the same population
    • to generate the \( p \)-value we assume a distribution for the difference in means that has mean 0
    • the variance of this difference is estimated by combining the variances of the two samples, as shown below.
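
For example, in the Welch form of the test (R's default for t.test), the standard error of the difference in means combines the two sample variances:

\[ \begin{align*} \widehat{\mathrm{se}}(\bar{x}_1-\bar{x}_2)=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}. \end{align*} \]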

Approach the problem via simulation

In short, under the Null Hypothesis our data have arbitrary group labels A or B - they are assigned by chance.

  • To simulate, we need only randomly assign labels to our observations - this simulates the sampling process under the Null hypothesis (meaning \( H_0 \) is true)

  • We can see whether the sample we observed is unusual if the Null hypothesis were true, but without assuming a specific distribution.

Randomisation tests - \(t\)-test equivalents

Approach the problem via simulation

In this course we have considered 3 main applications of the \( t \)-test (ignoring its use in regression). These are:

  • comparing the means of two sampled populations,
  • comparing the mean of a sampled population to some hypothesised value,
  • the paired variant, comparing the means of paired observations.

We consider computer intensive variants of these in turn.

two-sample \(t\)-test - via randomisation

  • \( H_0 \): \( \mu_1=\mu_2 \), equivalently \( \mu_1-\mu_2=0 \), or in plainer language: the means of the two populations we have sampled are equal (alternatively, the samples are drawn from the same population)

  • \( H_A \): \( \mu_1\ne\mu_2 \), equivalently \( \mu_1-\mu_2\ne 0 \), or in plainer language: the means of the two populations we have sampled are not equal (the samples are drawn from statistically distinct populations).

To devise a computer intensive approach we wish to simulate samples drawn assuming the Null hypothesis is true.

two-sample \(t\)-test - via randomisation

One way to approach this is:

  • Randomly jumble the identifiers for your two group labels relative to their sampled response values.
  • Calculate the difference between the means of these two randomised groups.
  • Store this result and repeat the process 999 times.
  • You now have a distribution of differences under the Null hypothesis - the mean of this set of values should be about zero.

two-sample \(t\)-test - randomisation

  • To determine the \( p \)-value for your actual test result, append the difference you actually observed to the set of randomised values and sort.
  • The approximate \( p \)-value will be less than \( \min(k/1000, 1-k/1000) \), where \( k \) is the ordered position of our original sample estimate. Note that if it is more extreme than all the randomised results, \( p<0.001 \).
  • NB you'd double this for a two-tailed test. Why? Imagine your sample actually had a zero mean difference (you'd want a \( p \)-value of 1).

two-sample \(t\)-test - randomisation

  set.seed(3245)
  # simulate data where there is a difference between groups
  GP1 <- data.frame(Group = "GP1", response = rnorm(100, 10, 3))
  GP2 <- data.frame(Group = "GP2", response = rnorm(100, 20, 3))

  # we'll need this later
  estimatedDiff <- mean(GP1$response) - mean(GP2$response)

  simData <- rbind(GP1, GP2)

  # the parametric tests
  t.test(response ~ Group, data = simData)

    Welch Two Sample t-test

data:  response by Group
t = -22.714, df = 189.32, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.827362  -9.097018
sample estimates:
mean in group GP1 mean in group GP2 
         10.12634          20.08853 

two-sample \(t\)-test - randomisation

  set.seed(3245)  

  sampData <- simData

  # something to store our results
  simResults <- numeric(999)  

  for(i in 1: 999){

    # randomly arrange the group labels WRT y
    sampData$Group <- sample(simData$Group, 200, replace = F)

    # calculate and store the difference (under H0)
    means <- by(sampData$response, sampData$Group, mean) 

    simResults[i] <- means[1] - means[2]

  }

two-sample \(t\)-test - randomisation

  hist(simResults, col = "slateblue4", xlim = c(-12, 12))

  abline(v = estimatedDiff, lwd = 3)

[Figure: histogram of the randomisation distribution of differences, with the observed difference marked by a vertical line]

You'll observe that what we saw in our data is extreme if \( H_0 \) is true

  # place the observed difference among the randomised ones
  addEst <- c(estimatedDiff, simResults)

  # flag which entry is the observed difference
  locEst <- c(1, rep(0, 999))

  locEst <- locEst[order(addEst)]

  # its ordered position gives the approximate p-value
  k <- which(locEst == 1)

  min(k/1000, 1-k/1000)
[1] 0.001

single-sample \(t\)-test - randomisation

Consider the single-sample \( t \)-test:

  • \( H_0 \): \( \mu=\theta \), or in plainer language: the mean of the population we have sampled is equal to some theoretical constant.
  • \( H_A \): \( \mu\ne\theta \), or in plainer language: the mean of the population we have sampled is not equal to some theoretical constant.

single-sample \(t\)-test - randomisation

Simulate samples assuming the Null hypothesis is true. One way to approach this is:

  • Resample your data to get the approximate shape of the sampling distribution, i.e.
    • Draw a random sample of size \( n \) (equal to the original sample size) with replacement from the data.
    • Store the mean and repeat the process 999 times.
  • Centre this set of means on the Null hypothesised value.
  • You now have an empirical sampling distribution of the mean under the Null hypothesis.

single-sample \(t\)-test - randomisation

  • To determine the \( p \)-value for your actual test result, ask where your observed value lies in this distribution.
  • The approximate \( p \)-value will be less than \( \min(k/1000, 1-k/1000) \), where \( k \) is the ordered position of our original sample estimate. Note that if it is more extreme than all the randomised results, \( p<0.001 \).

NB Equally, you could sample about your estimate and see where the \( H_0 \) value lies (but our narrative is sampling under \( H_0 \))

One-sample \(t\)-test - randomisation

  set.seed(3245)

  # simulated data where the mean is 2, plus a lot of noise
  GP1 <- data.frame(Group = "GP1", response = rnorm(100, 2, 10))

  # we'll need this later
  estimatedMean <- mean(GP1$response)

  # the parametric test of H0 = 0
  t.test(GP1$response, mu = 0)

    One Sample t-test

data:  GP1$response
t = 2.6419, df = 99, p-value = 0.009582
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.6027571 4.2395348
sample estimates:
mean of x 
 2.421146 

One-sample \(t\)-test - randomisation

  set.seed(3245)  

  # to store our simulations
  simResults <- numeric(999)  

  for(i in 1: 999){

    # get the sampling distribution 
    simResults[i] <- mean(sample(GP1$response, 100, replace = T))

  }

  # centre the resampled means on zero - the value of the mean if H0 is true
  simResults <- simResults - estimatedMean

One-sample \(t\)-test - randomisation

  hist(simResults, col = "slateblue4")

  abline(v = estimatedMean, lwd = 3)

[Figure: histogram of the centred resampled means, with the observed mean marked by a vertical line]

So what we saw in our data is a little extreme if \( H_0 \) is true (so reject \( H_0 \))

  addEst <- c(estimatedMean, simResults)

  locEst <- c(1, rep(0, 999))

  locEst<- locEst[order(addEst)]

  k <- which(locEst == 1)

  min(k/1000, 1-k/1000)*2
[1] 0.01

paired-sample \(t\)-test - randomisation

This is just an extension of the previous examples: under \( H_0 \) the labels within each pair are arbitrary, so we randomise the group labels within each pair - equivalently, we randomly flip the sign of each paired difference (a sketch follows).
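
A minimal sketch, using hypothetical paired data invented here (before/after measurements on the same 50 subjects): under \( H_0 \) the within-pair labels are exchangeable, so each difference is equally likely to be positive or negative.

  set.seed(3245)

  # hypothetical paired data: before/after on the same 50 subjects
  before <- rnorm(50, 10, 2)
  after  <- before + rnorm(50, 0.5, 1)
  diffs  <- after - before

  estimatedMeanDiff <- mean(diffs)

  simResults <- numeric(999)

  for(i in 1:999){

    # under H0, randomly flip the sign of each paired difference
    signs <- sample(c(-1, 1), 50, replace = TRUE)

    simResults[i] <- mean(signs * diffs)

  }

  # locate the observed mean difference, as in the earlier examples
  addEst <- c(estimatedMeanDiff, simResults)
  locEst <- c(1, rep(0, 999))
  locEst <- locEst[order(addEst)]
  k <- which(locEst == 1)
  min(k/1000, 1 - k/1000)*2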

Extension to regression

Now consider a test for a slope parameter in a \( x\rightarrow y \) regression (or similarly a correlation).

We are seeking to test if there is a linear association between \( x \) and \( y \) - how to simulate \( H_0 \)?

  • \( H_0 \): \( \beta_1=0 \), or in plainer language: the true slope of the relationship between \( x \) and \( y \) is zero - no linear association.
  • \( H_A \): \( \beta_1\ne 0 \), or in plainer language: the true slope of the relationship between \( x \) and \( y \) is not zero - there is a linear association.

Simulation under \(H_0\)

  • Randomly jumble the \( x \) and \( y \) values relative to one another.
  • Calculate the regression coefficient(s).
  • Store this result and repeat the process 999 times.
  • You now have a distribution of slopes under the Null hypothesis - the mean is approximately zero.
  • To determine the \( p \)-value for your actual test result, append the slope you actually observed and sort.
  • The approximate \( p \)-value will be less than \( \min(k/1000, 1-k/1000) \), where \( k \) is the ordered position of our original sample estimate. Note that if it is more extreme than all the randomised results, \( p<0.001 \).

Extension to regression

Some simulated data

  set.seed(3245)
  # create data
  x <- runif(100)
  y <- 20 + 2*x + rnorm(100, 0, 2.5)

  regData <- data.frame(x, y)

  plot(x, y, pch = 20, col = 'purple', cex = 3)

[Figure: scatterplot of the simulated x and y values]

Extension to regression

  # the parametric model
  testLM <- lm(y ~ x, data = regData)

  summary(testLM)

Call:
lm(formula = y ~ x, data = regData)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8665 -1.6590 -0.3159  1.7036  5.7652 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20.1308     0.4977  40.446   <2e-16 ***
x             2.2300     0.8531   2.614   0.0104 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.553 on 98 degrees of freedom
Multiple R-squared:  0.06518,   Adjusted R-squared:  0.05564 
F-statistic: 6.833 on 1 and 98 DF,  p-value: 0.01036
  # we'll need this later
  estimatedSlope <- coef(testLM)[2]

Extension to regression

We simulate samples drawn assuming \( H_0 \) is true.

  set.seed(3245)  

  # to store results
  simResults <- numeric(999)  

  simData <- regData

  for(i in 1: 999){
    # permute y relative to x; note sampling with replacement here gives a
    # bootstrap under H0 - use replace = FALSE for a strict permutation
    simData$y <- sample(regData$y, 100, replace = T)

    # fit a model under H0
    simLM <- lm(y ~ x, data = simData)

    #store the slope
    simResults[i] <- coef(simLM)[2]

  }

Regression randomisation test

  hist(simResults, col = "slateblue4")

  abline(v = estimatedSlope, lwd = 3)

[Figure: histogram of the randomisation distribution of slopes, with the observed slope marked by a vertical line]

So what we observed in our data is a little extreme if \( H_0 \) were true

  addEst <- c(estimatedSlope, simResults)

  locEst <- c(1, rep(0, 999))

  locEst<- locEst[order(addEst)]

  k <- which(locEst == 1)

  min(k/1000, 1-k/1000)*2
[1] 0.01

Extension to correlation

We simulate samples drawn assuming \( H_0 \) is true.

  • Randomly jumble the \( x \) and \( y \) values relative to one another.
  • Calculate the correlation coefficient.

Proceed as for the regression, but use the simulated correlation coefficients (a sketch follows).
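
A minimal sketch, reusing the regData simulated earlier and permuting y relative to x (without replacement):

  set.seed(3245)

  estimatedCor <- cor(regData$x, regData$y)

  simResults <- numeric(999)

  for(i in 1:999){

    # break any association by shuffling y relative to x
    simResults[i] <- cor(regData$x, sample(regData$y, 100, replace = FALSE))

  }

  # locate the observed correlation in the randomisation distribution
  addEst <- c(estimatedCor, simResults)
  locEst <- c(1, rep(0, 999))
  locEst <- locEst[order(addEst)]
  k <- which(locEst == 1)
  min(k/1000, 1 - k/1000)*2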

Bootstrap

This is frequently used for confidence intervals on parameters, using resampling to avoid distributional assumptions.

Basic bootstrapping: Introduction

  • The bootstrap is a method for accessing some of the properties of a statistical estimator in a non-parametric framework.
  • We do not assume that the data are drawn from any particular parametric distribution (e.g. normal, exponential, etc.).
  • The bootstrap was originally introduced by Efron (1979) to compute the standard error of an arbitrary estimator.

Traditional Parametric Statistical Inference

Let \( \theta \) be the parameter we are interested in. To make inferences of \( \theta \) from a sample to a population, we must estimate \( \hat{\theta} \)'s sampling distribution. The traditional parametric approach to this task is:

  • to assume \( \hat{\theta} \)'s sampling distribution has a shape with known probability properties (e.g., a normal distribution).
  • to estimate analytically the parameters of that sampling distribution (e.g., the mean and the standard deviation of a normal distribution).

Traditional Parametric Statistical Inference

Consider the distribution of the average height of random samples of 10 women.

  • If we assume women's height is normally distributed in the population, the sampling distribution of the average height of samples of women will also be normal.
  • Based on this assumption and deduction:
    • the sample mean, \( \bar{x}=1/n\sum_{i=1}^nx_i \), is our estimate of the mean of this distribution,
    • the estimated standard error of the mean, \( \hat{\sigma}_{\bar{x}}=\hat{\sigma}/\sqrt{n} \), is our estimate of the standard deviation of the sampling distribution.

Traditional Parametric Statistical Inference

If we draw a random sample of 10 women whose average height is 5'9'' with a standard deviation of 3'', our estimate of the sampling distribution of the sample mean of women's height (for \( n=10 \)) would be that it:

  • is normally distributed,
  • is centred on 5'9'', and
  • has a standard deviation of 0.95'' (i.e., \( 3/\sqrt{10} \)).
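
As a quick check of that last number in R:

  # standard error of the mean: sd / sqrt(n)
  3 / sqrt(10)
[1] 0.9486833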

Traditional Parametric Statistical Inference

Two aspects of the familiar exercise are especially relevant to the discussion of the bootstrap.

(1) Firstly, the parametric \( t \)-test of \( H_0: \mu=\mu_0 \) is based on an assumption about \( \bar{x} \)'s sampling distribution:

\[ \begin{align*} \frac{\bar{x}-\mu_0}{\hat{\sigma}_{\bar{x}}}\sim t_{df=n-1}. \end{align*} \]

If this condition is not met, an inferential error could be made.
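
A minimal sketch of that calculation, reusing the GP1 data simulated earlier; the hand-computed statistic and \( p \)-value should reproduce the t.test output shown before (t = 2.6419, p ≈ 0.0096):

  # hand-computed one-sample t statistic for H0: mu = 0
  n    <- length(GP1$response)
  xbar <- mean(GP1$response)
  se   <- sd(GP1$response)/sqrt(n)

  tStat <- (xbar - 0)/se

  # two-sided p-value from the t distribution with n - 1 df
  2*pt(-abs(tStat), df = n - 1)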

Traditional Parametric Statistical Inference

Two aspects of the familiar exercise are especially relevant to the discussion of the bootstrap.

(2) Parametric inference also requires estimates of the parameters of the assumed sampling distribution, e.g. the mean and standard deviation in the case of a normal distribution. So we need to know the shape, and have methods/formulae for calculating the parameter estimates.

The problem can be hard: Some statistics (such as the difference between two sample medians) have no such formulas.

Bootstrap Statistical Inference

So

  • The traditional parametric approach to inference is sometimes less than ideal.
  • The bootstrap allows us to make inference without making these strong distributional assumptions and without the need for formulas for the sampling distribution's parameters.
  • The basic bootstrap approach is to treat the sample as if it is the population, and sample it (samples of size \( n \)) with replacement

We simulate the sampling distribution by simulating sampling
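
A minimal sketch for a single mean, reusing the GP1 data simulated earlier: treat the sample as if it were the population and resample it with replacement.

  set.seed(3245)

  bootMeans <- numeric(1000)

  n <- length(GP1$response)

  for(i in 1:1000){

    # resample the data with replacement, same size as the original sample
    bootMeans[i] <- mean(sample(GP1$response, n, replace = TRUE))

  }

  # quantile-based 95% CI for the population mean
  quantile(bootMeans, probs = c(0.025, 0.975))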

Bootstrap Statistical Inference

  • The randomisation tests we saw were empirical hypothesis tests, whereas bootstraps are usually used for empirical CIs.
  • There are non-parametric and parametric bootstraps. We only really consider the non-parametric version here.
  • We also only consider the quantile method (explained later); there are other approaches.

Extension to regression

Some simulated data and the parametric result

  set.seed(3245)
   # create data
  x <- runif(100)
  y <- 20 + 2*x + rnorm(100, 0, 2.5)

  regData <- data.frame(x, y)

  plot(x, y, pch = 20, col = 'purple', cex = 3)

[Figure: scatterplot of the simulated x and y values]

Regression bootstrap coefficients

We simulate the sampling process

  set.seed(3245)  

  # something to store lots of regression coefficients
  bootResults <- array(dim = c(1000, 2))  


  for(i in 1:1000){

    # resample our data with replacement
    bootData <- regData[sample(1:100, 100, replace = T),]

    # fit the model under this alternative reality
    bootLM <- lm(y ~ x, data = bootData)

    # store the coefs
    bootResults[i, ] <- coef(bootLM)

  }

Regression bootstrap coefficients

The empirical (bootstrapped) sampling distributions for the parameters

  hist(bootResults[,1], col = "slateblue4", main = 'intercept distribution')

[Figure: bootstrap sampling distribution of the intercept]

  hist(bootResults[,2], col = "slateblue4", main = 'slope distribution')

[Figure: bootstrap sampling distribution of the slope]

Regression bootstrap coefficients

Estimate the CIs by quantiles (take the central 95% of the bootstrapped sampling distribution for a 95% CI)

  # our best guesses of parameters
  c(mean(bootResults[,1]), mean(bootResults[,2]))
[1] 20.114824  2.271685
  # the CIs for these
  rbind(quantile(bootResults[,1], probs = c(0.025, 0.975)),
        quantile(bootResults[,2], probs = c(0.025, 0.975)))
          2.5%     97.5%
[1,] 19.036846 21.160962
[2,]  0.671194  3.907041

Compared to the parametric

  coef(testLM)
(Intercept)           x 
  20.130780    2.230037 
  confint(testLM)
                 2.5 %    97.5 %
(Intercept) 19.1430677 21.118492
x            0.5371175  3.922956

Bootstrapping

Advantages of bootstrap

  • Flexible - we can focus on any quantity of interest, \( \theta \).
  • No strong distributional assumptions are needed.
  • Can include model-selection uncertainty in many cases.
  • Easy to interpret.

Disadvantages of bootstrap

  • For small datasets, the bootstrapped empirical distribution can be a misleading estimate of the population distribution (because of natural variation), so the bootstrap estimator can be inaccurate.
  • 'Bad' bootstraps - models that fail (or don't provide all the estimates you need).

Bootstrapping

Further complications:

  • There are lots of different sorts of bootstraps.
  • Be very careful when data are not independent - in the simple case we resample independent units of data; these may need to be blocks (e.g. whole subjects), or something more complex still (a sketch follows).
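
A minimal sketch of resampling by blocks, using a hypothetical repeated-measures data frame (longData, invented for illustration): we resample whole subjects rather than individual rows, so the dependence within a subject is preserved.

  set.seed(3245)

  # hypothetical repeated-measures data: 20 subjects, 5 observations each
  longData <- data.frame(Subject  = rep(1:20, each = 5),
                         response = rnorm(100, 10, 3))

  bootMeans <- numeric(1000)

  for(i in 1:1000){

    # resample whole subjects (blocks) with replacement
    ids <- sample(unique(longData$Subject), 20, replace = TRUE)

    # stack the selected subjects' rows into one bootstrap sample
    bootData <- do.call(rbind, lapply(ids, function(id)
                          longData[longData$Subject == id, ]))

    bootMeans[i] <- mean(bootData$response)

  }

  quantile(bootMeans, probs = c(0.025, 0.975))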

Recap and look-forwards

We've covered:

  • The basic concepts of computer intensive inference:
    • Randomisation tests to replace parametric tests
    • Bootstrapping to replace parametric confidence intervals
  • We've looked at applying these to replace:
    • Various t-tests
    • Regression CIs and tests on parameters

Upcoming:

  • Analysing tables of counts (Chi-square tests of various sorts)