Professor James Scott walk-through
A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. Permutation tests exist for any test statistic, regardless of whether or not its distribution is known.
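Before we turn to real data, here is a minimal generic sketch of this recipe in base R. The data below are made up purely for illustration, and rather than enumerating all possible rearrangements, we approximate the permutation distribution with 1000 random shuffles:
#generic permutation test: difference in means between two groups
set.seed(1)
x = c(rnorm(20, mean = 0), rnorm(20, mean = 1)) #made-up observations
g = rep(c("A", "B"), each = 20) #group labels
observed = mean(x[g == "B"]) - mean(x[g == "A"])
null_dist = replicate(1000, {
  g_shuffled = sample(g) #rearrange the labels
  mean(x[g_shuffled == "B"]) - mean(x[g_shuffled == "A"])
})
#approximate two-sided p-value
mean(abs(null_dist) >= abs(observed))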
titanic = read.csv("TitanicSurvival.csv")
head(titanic)
## X survived sex age passengerClass
## 1 Allen, Miss. Elisabeth Walton yes female 29.0000 1st
## 2 Allison, Master. Hudson Trevor yes male 0.9167 1st
## 3 Allison, Miss. Helen Loraine no female 2.0000 1st
## 4 Allison, Mr. Hudson Joshua Crei no male 30.0000 1st
## 5 Allison, Mrs. Hudson J C (Bessi no female 25.0000 1st
## 6 Anderson, Mr. Harry yes male 48.0000 1st
#contingency table: survival status stratified by sex
t1 = xtabs(~sex + survived, data = titanic)
prop.table(t1, margin=1)
## survived
## sex no yes
## female 0.2725322 0.7274678
## male 0.8090154 0.1909846
Based on this table, it seems that male passengers had a far higher chance of dying than female passengers. A natural test statistic to quantify the association between the rows and columns of this table is the relative risk of dying: p(men dying)/p(women dying).
#save table output in variable t1
t1 = xtabs(~sex + survived, data = titanic)
#convert to proportions
p1 = prop.table(t1, margin = 1)
#calculate relative risk of dying
#extract the probability of dying for each sex from the proportions table
risk_female = p1[1, 1]
risk_male = p1[2, 1]
relative_risk = risk_male/risk_female
relative_risk
## [1] 2.968513
The relative risk says that male passengers were about three times as likely to die as female passengers.
A natural follow-up question: Could this 3:1 relative risk be due to chance?
One way of addressing this is by something called a permutation test, which explicitly breaks any association between the predictor and the response by “shuffling the cards”.
First, we'll break the association between sex and survival status by shuffling the observations of the sex variable. Note: each time you run this, you'll get a different permutation of the sex variable, and therefore a data frame in which the correspondence between sex and survival in the original data set has been broken.
library(mosaic)
attach(titanic)
names(titanic)
## [1] "X" "survived" "sex" "age"
## [5] "passengerClass"
titanic_shuffle = data.frame(shuffle(sex), survived)
head(titanic_shuffle, 10)
## shuffle.sex. survived
## 1 male yes
## 2 female yes
## 3 male no
## 4 male no
## 5 male no
## 6 male yes
## 7 male yes
## 8 female no
## 9 male yes
## 10 male no
The simple trick of shuffling the cards allows us to assess the plausible range of values for the relative risk under the assumption that there is no association between the two variables (the null hypothesis).
t1_shuffle = xtabs(~shuffle(sex) + survived, data = titanic)
#quick way to calculate relative risk using built-in function
relrisk(t1_shuffle)
## [1] 0.9892993
Try executing the code block above a few times. You'll notice that each time you calculate the relative risk, you get something much closer to 1 than the value of about 3 from the actual data.
Let’s use a Monte Carlo simulation to repeat this process 1000 times. This will allow us to make a histogram showing the sampling distribution of the relative risk under the null hypothesis of no association between sex and survival status:
permtest1 = do(1000)*{
  t1_shuffle = xtabs(~shuffle(sex) + survived, data = titanic)
  relrisk(t1_shuffle)
}
head(permtest1)
## RR
## 1 1.0108670
## 2 1.0163535
## 3 1.0274428
## 4 0.9892993
## 5 1.0913492
## 6 1.0108670
hist(permtest1$RR, xlab = "Relative Risk", main = "Sampling Distribution of the Relative Risk", col = "yellow", breaks = 20, xlim = c(0.8, 1.3), cex.lab = 1.5, cex.main = 1.5)
mean(permtest1$RR)
## [1] 1.000227
min(permtest1$RR)
## [1] 0.8898098
max(permtest1$RR)
## [1] 1.160642
When the cards are shuffled, and therefore when any possible connection between sex and survival status is explicitly broken, the relative risk almost never falls outside the interval (0.85, 1.15). This is pretty convincing evidence that the actual value we observed (about 3) is extremely unlikely to have arisen by chance.
Furthermore, the average of the relative risks across all the permutations comes out almost exactly 1, which is just what we would expect if the null hypothesis is true and sex has no effect on whether or not one survived.
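To put a number on this, we can estimate a one-sided p-value: the fraction of shuffled relative risks at least as large as the value of about 2.97 that we actually observed. (This snippet assumes permtest1 and relative_risk from the code above are still in your workspace.)
#estimated one-sided p-value from the permutation distribution
mean(permtest1$RR >= relative_risk)
Given the minimum and maximum reported above, this evaluates to 0 here: none of the 1000 shuffles came anywhere close to 2.97, so the p-value is smaller than 1/1000.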
We now turn from permutation tests to a second resampling technique: the bootstrap.
#data = age, kidney function for a sample of 157 men from a single clinic
library(mosaic)
creatinine = read.csv("creatinine.csv", header = TRUE)
#creatclear: the patient’s creatinine-clearance rate, measured in ml/minute
head(creatinine)
## age creatclear
## 1 31 117.3
## 2 36 124.8
## 3 24 145.8
## 4 35 118.8
## 5 53 103.2
## 6 36 127.0
summary(creatinine)
## age creatclear
## Min. :18.00 Min. : 89.3
## 1st Qu.:25.00 1st Qu.:118.6
## Median :31.00 Median :128.0
## Mean :36.39 Mean :125.3
## 3rd Qu.:43.00 3rd Qu.:133.3
## Max. :88.00 Max. :147.6
The Question: What can we say about the average creatinine-clearance rate for the population of men who attend this clinic, on the basis of this particular sample of 157? The sample mean is easy enough to compute:
attach(creatinine)
## The following object is masked from titanic:
##
## age
mean(creatclear)
## [1] 125.2548
But we know that our sample mean of 125 won’t exactly equal the population mean. To quantify how far off our estimate is likely to be, we would like to know the standard error of the sample mean.
Moreover, we’d like to know this without taking many more samples of 157 from the population and seeing how the sample mean changes from one sample to the next.
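As a point of comparison, for the sample mean specifically there is a classical plug-in formula: the standard error is approximately the sample standard deviation divided by the square root of the sample size. We can compute it now and use it later as a benchmark for the bootstrap answer:
#formula-based (plug-in) estimate of the standard error of the sample mean
sd(creatinine$creatclear) / sqrt(nrow(creatinine))
The appeal of the bootstrap, introduced next, is that it gives this same kind of answer even for statistics where no simple formula exists.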
The idea of bootstrapping is to pretend that your sample represents the whole population. We then take repeated “bootstrapped” samples from the original sample. Each bootstrapped sample is defined by two properties:
1. It is the same size as the original sample (here, 157 observations).
2. It is drawn with replacement, so some original data points appear more than once (“ties”) while others don’t appear at all (“omissions”).
Let’s see this in action. First, let’s create a bootstrapped sample and look at the first 20 data points. We’ll do this using the sample command.
single_bootstrapped_sample = sample(creatinine, size = 157, replace=TRUE)
head(single_bootstrapped_sample, 20)
## age creatclear orig.ids
## 138 23 119.9 138
## 73 65 89.3 73
## 45 26 130.3 45
## 120 25 133.1 120
## 42 23 131.6 42
## 156 21 135.4 156
## 53 32 137.9 53
## 51 54 106.2 51
## 67 33 122.2 67
## 128 68 102.2 128
## 144 45 117.8 144
## 130 32 126.6 130
## 58 18 143.8 58
## 14 55 114.0 14
## 152 70 95.8 152
## 42.1 23 131.6 42
## 36 33 137.9 36
## 3 24 145.8 3
## 28 45 127.8 28
## 30 56 112.3 30
#You can visualize the pattern of ties and omissions with the following plot:
plot(table(single_bootstrapped_sample$orig.ids), xlab = "Original Data Point (orig.ids)", ylab = "Frequency", main = "Pattern of Ties & Omissions", cex.axis = 1.5, cex.lab = 1.5, cex.main = 1.5)
The height of each bar shows how many times that original data point was picked. The gaps show data points that were omitted.
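You can also count the ties and omissions directly. Here is a quick sketch using the orig.ids column that mosaic attaches to the bootstrapped sample:
#how many of the 157 original data points never got picked?
sum(!(1:nrow(creatinine) %in% single_bootstrapped_sample$orig.ids))
#how many points appear once, twice, three times, ...?
table(table(single_bootstrapped_sample$orig.ids))
On average, a bootstrapped sample omits about 37% of the original points, since the chance that any given point is never drawn is (1 - 1/157)^157, which is approximately 1/e.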
We’re now ready to estimate the sampling distribution of the sample mean by bootstrapping. Our basic procedure is:
1. Draw a bootstrapped sample from the original data set (same size, with replacement).
2. Calculate the sample mean of that bootstrapped sample.
We repeat this process a large number of times (say, 1000 or more). The key point is that, because each bootstrapped sample has a unique pattern of ties and omissions, each will have a different sample mean. The histogram of sample means across the bootstrapped samples then gives us an idea of how the sample mean changes from sample to sample.
my_boot = do(1000)*{
  bootstrapped_sample = resample(creatinine)
  mean(bootstrapped_sample$creatclear)
}
hist(my_boot$result, breaks = 20, main = "Sampling Distribution of CreatClear Sample Means", xlab = "Sample Means", ylab = "Frequency", col = "purple", cex.axis = 1.5, cex.lab = 1.5, cex.main = 1.5)
#add vertical lines at mean +/- one standard error (red) and at the mean (pink)
abline(v = (mean(my_boot$result) + sd(my_boot$result)), col = "red", lwd = 8)
abline(v = (mean(my_boot$result) - sd(my_boot$result)), col = "red", lwd = 8)
abline(v = mean(my_boot$result), col = "pink", lwd = 8)
Incidentally, if you repeatedly execute the above code block, you’ll get slightly different histograms and standard errors each time. We refer to this variability as “Monte Carlo error,” to distinguish it from the standard error of the estimator itself. In principle, you can drive the Monte Carlo error to virtually nothing by taking a very large number of bootstrapped samples.
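To see the distinction concretely, here is a sketch that compares two independent runs of 1000 bootstrapped samples with a single run of 10000 (the last line takes noticeably longer):
boot_small_1 = do(1000)*{ mean(resample(creatinine)$creatclear) }
boot_small_2 = do(1000)*{ mean(resample(creatinine)$creatclear) }
boot_large = do(10000)*{ mean(resample(creatinine)$creatclear) }
#the two small runs give slightly different standard errors (Monte Carlo error)
sd(boot_small_1$result)
sd(boot_small_2$result)
#the larger run gives a more stable estimate of the same quantity
sd(boot_large$result)
Two runs of 10000 samples will agree with each other much more closely than two runs of 1000 do, even though the standard error they are all estimating is the same.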