In the hypothesis tests you learned earlier this semester (e.g. t-test, ANOVA), you formed hypotheses, calculated the test stat, found the test stat’s sampling distribution under the null hypothesis using statistical theory (e.g. the ratio of two independent chi-squared variables, each divided by its degrees of freedom, is an F), and compared the test stat to that null distribution to find the p-value.

In randomization-based hypothesis testing, we form hypotheses, calculate the test stat, create a distribution for the test stat under the null hypothesis, and use that distribution to approximate the p-value. If you want to do a randomization-based version of one of the tests that we just learned this semester, the only difference is that you need to simulate the sampling distribution under the null hypothesis.

Example: two means

Let’s consider the iris data and look at sepal length by species. Let’s focus just on two species: setosa and virginica. We can subset the data to create one dataset with just setosa and another dataset with just virginica with the code below.

setos <- subset(iris, Species == "setosa")
virg <- subset(iris, Species == "virginica")

To isolate the sepal lengths from each, we can use the dollar sign syntax. The name of the dataset goes before the dollar sign, and the name of the variable goes after the dollar sign. Below are the sepal lengths from just the setosa group.

setos$Sepal.Length
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
Ssepal <- setos$Sepal.Length  # sepal lengths for setosa
Vsepal <- virg$Sepal.Length   # sepal lengths for virginica
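
It never hurts to check that these are the vectors we think they are. Each species appears 50 times in iris, so both vectors should have length 50, and their means are the two sample means we’re about to compare (output not shown here):

length(Ssepal)  # should be 50
length(Vsepal)  # should be 50
mean(Ssepal)    # sample mean sepal length for setosa
mean(Vsepal)    # sample mean sepal length for virginica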

The null hypothesis is that the mean sepal lengths of the two species are equal. The alternative is that the mean sepal lengths are not equal. First, let’s calculate the test stat for this test. We could choose from a variety of test stats, but the simplest is the difference in the sample means.

teststat <- mean(Ssepal) - mean(Vsepal)
teststat
## [1] -1.582

Recall that under the null hypothesis there is no difference in mean sepal length between the two species. Informally, the species labels don’t matter, so we could mix all of these sepal lengths together and randomly split the 100 observations into two groups. We’d then label one of these groups virginica and the other setosa, calculate each group’s mean, and store the difference in the means. We’d repeat this process a large number of times (maybe m = \(10^4\)) to get the sampling distribution of our test stat (the difference in the means) under the null hypothesis.

First, let’s set up a few things.

m <- 10^4                     # set the number of iterations in the simulation
sampdist <- rep(0, m)         # we'll store the simulated test stats here
lengths <- c(Ssepal, Vsepal)  # combine the data

Next, let’s actually run the loop to simulate the sampling distribution under the null hypothesis. On each iteration we randomly choose 50 of the numbers 1 through 100, use indexing to select those lengths from our combined data, and call them setosa. Then we select all the lengths EXCEPT the setosa lengths, and we call those virginica. Finally, we find the difference in the means and store that in our sampling distribution.

for(i in 1:m){
    choosethese <- sample(1:100, 50)   # randomly pick 50 of the 100 positions
    fakeset <- lengths[choosethese]    # call these lengths "setosa"
    fakevirg <- lengths[-choosethese]  # the remaining 50 are "virginica"
    sampdist[i] <- mean(fakeset) - mean(fakevirg)  # store the simulated test stat
}
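
As an aside, if you prefer to avoid writing the loop yourself, replicate() builds the same kind of sampling distribution. This is just an alternative sketch, not something you need to change; onestat and sampdist2 are names I made up here.

# one randomization: reshuffle which 50 lengths get called "setosa", recompute the difference
onestat <- function(){
    choosethese <- sample(1:100, 50)
    mean(lengths[choosethese]) - mean(lengths[-choosethese])
}
sampdist2 <- replicate(m, onestat())  # should look just like sampdist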

Now, we need to check our work to see whether it makes sense. Let’s look at the histogram. The null hypothesis is the two means are the same, and the test stat is the difference in sample means. So, if the null hypothesis is true, the sampling distribution of our test stat should be centered at 0.

hist(sampdist)

Yes, that’s centered at zero, so there is not an obvious problem. Let’s now calculate our p-value. Since we have a two-sided alternative hypothesis, we need to find the smaller tail, calculate the proportion of samples that are as extreme as or more extreme than our original test stat, and double that proportion. Our test stat is negative and our sampling distribution is centered at 0, so the smaller tail is the lower tail.

We can plot the test stat with the sampling distribution we’ve simulated to get an idea of how extreme our original test stat is compared to those we’d see if the null were true. In the code below, I expanded the limits of the x-axis so that we’d be zoomed out enough to actually see the test stat.

hist(sampdist, xlim = c(teststat, -teststat), xlab = "setosa mean - virginica mean", main = NULL)
abline(v = teststat, lty = 2)

Our test stat is way out there. There’s practically no way we’d see this test stat if the null hypothesis were actually true. Just for practice, though, let’s calculate the p-value. The numerator is the number of test stats that are as extreme as or more extreme than ours. In other words, we compare our test stat to each test stat in the simulated sampling distribution and flag the simulated test stats that are even farther out in the lower tail than ours. (This is what the code inside the parentheses does; it returns a vector of length m with TRUE or FALSE in each entry.) Then we count those. Finally, we add one because we need to include our original dataset’s test stat.

numerator <- sum(teststat > sampdist) + 1  # simulated test stats more extreme than ours, plus the original
denom <- m + 1                             # total number of test stats, including the original
lowtail <- numerator/denom                 # lower-tail proportion
2*lowtail                                  # double it for the two-sided p-value
## [1] 0.00019998
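
Here we knew the lower tail was the smaller one because the test stat is negative. If you’d rather not reason about which tail is smaller, another common way to get a two-sided p-value is to count the simulated test stats that land at least as far from zero as ours, in either direction; it typically gives close to the same answer as doubling the smaller tail.

(sum(abs(sampdist) >= abs(teststat)) + 1) / (m + 1)  # two-sided p-value based on distance from 0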

We get a really small p-value. Let’s think again about whether that makes sense. Check out the boxplots of the original data:

boxplot(Ssepal, Vsepal, names = c("setosa", "virginica"))

It seems really unlikely that the two means are the same, right? Of course our p-value is small.
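
As a sanity check, you can also compare this to the t-based test from earlier in the semester. The two p-values won’t match exactly, since the randomization p-value is simulation-based, but both should be tiny for these data:

t.test(Ssepal, Vsepal)  # Welch two-sample t-test, for comparison only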

Now you try

  1. Test whether the mean sepal length for setosa is smaller than the mean sepal length for virginica. (What do you need to change from the last problem?)

  2. Test whether the mean petal length for versicolor is smaller than the mean sepal length for virginica.

  3. Test whether the median petal length for versicolor is smaller than the median sepal length for virginica.

  4. Use the ColaCalcium data from the Lock5withR package. Does diet cola leach calcium out of women’s systems?

  5. Use the CommuteAtlanta and the CommuteStLouis data from the Lock5withR package. Is there a difference in the mean commute time for the two cities?

  6. Use the CompassionateRats data in the Lock5withR package. Here we are interested in comparing the proportion of rats who free a rat buddy. Is there a difference between male rats and female rats in terms of the probability of freeing a fellow trapped rat?

Paired data

In the earlier examples, we had two independent samples. What changes when we have paired data? Forget about randomization for a second. What did we do when we analyzed paired data earlier in the semester?

Let’s consider the anorexia data from the MASS package. Do patients gain weight after treatment? Our null hypothesis is that, on average, patients did not gain weight after treatment (the mean weight gain is zero). Our alternative hypothesis is that they did gain weight on average. When we analyzed the data earlier this semester (using the t-distribution), we first calculated the amount of weight each patient gained. Then we used a one-sample t-test to look at the mean weight gained.

library(MASS)
data(anorexia)
anorexia$Gained <- anorexia$Postwt - anorexia$Prewt  # weight gained by each patient
(teststat <- mean(anorexia$Gained))  # test stat: the mean weight gained
## [1] 2.763889
hist(anorexia$Gained, xlab="weight gained (pounds)")

The histogram shows that more patients gained weight than lost weight. Let’s think about how to do this test using randomization.
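
Before we do, here is the t-based analysis described above, for reference: a one-sample t-test on Gained (equivalently, a paired t-test on Postwt and Prewt). It’s shown only so you can compare its p-value with the randomization result you’ll compute below.

t.test(anorexia$Gained, alternative = "greater")  # classical one-sample t-test on the gains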

If the mean weight gain were truly zero, then each patient’s weight before and after treatment would be interchangeable. For each patient (row), we could swap Prewt and Postwt, which amounts to a sign change in that patient’s entry in the Gained column. We’re going to randomly select half the patients and flip the sign of Gained for those patients.

#set things up
m <- 10^4              # number of iterations in the simulation
sampdist <- rep(0, m)  # we'll store the simulated test stats here
mydat <- anorexia$Gained
n <- length(mydat)     # 72 patients

#simulate the test stat under the null
for(i in 1:m){
    tempdat <- mydat
    grabthese <- sample(1:n, 36)             # randomly pick half of the 72 patients
    tempdat[grabthese] <- -mydat[grabthese]  # flip the sign of Gained for those patients
    sampdist[i] <- mean(tempdat)             # store the simulated test stat
}

Does this sampling distribution look like it should under the null? (Yes, but why?)

hist(sampdist)
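
A quick aside: another common way to randomize paired data is to flip each patient’s sign independently with probability 1/2, instead of flipping exactly half of the patients. Under the null, either scheme produces a distribution centered at zero. The sketch below (sampdist2 is my own name) shows the independent-flip version:

sampdist2 <- rep(0, m)
for(i in 1:m){
    flips <- sample(c(-1, 1), n, replace = TRUE)  # each patient's sign flips with probability 1/2
    sampdist2[i] <- mean(flips * mydat)           # simulated mean gain under the null
}
hist(sampdist2)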

Your turn

  1. Finish testing whether patients gained weight on average. Be sure to create a histogram of the sampling distribution and include the test stat.

  2. Test whether students’ pulses are higher on average during quizzes than during lecture. Use the QuizPulse10 data from the Lock5withR package. Be sure to create a histogram of the sampling distribution and include the test stat.