Purpose

In the interests of full disclosure, the purpose of this contribution to Rpubs is to present code for a couple of simple tests as implemented in R so they can be compared to Julia code that I will publish in a gist.

A randomization test in paired designs

One of the easiest types of statistical tests to explain is a randomization test comparing two samples of responses. In a paired design we reduce the two, equally-sized sets of responses to their differences. That is, if we write the two samples as \((x_i,y_i), i=1,\dots,n\) then we consider the differences \(d_i = y_i - x_i, i=1,\dots,n\).

Under the hypothesis \(H_0:\mu_X=\mu_Y\) that the means of the two populations are equal the signs of the individual differences are arbitrary. Under the alternative, say \(H_a:\mu_X < \mu_Y\) we expect the differences to be, on average, positive.

We can compare the sum (or, equivalently, the average) of the observed differences to the set of all possible differences that we could generate under \(H_0\) by flipping the signs of some of the differences. If we have \(k\) differences there will be \(2^k\) possible sums corresponding to changes of sign on some subset of the differences.

Sample data from a paired design

The Secchi Disk is used to measure water transparency in lakes or oceans. The measurements are conducted by lowering the disk into the water and recording the depth at which the disk is no longer visible. Such measurements were conducted in 1980 on 22 Wisconsin lakes and in 1990 on the same lakes. If the water clarity (or, conversely, the turbidity) remained constant than the differences in the Secchi depths would be neither systematically positive nor negative.

Secchi <- within(
    data.frame(d80=c(2.11,1.79,2.71,1.89,1.69,1.71,2.01,1.36,2.08,1.10,1.29,
                     2.11,2.47,1.67,1.78,1.68,1.47,1.67,2.31,1.76,1.58,2.55),
               d90=c(3.67,1.72,3.46,2.60,2.03,2.10,3.01,1.82,2.64,2.23,1.39,
                     2.08,2.92,1.90,2.44,2.23,2.43,1.91,3.06,2.26,1.48,2.35)),
    diff <- d90 - d80)
str(Secchi)
## 'data.frame':    22 obs. of  3 variables:
##  $ d80 : num  2.11 1.79 2.71 1.89 1.69 1.71 2.01 1.36 2.08 1.1 ...
##  $ d90 : num  3.67 1.72 3.46 2.6 2.03 2.1 3.01 1.82 2.64 2.23 ...
##  $ diff: num  1.56 -0.07 0.75 0.71 0.34 0.39 1 0.46 0.56 1.13 ...
sum(Secchi$diff)
## [1] 10.94

As we are primarily interested in the differences we check the distribution of the differences with an empirical density plot. plot of chunk dotplot

It seems that the differences are systematically positive, which is good news because it means that the lakes were, on average, cleaner in 1990 than in 1980.

Reference distribution of signed sums

To conduct the test we want the distribution of the possible sums resulting from arbitrary changes in the signs on the differences. There are \(2^{22}\) or 4.1943 × 106 such sums. Enumerating all of them is not easy in R. Certainly it would be a mistake to try to do so by creating 22 nested loops!

However, there are ways of generating all the possible combinations of -1 and +1, say by using expand.grid. For 3 differences it would look like

(gg <- expand.grid(c(-1,1),c(-1,1),c(-1,1)))
##   Var1 Var2 Var3
## 1   -1   -1   -1
## 2    1   -1   -1
## 3   -1    1   -1
## 4    1    1   -1
## 5   -1   -1    1
## 6    1   -1    1
## 7   -1    1    1
## 8    1    1    1
(mm <- Secchi$diff[1:3] * t(data.matrix(gg)))
##       [,1]  [,2]  [,3]  [,4]  [,5] [,6]  [,7]  [,8]
## Var1 -1.56  1.56 -1.56  1.56 -1.56 1.56 -1.56  1.56
## Var2  0.07  0.07 -0.07 -0.07  0.07 0.07 -0.07 -0.07
## Var3 -0.75 -0.75 -0.75 -0.75  0.75 0.75  0.75  0.75
colSums(mm)
## [1] -2.24  0.88 -2.38  0.74 -0.74  2.38 -0.88  2.24

For 22 differences, writing the vectors to expand will get rather tedious but we can use the do.call and lapply functions to generate the call to expand.grid. The reference distribution then becomes

system.time(
    refd <- colSums(Secchi$diff * 
                        t(data.matrix(
                            do.call(expand.grid,lapply(1:22, function(i) c(-1,1))
                                    )))))
##    user  system elapsed 
##  13.521   1.391  15.045
gc()
##           used (Mb) gc trigger (Mb)  max used   (Mb)
## Ncells  321024 17.2     597831   32    407500   21.8
## Vcells 4727773 36.1  176325680 1345 199763015 1524.1

The observed reference density is, not coincidentally, very like that of a “normal” or Gaussian distribution. plot of chunk refdens

The p-value for a test of \(H_0:\mu_{80}=\mu_{90}\) versus \(H_a:\mu_{80}<\mu{90}\) is

sum(refd >= sum(Secchi$diff))/length(refd)
## [1] 1.192e-05

Sampling from the reference distribution

A more common approach to working with the reference distribution is to generate a reasonably large random sample from the distribution. Generation of one instance of a sum with randomly allocated signs is often written using a random sample from a binomial distribution with size 1 and probability of success, 0.5

set.seed(1234321)
rbinom(22,1,0.5)
##  [1] 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 0 1

but it is somewhat cleaner to simply generate a sample from a uniform (0,1) distribution and compare them to 0.5

set.seed(1234321)
c(-1,1)[1 + (runif(22) > 0.5)]
##  [1]  1 -1  1  1  1 -1  1  1  1 -1 -1  1  1  1  1 -1  1 -1  1 -1 -1  1

We can vectorize this calculation to obtain, say, 100,000 samples from the distribution

ns <- 100000L
system.time(refsamp <- colSums(Secchi$diff * matrix(c(-1,1)[1 + (runif(22*ns)>0.5)],nrow=22)))
##    user  system elapsed 
##   0.254   0.000   0.255

with the corresponding density plot of chunk refdens1

and p-value

sum(refsamp >= sum(Secchi$diff))/length(refsamp)
## [1] 0

Notice that a sample of size 100,000 may not be large enough to evaluate this very small p-value accurately.

Randomization tests for independent samples

If we have collected responses under two different conditions and the allocation of conditions has been randomized then a test of, say, \(H_0:\mu_1=\mu_2\) versus \(H_a:\mu_1<\mu_2\) can be based upon the set of possible combinations of response values. Under the null hypothesis, the sample mean of the particular combination of responses that we saw for the first condition should be similar to the other possible sample means of combinations of responses of this size. Under the alternative hypothesis the sample mean from the combination of responses we saw should be systematically smaller than the other possible sample means.

In the first chapter of Bob Wardrop’s course notes for Statistics 371 he describes an experiment performed by a student on the consumption of treats by her cat according to whether the treats are chicken or tuna flavored. The experiment lasted for 20 days. The assignment of chicken or tuna on each day was randomized, subject to the constraint that each flavor is provided exactly 10 times.

Each day 10 treats of the indicated flavor were provided and the number consumed was recorded

treats <- c(4,3,5,0,5,4,5,6,1,7,6,3,7,1,3,6,3,5,1,2)

The tuna treats were provided on days

tuna <- c(2,3,4,6,10,12,14,17,19,20)

Thus the number of tuna treats consumed (out of a possible 100) was

sum(treats[tuna])
## [1] 29

The number of chicken treats consumed, also out of a possible 100, was

sum(treats) - sum(treats[tuna])
## [1] 48

The test

We want to compare the observed sum, 29, to the possible sums of any 10 of the 20 days. It is not easy to enumerate the possible sums. I don’t know of a good way in R of stepping through all the

choose(20L,10L)
## [1] 184756

possible selections.

The alternative is to produce a random sample from the reference distribution. About the best way I can think of doing this is

system.time(tunaref <- replicate(100000,sum(sample(treats,10L))))
##    user  system elapsed 
##   0.948   0.000   0.957

with the histogram plot of chunk hist

and the p-value of

sum(tunaref <= 29)/length(tunaref)
## [1] 0.02776