In the interests of full disclosure, the purpose of this contribution to Rpubs is to present code for a couple of simple tests as implemented in R so they can be compared to Julia code that I will publish in a gist.
One of the easiest types of statistical tests to explain is a randomization test comparing two samples of responses. In a paired design we reduce the two, equally-sized sets of responses to their differences. That is, if we write the two samples as \((x_i,y_i), i=1,\dots,n\) then we consider the differences \(d_i = y_i - x_i, i=1,\dots,n\).
Under the hypothesis \(H_0:\mu_X=\mu_Y\) that the means of the two populations are equal the signs of the individual differences are arbitrary. Under the alternative, say \(H_a:\mu_X < \mu_Y\) we expect the differences to be, on average, positive.
We can compare the sum (or, equivalently, the average) of the observed differences to the set of all possible differences that we could generate under \(H_0\) by flipping the signs of some of the differences. If we have \(k\) differences there will be \(2^k\) possible sums corresponding to changes of sign on some subset of the differences.
The Secchi Disk is used to measure water transparency in lakes or oceans. The measurements are conducted by lowering the disk into the water and recording the depth at which the disk is no longer visible. Such measurements were conducted in 1980 on 22 Wisconsin lakes and in 1990 on the same lakes. If the water clarity (or, conversely, the turbidity) remained constant than the differences in the Secchi depths would be neither systematically positive nor negative.
Secchi <- within(
data.frame(d80=c(2.11,1.79,2.71,1.89,1.69,1.71,2.01,1.36,2.08,1.10,1.29,
2.11,2.47,1.67,1.78,1.68,1.47,1.67,2.31,1.76,1.58,2.55),
d90=c(3.67,1.72,3.46,2.60,2.03,2.10,3.01,1.82,2.64,2.23,1.39,
2.08,2.92,1.90,2.44,2.23,2.43,1.91,3.06,2.26,1.48,2.35)),
diff <- d90 - d80)
str(Secchi)
## 'data.frame': 22 obs. of 3 variables:
## $ d80 : num 2.11 1.79 2.71 1.89 1.69 1.71 2.01 1.36 2.08 1.1 ...
## $ d90 : num 3.67 1.72 3.46 2.6 2.03 2.1 3.01 1.82 2.64 2.23 ...
## $ diff: num 1.56 -0.07 0.75 0.71 0.34 0.39 1 0.46 0.56 1.13 ...
sum(Secchi$diff)
## [1] 10.94
As we are primarily interested in the differences we check the distribution of the differences with an empirical density plot.
It seems that the differences are systematically positive, which is good news because it means that the lakes were, on average, cleaner in 1990 than in 1980.
To conduct the test we want the distribution of the possible sums resulting from arbitrary changes in the signs on the differences. There are \(2^{22}\) or 4.1943 × 106 such sums. Enumerating all of them is not easy in R. Certainly it would be a mistake to try to do so by creating 22 nested loops!
However, there are ways of generating all the possible combinations of -1 and +1, say by using expand.grid
. For 3 differences it would look like
(gg <- expand.grid(c(-1,1),c(-1,1),c(-1,1)))
## Var1 Var2 Var3
## 1 -1 -1 -1
## 2 1 -1 -1
## 3 -1 1 -1
## 4 1 1 -1
## 5 -1 -1 1
## 6 1 -1 1
## 7 -1 1 1
## 8 1 1 1
(mm <- Secchi$diff[1:3] * t(data.matrix(gg)))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## Var1 -1.56 1.56 -1.56 1.56 -1.56 1.56 -1.56 1.56
## Var2 0.07 0.07 -0.07 -0.07 0.07 0.07 -0.07 -0.07
## Var3 -0.75 -0.75 -0.75 -0.75 0.75 0.75 0.75 0.75
colSums(mm)
## [1] -2.24 0.88 -2.38 0.74 -0.74 2.38 -0.88 2.24
For 22 differences, writing the vectors to expand will get rather tedious but we can use the do.call
and lapply
functions to generate the call to expand.grid
. The reference distribution then becomes
system.time(
refd <- colSums(Secchi$diff *
t(data.matrix(
do.call(expand.grid,lapply(1:22, function(i) c(-1,1))
)))))
## user system elapsed
## 13.521 1.391 15.045
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 321024 17.2 597831 32 407500 21.8
## Vcells 4727773 36.1 176325680 1345 199763015 1524.1
The observed reference density is, not coincidentally, very like that of a “normal” or Gaussian distribution.
The p-value for a test of \(H_0:\mu_{80}=\mu_{90}\) versus \(H_a:\mu_{80}<\mu{90}\) is
sum(refd >= sum(Secchi$diff))/length(refd)
## [1] 1.192e-05
A more common approach to working with the reference distribution is to generate a reasonably large random sample from the distribution. Generation of one instance of a sum with randomly allocated signs is often written using a random sample from a binomial distribution with size 1 and probability of success, 0.5
set.seed(1234321)
rbinom(22,1,0.5)
## [1] 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 0 1
but it is somewhat cleaner to simply generate a sample from a uniform (0,1) distribution and compare them to 0.5
set.seed(1234321)
c(-1,1)[1 + (runif(22) > 0.5)]
## [1] 1 -1 1 1 1 -1 1 1 1 -1 -1 1 1 1 1 -1 1 -1 1 -1 -1 1
We can vectorize this calculation to obtain, say, 100,000 samples from the distribution
ns <- 100000L
system.time(refsamp <- colSums(Secchi$diff * matrix(c(-1,1)[1 + (runif(22*ns)>0.5)],nrow=22)))
## user system elapsed
## 0.254 0.000 0.255
with the corresponding density
and p-value
sum(refsamp >= sum(Secchi$diff))/length(refsamp)
## [1] 0
Notice that a sample of size 100,000 may not be large enough to evaluate this very small p-value accurately.
If we have collected responses under two different conditions and the allocation of conditions has been randomized then a test of, say, \(H_0:\mu_1=\mu_2\) versus \(H_a:\mu_1<\mu_2\) can be based upon the set of possible combinations of response values. Under the null hypothesis, the sample mean of the particular combination of responses that we saw for the first condition should be similar to the other possible sample means of combinations of responses of this size. Under the alternative hypothesis the sample mean from the combination of responses we saw should be systematically smaller than the other possible sample means.
In the first chapter of Bob Wardrop’s course notes for Statistics 371 he describes an experiment performed by a student on the consumption of treats by her cat according to whether the treats are chicken or tuna flavored. The experiment lasted for 20 days. The assignment of chicken or tuna on each day was randomized, subject to the constraint that each flavor is provided exactly 10 times.
Each day 10 treats of the indicated flavor were provided and the number consumed was recorded
treats <- c(4,3,5,0,5,4,5,6,1,7,6,3,7,1,3,6,3,5,1,2)
The tuna treats were provided on days
tuna <- c(2,3,4,6,10,12,14,17,19,20)
Thus the number of tuna treats consumed (out of a possible 100) was
sum(treats[tuna])
## [1] 29
The number of chicken treats consumed, also out of a possible 100, was
sum(treats) - sum(treats[tuna])
## [1] 48
We want to compare the observed sum, 29, to the possible sums of any 10 of the 20 days. It is not easy to enumerate the possible sums. I don’t know of a good way in R of stepping through all the
choose(20L,10L)
## [1] 184756
possible selections.
The alternative is to produce a random sample from the reference distribution. About the best way I can think of doing this is
system.time(tunaref <- replicate(100000,sum(sample(treats,10L))))
## user system elapsed
## 0.948 0.000 0.957
with the histogram
and the p-value of
sum(tunaref <= 29)/length(tunaref)
## [1] 0.02776