This document below grew out of a conversation on isostat regarding how to implement a randomization test in a repeated measures setting.
Setting: [one] within-subject repeated measure and one between-subject measure
Based on: Based on R code by David Howell, but with a couple of small errors in that code corrected by JAW.
The code below was generated using dput(), a handy way to describe small data sets to make your examples completely self-contained with no external files.
PKUdata <-
structure(list(
Subject =
structure(c(2L, 10L, 4L, 8L, 1L, 2L, 10L, 4L, 8L, 1L,
3L, 5L, 6L, 7L, 9L, 3L, 5L, 6L, 7L, 9L),
.Label = c("AS", "BR", "DA", "KK", "MB", "MF", "MK", "TK", "TW", "WJ"),
class = "factor"),
Diet = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L),
.Label = c("Low", "Normal"), class = "factor"),
DietControl = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = c("Good", "Poor"), class = "factor"),
Y = c(87L, 35L, 36L, 72L, 20L, 166L, 27L, 70L, 94L, 52L,
50L, 40L, 117L, 97L, 36L, 192L, 158L, 103L, 197L, 150L)),
.Names = c("Subject", "Diet", "DietControl", "Y"),
class = "data.frame",
row.names = c(NA, -20L))
Jeff Witmer originally posted this code to isostat with a challenge to the community to improve the code.
# this was in JW's original code, but I think it isn't required if we set this
# inside the call to aov()
# orig.options <- options(contrasts = c("contr.sum","contr.poly"))
model1 <- aov(Y~Diet*DietControl+Error(Subject/Diet),data=PKUdata, contrasts = contr.sum)
print(summary(model1)) # Standard repeated measures anova
##
## Error: Subject
## Df Sum Sq Mean Sq F value Pr(>F)
## DietControl 1 11568 11568 5.739 0.0435 *
## Residuals 8 16126 2016
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Subject:Diet
## Df Sum Sq Mean Sq F value Pr(>F)
## Diet 1 19158 19158 16.24 0.00379 **
## Diet:DietControl 1 4530 4530 3.84 0.08570 .
## Residuals 8 9436 1180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table <- (summary(model1)) # Extracting random F values
DC <- table[[1]]
F.DC.obt <- table[[1]][[1]][[4]][[1]]
dietTable <- table[[2]]
F.diet.obt <- dietTable[[1]][[4]][[1]]
F.interact.obt <- dietTable[[1]][[4]][[2]]
# Now I have stored away the obtained F values
# Beginning of randomization test
## I need to have the file ordered by the within-subjects factor, but respecting the
## between-subjects levels
nreps <- 10000
F.DC.ran <- numeric(nreps) # Vectors to store F values
F.diet.ran <- numeric(nreps)
F.interact.ran <- numeric(nreps)
NoSubj <- 10 #number of subjects
LevelsBW <- 2 #number of levels of between subjects factor(s)
LevelsWI <- 2 #number of levels of within subjects factors(s)
NoSubjPer <- 5 #number of subjects per between subjects group
Diet <- factor(rep(1:LevelsWI, times = NoSubj))
Subject <- factor(rep(1:NoSubj, each = LevelsWI))
nobs <- length(PKUdata$Y)
for (i in 1:nreps) {
randomdv <- NULL
for (j in seq(from=1, to=nobs, by=LevelsWI) ) {
randomdv <- c(randomdv,sample(PKUdata$Y[j:(j+LevelsWI-1)])) }
step1 <- rep(1:LevelsBW, each = NoSubjPer)
step2 <- sample(step1, NoSubj, replace = FALSE)
step3 <- rep(step2, each = LevelsBW)
Grp <- factor(step3) # Randomizing Group variable
modelrandom <- aov(randomdv ~ Diet*Grp+Error(Subject/Diet), contrasts = contr.sum)
##summary(modelrandom)
table2 <- (summary(modelrandom)) # Extracting obtained F values
DCtable <- table2[[1]]
F.DC.ran[i] <- table2[[1]][[1]][[4]][[1]]
dietTableRan <- table2[[2]]
F.diet.ran[i] <- dietTableRan[[1]][[4]][[1]]
F.interact.ran[i] <- dietTableRan[[1]][[4]][[2]]
}
probDC <- length(F.DC.ran[F.DC.ran >= F.DC.obt])/nreps
cat("Prob for Dietary Control = ", probDC, "\n")
## Prob for Dietary Control = 0.0408
probDiet <- length(F.diet.ran[F.diet.ran >= F.diet.obt])/nreps
cat("Prob for Diet = ", probDiet, "\n")
## Prob for Diet = 0.0016
probInteract <- length(F.interact.ran[F.interact.ran >= F.interact.obt])/nreps
cat("Prob for Diet by Dietary Control = ", probInteract, "\n")
## Prob for Diet by Dietary Control = 0.0864
In two phases, an alternative coding style is developed. In the first phase, the code is modularized, but the guts of the randomization code remain. In the second phase, the randomization code itself is replaced.
Here we separate out the key elements:
Advantages:
We also introduce
# set the desired constrasts -- don't forget to revert later
options(contrasts = c("contr.sum","contr.poly"))
We begin by creating a function that extracts from a model the information we are interested in. This can be applied to both the original data and to the randomized data.
# pull the 3 F statistics out of a model object
extractStats <- function(model) {
smodel <- summary(model)
c(
DC = smodel[[1]][[1]][[4]][[1]],
diet = smodel[[2]][[1]][[4]][[1]],
interact = smodel[[2]][[1]][[4]][[2]]
)
}
model1 <-
aov(Y ~ Diet * DietControl + Error(Subject / Diet), data = PKUdata, contrasts = contr.sum)
observed <- extractStats(model1); observed
## DC diet interact
## 5.738689 16.241830 3.840490
The code below is essentially a copy and paste from Jeff’s but I’ve placed it in a funciton to make it easier to call. I like to isolate a single randomization into a funciton (often with arguments, but here with none) unless it is a one-liner.
# create a random data set
randomData <- function() {
NoSubj <- 10 # number of subjects
LevelsBW <- 2 # number of levels of between subjects factor(s)
LevelsWI <- 2 # number of levels of within subjects factors(s)
NoSubjPer <- 5 # number of subjects per between subjects group
Diet <- factor(rep(1:LevelsWI, times = NoSubj))
Subject <- factor(rep(1:NoSubj, each = LevelsWI))
nobs <- length(PKUdata$Y)
randomdv <- NULL
for (j in seq(from = 1, to = nobs, by = LevelsWI) ) {
randomdv <- c(randomdv, sample(PKUdata$Y[j:(j + LevelsWI - 1)])) }
step1 <- rep(1:LevelsBW, each = NoSubjPer)
step2 <- sample(step1, NoSubj, replace = FALSE)
step3 <- rep(step2, each = LevelsBW)
Grp <- factor(step3) # Randomizing Group variable
data.frame(Subject = Subject, Diet = Diet, Grp = Grp, Y = randomdv)
}
# construct a model from a random data set
# (this could be avoided since it is a one-liner)
There isn’t a good reason to use different variable names (e.g. Grp) in the randomized data, and we’ll undo that in the next phase. For now, we’ll just wrap this up into a function that generates a models from random data.
randomModel <- function() {
aov(Y ~ Diet * Grp + Error(Subject / Diet), data = randomData(), contrasts = contr.sum)
}
We can test these components to see if they are behaving sanely.
head(randomModel(), 4)
##
## Grand Mean: 89.95
##
## Stratum 1: Subject
##
## Terms:
## Grp Residuals
## Sum of Squares 884.45 30435.00
## Deg. of Freedom 1 8
##
## Residual standard error: 61.67962
## Estimated effects are balanced
##
## Stratum 2: Subject:Diet
##
## Terms:
## Diet Diet:Grp Residuals
## Sum of Squares 5281.25 2311.25 21907.00
## Deg. of Freedom 1 1 8
##
## Residual standard error: 52.32948
## Estimated effects are balanced
extractStats(randomModel())
## DC diet interact
## 1.98667145 0.15292429 0.03064034
extractStats(randomModel())
## DC diet interact
## 0.006761896 0.653591221 0.007765993
Below we use mosaic::do() for the iteration, but you could preplace that with lapply() or replictate() if you prefer (at the cost of having to do some additional data wrangling afterwards).
require(mosaic)
## Loading required package: mosaic
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: mosaicData
## Loading required package: Matrix
##
## The 'mosaic' package masks several functions from core packages in order to add additional features.
## The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cov, D, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
# simulate many random data sets and extract their F statistics
# do() could be replaced by replicate(), lapply(), etc, if desired
SimStats <- do(1000) * extractStats(randomModel())
head(SimStats) # to show the structure of the resulting data object
## DC diet interact
## 1 5.1399713 0.22707467 0.07919245
## 2 0.4623719 1.55560986 0.05671367
## 3 0.2324823 0.41299465 6.22931738
## 4 3.4174345 0.03571404 0.06537276
## 5 0.3727853 0.30025599 0.61016258
## 6 6.8248979 0.36872764 0.02391232
# Summarise permuted data
SimStats %>%
summarise(
probDC = prop(DC >= observed["DC"]), # mosaic::prop() computes proportion of TRUE
probDiet = prop(diet >= observed["diet"]),
probInt = prop(interact >= observed["interact"])
)
## probDC probDiet probInt
## 1 0.04 0.002 0.086
The code for generating random data above is pretty opaque and brittle. In particular, rather than relying on the data set itself, it relies on an external description of the data which is hard-coded into the routine. Furthermore, it takes some careful sleuthing to determine just what randomization is being performed. Let’s see if we can make that clearer (and more flexible).
We begin by introducting a helper function that samples “block-wise”.
# x is a vector
# block is a vector of equal length
# x[i] and x[j] should match if block[i] == block[j]
# the result is a vector where x values have been permuted "block-wise"
blockSample <- function(x, blocks) {
id <- as.numeric(factor(blocks))
blocks <- aggregate(x, by = list(blocks), FUN = function(x) head(x, 1))$x
sample(blocks)[id]
}
One might argue that blockSample() is not much simpler than the portion of the original code using step1, step2, step3, and it involves the use of aggregate(), which may be unfamiliar to beginners.
But it works much more generally (the number of observations per subject would not need to be the same for each subject) and isolates a useful operation into a separate, reusable function, which may be treated as a black box in other simulations.
In any case, now we can make it clearer what is going on in the simulated data.
We are shuffling the response values for each subject (taking advantage of the groups argument in mosaic::shuffle()), and
We are assigning each subject randomly to one of the DietControl groups.
Also, the simulated data has the same variable names and value types as the original data which makes side-by-side comparisons simpler and avoids the need to change variable names when fitting the model to the simulated data.
(This is still brittle with respect to the variable names used, and one could imagine fancier versions that avoid this.)
randomData2 <- function(data) {
data %>%
mutate(
Y = mosaic::shuffle(Y, groups = Subject),
DietControl = blockSample(DietControl, blocks = Subject)
)
}
randomData2(PKUdata) %>% arrange(Subject, Diet)
## Subject Diet DietControl Y
## 1 AS Low Poor 52
## 2 AS Normal Poor 20
## 3 BR Low Good 166
## 4 BR Normal Good 87
## 5 DA Low Poor 192
## 6 DA Normal Poor 50
## 7 KK Low Poor 36
## 8 KK Normal Poor 70
## 9 MB Low Good 40
## 10 MB Normal Good 158
## 11 MF Low Poor 117
## 12 MF Normal Poor 103
## 13 MK Low Good 197
## 14 MK Normal Good 97
## 15 TK Low Good 94
## 16 TK Normal Good 72
## 17 TW Low Poor 36
## 18 TW Normal Poor 150
## 19 WJ Low Good 35
## 20 WJ Normal Good 27
The new simulation code is identical except for the use of our new data creator:
randomModel2 <- function() {
aov(Y ~ Diet * DietControl + Error(Subject / Diet), data = randomData2(PKUdata),
contrasts = contr.sum)
}
# simulate many random data sets and extract their F statistics
# do() could be replaced by replicate(), lapply(), etc, if desired
SimStats <- do(1000) * extractStats(randomModel2())
# Summarise permuted data
SimStats %>%
summarise(
probDC = prop(DC >= observed["DC"]),
probDiet = prop(diet >= observed["diet"]),
probInt = prop(interact >= observed["interact"])
)
## probDC probDiet probInt
## 1 0.055 0.005 0.099
# don't need to revert if we don't set the options
# options(orig.options) # revert to previous options