Repeated Measure Randomization Test

This document below grew out of a conversation on isostat regarding how to implement a randomization test in a repeated measures setting.

Setting: [one] within-subject repeated measure and one between-subject measure

Based on: Based on R code by David Howell, but with a couple of small errors in that code corrected by JAW.

Loading the data

The code below was generated using dput(), a handy way to describe small data sets to make your examples completely self-contained with no external files.

PKUdata <- 
  structure(list(
    Subject = 
      structure(c(2L, 10L, 4L, 8L, 1L, 2L, 10L, 4L, 8L, 1L, 
                  3L, 5L, 6L, 7L, 9L, 3L, 5L, 6L, 7L, 9L), 
                .Label = c("AS", "BR", "DA", "KK", "MB", "MF", "MK", "TK", "TW", "WJ"), 
                class = "factor"), 
    Diet = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
                       2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), 
                     .Label = c("Low", "Normal"), class = "factor"), 
    DietControl = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
                              1L, 1L, 1L, 1L, 1L, 1L, 1L), 
                            .Label = c("Good", "Poor"), class = "factor"), 
    Y = c(87L, 35L, 36L, 72L, 20L, 166L, 27L, 70L, 94L, 52L, 
          50L, 40L, 117L, 97L, 36L, 192L, 158L, 103L, 197L, 150L)), 
    .Names = c("Subject", "Diet", "DietControl", "Y"), 
    class = "data.frame", 
    row.names = c(NA, -20L))

Jeff Witmer’s solution

Jeff Witmer originally posted this code to isostat with a challenge to the community to improve the code.

# this was in JW's original code, but I think it isn't required if we set this
# inside the call to aov()
# orig.options <- options(contrasts = c("contr.sum","contr.poly"))

model1 <- aov(Y~Diet*DietControl+Error(Subject/Diet),data=PKUdata, contrasts = contr.sum)
print(summary(model1))   # Standard repeated measures anova

## 
## Error: Subject
##             Df Sum Sq Mean Sq F value Pr(>F)  
## DietControl  1  11568   11568   5.739 0.0435 *
## Residuals    8  16126    2016                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Subject:Diet
##                  Df Sum Sq Mean Sq F value  Pr(>F)   
## Diet              1  19158   19158   16.24 0.00379 **
## Diet:DietControl  1   4530    4530    3.84 0.08570 . 
## Residuals         8   9436    1180                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

table <- (summary(model1))   # Extracting random F values
DC <- table[[1]]
F.DC.obt <- table[[1]][[1]][[4]][[1]]
dietTable <- table[[2]]
F.diet.obt <- dietTable[[1]][[4]][[1]]
F.interact.obt <- dietTable[[1]][[4]][[2]]
# Now I have stored away the obtained F values

# Beginning of randomization test
## I need to have the file ordered by the within-subjects factor, but respecting the 
## between-subjects levels
nreps <- 10000
F.DC.ran <- numeric(nreps)    # Vectors to store F values
F.diet.ran <- numeric(nreps)
F.interact.ran  <- numeric(nreps)

NoSubj <- 10    #number of subjects
LevelsBW <- 2  #number of levels of between subjects factor(s)
LevelsWI <- 2  #number of levels of within subjects factors(s)
NoSubjPer <- 5  #number of subjects per between subjects group

Diet <- factor(rep(1:LevelsWI, times = NoSubj))
Subject <- factor(rep(1:NoSubj, each = LevelsWI))

nobs <- length(PKUdata$Y)
for (i in 1:nreps) {
  randomdv <- NULL
  for (j in seq(from=1, to=nobs, by=LevelsWI) ) {
    randomdv <- c(randomdv,sample(PKUdata$Y[j:(j+LevelsWI-1)])) } 
  step1 <- rep(1:LevelsBW, each = NoSubjPer)
  step2 <- sample(step1, NoSubj, replace = FALSE)
  step3 <- rep(step2, each = LevelsBW)
  Grp <- factor(step3)      # Randomizing Group variable 
  modelrandom <- aov(randomdv ~ Diet*Grp+Error(Subject/Diet), contrasts = contr.sum)
  
  ##summary(modelrandom)
  
  table2 <- (summary(modelrandom))   # Extracting obtained F values
  DCtable <- table2[[1]]
  F.DC.ran[i] <- table2[[1]][[1]][[4]][[1]]
  dietTableRan <- table2[[2]]
  F.diet.ran[i] <- dietTableRan[[1]][[4]][[1]]
  F.interact.ran[i] <- dietTableRan[[1]][[4]][[2]]
}
probDC <- length(F.DC.ran[F.DC.ran >= F.DC.obt])/nreps
cat("Prob for Dietary Control = ", probDC, "\n")

## Prob for Dietary Control =  0.0408

probDiet <- length(F.diet.ran[F.diet.ran >= F.diet.obt])/nreps
cat("Prob for Diet = ", probDiet, "\n")

## Prob for Diet =  0.0016

probInteract <- length(F.interact.ran[F.interact.ran >= F.interact.obt])/nreps
cat("Prob for Diet by Dietary Control = ", probInteract, "\n")

## Prob for Diet by Dietary Control =  0.0864

Code from Randall Pruim

In two phases, an alternative coding style is developed. In the first phase, the code is modularized, but the guts of the randomization code remain. In the second phase, the randomization code itself is replaced.

Phase 1: Modularized version

Here we separate out the key elements:

generate 1 random data set
compute model from a data set
extract statistics from a model
replicate 3 previous steps many times
summarise the results of these replications

Advantages:

easier to see what is happening
easier to recycle/modify for a new problem
easier to debug (because you can access steps along the way)
things that are done to both original data and to randomized data look the same
easier to make improvements to just one section (and test to see that it still works like previous version)

We also introduce

better use of white space to make code more readable

# set the desired constrasts -- don't forget to revert later
options(contrasts = c("contr.sum","contr.poly"))

We begin by creating a function that extracts from a model the information we are interested in. This can be applied to both the original data and to the randomized data.

# pull the 3 F statistics out of a model object
extractStats <- function(model) {
  smodel <- summary(model)
  c(
    DC       = smodel[[1]][[1]][[4]][[1]],
    diet     = smodel[[2]][[1]][[4]][[1]],
    interact = smodel[[2]][[1]][[4]][[2]]
  )
}

model1 <- 
  aov(Y ~ Diet * DietControl + Error(Subject / Diet), data = PKUdata, contrasts = contr.sum)
observed <- extractStats(model1); observed

##        DC      diet  interact 
##  5.738689 16.241830  3.840490

The code below is essentially a copy and paste from Jeff’s but I’ve placed it in a funciton to make it easier to call. I like to isolate a single randomization into a funciton (often with arguments, but here with none) unless it is a one-liner.

# create a random data set

randomData <- function() {
  NoSubj <- 10   # number of subjects
  LevelsBW <- 2  # number of levels of between subjects factor(s)
  LevelsWI <- 2  # number of levels of within subjects factors(s)
  NoSubjPer <- 5 # number of subjects per between subjects group
  Diet <- factor(rep(1:LevelsWI, times = NoSubj))
  Subject <- factor(rep(1:NoSubj, each = LevelsWI))
  nobs <- length(PKUdata$Y)
  randomdv <- NULL
  for (j in seq(from = 1, to = nobs, by = LevelsWI) ) {
    randomdv <- c(randomdv, sample(PKUdata$Y[j:(j + LevelsWI - 1)])) } 
  step1 <- rep(1:LevelsBW, each = NoSubjPer)
  step2 <- sample(step1, NoSubj, replace = FALSE)
  step3 <- rep(step2, each = LevelsBW)
  Grp <- factor(step3)      # Randomizing Group variable 
  data.frame(Subject = Subject, Diet = Diet, Grp = Grp, Y = randomdv)
}

# construct a model from a random data set 
#   (this could be avoided since it is a one-liner)

There isn’t a good reason to use different variable names (e.g. Grp) in the randomized data, and we’ll undo that in the next phase. For now, we’ll just wrap this up into a function that generates a models from random data.

randomModel <- function() {
  aov(Y ~ Diet * Grp + Error(Subject / Diet), data = randomData(), contrasts = contr.sum)
}

We can test these components to see if they are behaving sanely.

head(randomModel(), 4)

## 
## Grand Mean: 89.95
## 
## Stratum 1: Subject
## 
## Terms:
##                      Grp Residuals
## Sum of Squares    884.45  30435.00
## Deg. of Freedom        1         8
## 
## Residual standard error: 61.67962
## Estimated effects are balanced
## 
## Stratum 2: Subject:Diet
## 
## Terms:
##                     Diet Diet:Grp Residuals
## Sum of Squares   5281.25  2311.25  21907.00
## Deg. of Freedom        1        1         8
## 
## Residual standard error: 52.32948
## Estimated effects are balanced

extractStats(randomModel())

##         DC       diet   interact 
## 1.98667145 0.15292429 0.03064034

extractStats(randomModel())

##          DC        diet    interact 
## 0.006761896 0.653591221 0.007765993

The simulations

Below we use mosaic::do() for the iteration, but you could preplace that with lapply() or replictate() if you prefer (at the cost of having to do some additional data wrangling afterwards).

require(mosaic)

## Loading required package: mosaic

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: lattice

## Loading required package: ggplot2

## Loading required package: mosaicData

## Loading required package: Matrix

## 
## The 'mosaic' package masks several functions from core packages in order to add additional features.  
## The original behavior of these functions should not be affected by this.

## 
## Attaching package: 'mosaic'

## The following object is masked from 'package:Matrix':
## 
##     mean

## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally

## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cov, D, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var

## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

# simulate many random data sets and extract their F statistics
# do() could be replaced by replicate(), lapply(), etc, if desired

SimStats <- do(1000) * extractStats(randomModel())

head(SimStats)  # to show the structure of the resulting data object

##          DC       diet   interact
## 1 5.1399713 0.22707467 0.07919245
## 2 0.4623719 1.55560986 0.05671367
## 3 0.2324823 0.41299465 6.22931738
## 4 3.4174345 0.03571404 0.06537276
## 5 0.3727853 0.30025599 0.61016258
## 6 6.8248979 0.36872764 0.02391232

# Summarise permuted data

SimStats %>%
  summarise(
    probDC   = prop(DC >= observed["DC"]),    # mosaic::prop() computes proportion of TRUE
    probDiet = prop(diet >= observed["diet"]),
    probInt  = prop(interact >= observed["interact"])
  )

##   probDC probDiet probInt
## 1   0.04    0.002   0.086

Phase 2: Improved generation of random data

The code for generating random data above is pretty opaque and brittle. In particular, rather than relying on the data set itself, it relies on an external description of the data which is hard-coded into the routine. Furthermore, it takes some careful sleuthing to determine just what randomization is being performed. Let’s see if we can make that clearer (and more flexible).

We begin by introducting a helper function that samples “block-wise”.

# x is a vector
# block is a vector of equal length
# x[i] and x[j] should match if block[i] == block[j]
# the result is a vector where x values have been permuted "block-wise"

blockSample <- function(x, blocks) {
  id <- as.numeric(factor(blocks))
  blocks <- aggregate(x, by = list(blocks), FUN = function(x) head(x, 1))$x
  sample(blocks)[id]
}

One might argue that blockSample() is not much simpler than the portion of the original code using step1, step2, step3, and it involves the use of aggregate(), which may be unfamiliar to beginners.
But it works much more generally (the number of observations per subject would not need to be the same for each subject) and isolates a useful operation into a separate, reusable function, which may be treated as a black box in other simulations.

In any case, now we can make it clearer what is going on in the simulated data.

We are shuffling the response values for each subject (taking advantage of the groups argument in mosaic::shuffle()), and
We are assigning each subject randomly to one of the DietControl groups.

Also, the simulated data has the same variable names and value types as the original data which makes side-by-side comparisons simpler and avoids the need to change variable names when fitting the model to the simulated data.

(This is still brittle with respect to the variable names used, and one could imagine fancier versions that avoid this.)

randomData2 <- function(data) {
  data %>% 
    mutate(
      Y = mosaic::shuffle(Y, groups = Subject),
      DietControl = blockSample(DietControl, blocks = Subject)
      )
}
randomData2(PKUdata) %>% arrange(Subject, Diet)

##    Subject   Diet DietControl   Y
## 1       AS    Low        Poor  52
## 2       AS Normal        Poor  20
## 3       BR    Low        Good 166
## 4       BR Normal        Good  87
## 5       DA    Low        Poor 192
## 6       DA Normal        Poor  50
## 7       KK    Low        Poor  36
## 8       KK Normal        Poor  70
## 9       MB    Low        Good  40
## 10      MB Normal        Good 158
## 11      MF    Low        Poor 117
## 12      MF Normal        Poor 103
## 13      MK    Low        Good 197
## 14      MK Normal        Good  97
## 15      TK    Low        Good  94
## 16      TK Normal        Good  72
## 17      TW    Low        Poor  36
## 18      TW Normal        Poor 150
## 19      WJ    Low        Good  35
## 20      WJ Normal        Good  27

The new simulation code is identical except for the use of our new data creator:

randomModel2 <- function() {
  aov(Y ~ Diet * DietControl + Error(Subject / Diet), data = randomData2(PKUdata), 
      contrasts = contr.sum)
} 

# simulate many random data sets and extract their F statistics
# do() could be replaced by replicate(), lapply(), etc, if desired

SimStats <- do(1000) * extractStats(randomModel2())

# Summarise permuted data

SimStats %>%
  summarise(
    probDC   = prop(DC >= observed["DC"]),
    probDiet = prop(diet >= observed["diet"]),
    probInt  = prop(interact >= observed["interact"])
  )

##   probDC probDiet probInt
## 1  0.055    0.005   0.099

# don't need to revert if we don't set the options
# options(orig.options)  # revert to previous options

Jeff Witmers’s repeated measures randomization

R Pruim