The code for this document is available, and you may want to run it on your own computer. Download and install R (or use RStudio.Cloud), install the package “dplyr”, create a project folder, place the .Rmd file (the code file) in that folder, and run it. The “Environment” pane then shows the full data sets; the illustrations below include only a few rows each time.

Randomization of Study Subjects

Suppose you have recruited N study subjects, and you have information about them on multiple attributes (A1, A2, etc.), such as Gender, Ethnicity, etc. Place these N subjects into a table or data set where each row has a unique identifier for the subject, followed by the attribute values.

To make this concrete, consider an example. Suppose we are interested in 3 attributes: age (“child”, “youth”, “adult”); style (1, 2); and origin (“California”, “US”, “International”). We declare the set of possible values for each of these 3 attributes.

age.set <- c("child", "youth", "adult")
style.set <- c(1,2)
origin.set <- c("California", "US", "International")

Example Data

Now let us generate an example data set. Suppose there are 100 subjects, identified as S1, S2, … S100. We generate their attribute values randomly with the “sample” function in R, purely to create a data set for this illustration. In your own project, you would supply this data set: your list of subjects along with the attributes and values you collected about them.

N <- 100
ids <- paste("S", 1:N, sep="") # use N so the ids always match the number of subjects

age <- age.set[sample(length(age.set), N, replace=TRUE)]

style <- style.set[sample(length(style.set), N, replace=TRUE)]

origin <- origin.set[sample(length(origin.set), N, replace=TRUE)]

df <- data.frame(ids, age, style, origin)
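
Because the attribute values are drawn at random, every run of the code above produces a different data set, and the code as shown sets no seed. A minimal sketch, assuming you want the same example data on every run, is to fix the random seed before sampling. The helper name make_subjects and the seed value 42 are arbitrary choices made for this illustration and are not part of the original code; a seeded run will not reproduce the exact numbers printed in this document.

```r
# Hypothetical helper (not in the original code): builds the example
# data set after fixing the random seed, so repeated calls agree.
make_subjects <- function(N, seed = 42) {
  set.seed(seed)  # the seed value is an arbitrary choice
  age.set    <- c("child", "youth", "adult")
  style.set  <- c(1, 2)
  origin.set <- c("California", "US", "International")
  data.frame(
    ids    = paste0("S", seq_len(N)),
    age    = sample(age.set, N, replace = TRUE),
    style  = sample(style.set, N, replace = TRUE),
    origin = sample(origin.set, N, replace = TRUE)
  )
}

df_a <- make_subjects(100)
df_b <- make_subjects(100)
identical(df_a, df_b)  # same seed, so both calls give the same data
```

Changing the seed gives a different, but again repeatable, data set.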

Let us look at a sample of the subject data.

print(head(df))
##   ids   age style        origin
## 1  S1 youth     1            US
## 2  S2 youth     1            US
## 3  S3 child     1    California
## 4  S4 child     1            US
## 5  S5 child     1 International
## 6  S6 youth     2    California
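
When you supply your own subject table instead of generated data, it may help to confirm that it has the shape described earlier: one row per subject, a unique identifier, and attribute values drawn from the declared sets. The following is a sketch under those assumptions; check_subjects and the allowed list are hypothetical names introduced here, and the small toy table simply copies the first three rows printed above.

```r
# Hypothetical helper (not in the original code): stops with an error
# if ids repeat or an attribute holds a value outside its allowed set.
check_subjects <- function(df, allowed) {
  stopifnot(anyDuplicated(df$ids) == 0)  # ids must be unique
  for (col in names(allowed)) {
    ok <- df[[col]] %in% allowed[[col]]
    if (!all(ok)) stop("unexpected value in column '", col, "'")
  }
  invisible(TRUE)
}

allowed <- list(
  age    = c("child", "youth", "adult"),
  style  = c(1, 2),
  origin = c("California", "US", "International")
)

# First three rows of the example data shown above
toy <- data.frame(
  ids    = c("S1", "S2", "S3"),
  age    = c("youth", "youth", "child"),
  style  = c(1, 1, 1),
  origin = c("US", "US", "California")
)

check_subjects(toy, allowed)  # passes silently when the table is valid
```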

Let us see what the breakdown of subject numbers is along each dimension (i.e., how many subjects have each of the possible attributes).

library(dplyr) # provides %>% and count()
print(df %>% count(age))
##     age  n
## 1 adult 35
## 2 child 31
## 3 youth 34
print(df %>% count(style))
##   style  n
## 1     1 56
## 2     2 44
print(df %>% count(origin))
##          origin  n
## 1    California 32
## 2 International 36
## 3            US 32

Next we want to assign these subjects to (say) m=2 groups. Ideally, each group would contain the same number of subjects for each attribute value.

Simple Random Allocation

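Suppose first that we assign subjects to two groups, G1 and G2, purely at random. The code below draws N/2 subject ids at random (without replacement) for G1 and places the remaining subjects in G2, then adds the result to the data as a new column “group”. With this approach the two groups have exactly equal sizes. An alternative is to sample each subject’s group independently (with replacement) from the two labels; in that case the groups need not have exactly equal sizes.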

gp1 <- data.frame(ids= sample(df$ids, N/2), group = "G1") # randomly assign half to group 1
gp2 <- data.frame(ids= df$ids[! df$ids %in% gp1$ids], group = "G2") # rest to group 2

gps <- rbind(gp1, gp2) # combine the two vertically

df <- merge(df, gps, by="ids") # attach the group labels; merge() sorts the rows by ids

print(head(df)) 
##    ids   age style        origin group
## 1   S1 youth     1            US    G2
## 2  S10 adult     1            US    G2
## 3 S100 child     1 International    G1
## 4  S11 child     1    California    G1
## 5  S12 adult     1            US    G1
## 6  S13 youth     1    California    G1
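
Sampling each subject’s group label independently (with replacement) from the two labels, as mentioned above, can be sketched as follows. The names group_alt and alt are introduced only for this illustration; the snippet rebuilds the ids itself so that it runs on its own.

```r
N <- 100
ids <- paste0("S", seq_len(N))

# Each subject gets "G1" or "G2" independently with probability 1/2,
# so the two group sizes are random and need not both equal N/2.
group_alt <- sample(c("G1", "G2"), N, replace = TRUE)
alt <- data.frame(ids, group = group_alt)

table(alt$group)  # group sizes; these vary from run to run
```

The half-and-half split used in the main code trades this independence for guaranteed equal group sizes.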

Let us examine the distribution of attribute values within each group.

print(with(df, table(group, age)) )
##      age
## group adult child youth
##    G1    20    15    15
##    G2    15    16    19
print(with(df, table(group, style)) )
##      style
## group  1  2
##    G1 34 16
##    G2 22 28
print(with(df, table(group, origin)) )
##      origin
## group California International US
##    G1         18            17 15
##    G2         14            19 17
# print(df %>% count(group, age))

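So, you can see that the attribute distributions are not perfectly balanced across the two groups. Age and origin differ by at most 5 subjects for any value, but style is noticeably uneven: G1 has 34 subjects with style 1 against 22 in G2. Simple randomization might still be an acceptable compromise for now.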
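
One simple way to put a number on the balance is the largest absolute difference in counts between the two groups for each attribute. The sketch below is only one possible summary, not a formal test; max_gap is a hypothetical helper name, and the counts are copied by hand from the tables printed above so that the snippet runs on its own.

```r
# Counts copied from the group-by-attribute tables printed above
age_tab <- matrix(c(20, 15, 15,
                    15, 16, 19),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(group = c("G1", "G2"),
                                  age = c("adult", "child", "youth")))
style_tab <- matrix(c(34, 16,
                      22, 28),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(group = c("G1", "G2"),
                                    style = c("1", "2")))
origin_tab <- matrix(c(18, 17, 15,
                       14, 19, 17),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(group = c("G1", "G2"),
                                     origin = c("California", "International", "US")))

# Hypothetical helper: largest absolute G1-vs-G2 difference in counts
max_gap <- function(tab) max(abs(tab["G1", ] - tab["G2", ]))

gaps <- sapply(list(age = age_tab, style = style_tab, origin = origin_tab),
               max_gap)
print(gaps)  # age 5, style 12, origin 4
```

On the live data the same helper can be applied directly to a table, for example max_gap(with(df, table(group, age))).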