Designing Experiments and Analyzing Data: A Model Comparison Perspective (3rd edition) by Maxwell, Delaney, & Kelley

Information about the book is available at https://designingexperiments.com

R Code and Instructions to Accompany Chapter 1

Here, we use the AMCP package that accompanies the book (Version 0.0.4 or greater) to load the data into the working directory with the data() function. Note that the data sets are organized using the chapter number and table number, each separated with an underscore (e.g., chapter_1_table_1; chapter_11_table_20, chapter_13_table_5, etc.). Similarly, for end-of-chapter exercises, the organization is such that “table” is replaced with “exercise” (e.g., chapter_1_exercise_21; chapter_3_table_9, chapter_16_exercise_5, etc.). Additionally, abbreviated names can be used where “C” replaces “chapter” and “T” or “E” replaces “table” or “exercise,” respectively. For example, C1T1 can be used to reference chapter_1_table_1 or C1E21 can be used to reference chapter_1_exercise_21, et cetera. There are a few special cases as well, such as tutorials, raw data (e.g., when summary information is provided in the book), and univariate format (long) vs. multivariate format (wide) is used.

For a list of data sets use data() with package="AMCP" specified, specifically as data(package="AMCP"). All of the details needed for each data set can be found via data(package="AMCP") Note also that the data sets are available as comma space delimited (CSV) files (as well as SPSS and SAS) on the DesigningExperiments.Com web site, which is our web site that accompanies the book.

To begin, we install the AMCP package (this only needs to happen once per system unless you update R or you want to install a newer version of AMCP (if applicable) into the R session using the library() function. The install.packages() function can be used as shown below. Note that you may need to select a mirror (select one near your location) to download the package. Note that number sign (also called a hash character or pound sign) is how to denote a comment within R code, as R does not reed what is to the right of a number sign.

# Only necessary once on current system for the current version of a package. Uncomment to install. 
# install.packages("AMCP")

Next, we load the AMCP package into the current R session using the library() function. The AMCP package contains all of the data from the book.

library(AMCP)

This code depends on version 0.0.4 of AMCP. Ensure that version (or newer) is loaded.

if((getNamespaceVersion("AMCP") >= "0.0.4")==FALSE) print("Please install the most recent version of the R package 'AMCP'.")

Next, we load the relevant data set of interest via the data() function. For the Chapter 1, Table 1 data, we do the following.

data(chapter_1_table_1)

To return the data (i.e., see what it looks like) you can simply call upon it by typing (or copying code and then pasting it into the console) before hitting return. Notice here and elsewhere that R output is preceded by a two number signs (also called a hash character or pound sign). Note that hitting the up arrow a previously used command will be recalled and ready to be entered again.

chapter_1_table_1

##    treat control week
## 1     28      32    1
## 2     31      25    1
## 3     25      15    1
## 4     23      25    1
## 5     28      16    1
## 6     26      30    2
## 7     36      24    2
## 8     23      13    2
## 9     23      25    2
## 10    24      16    2

For larger data sets, you may only want to see the first part (using head()) or the last part (using tail()).

head(chapter_1_table_1)

##   treat control week
## 1    28      32    1
## 2    31      25    1
## 3    25      15    1
## 4    23      25    1
## 5    28      16    1
## 6    26      30    2

tail(chapter_1_table_1)

##    treat control week
## 5     28      16    1
## 6     26      30    2
## 7     36      24    2
## 8     23      13    2
## 9     23      25    2
## 10    24      16    2

This data set is structured by Condition and by Week in a paired samples design. Often useful is a variable that is the difference between the two (paired) conditions. Although there are other ways in which this can be done within R, here we extract the Treatment and Control values from the data frame, take the difference, and then create a new variable called Difference that is augmented to the data frame. We use the $ operator, which extracts the variable on the right from the data frame on the left. For example, Data$Score would extract the variable Score from the data frame Data. In addition to extracting variables, the $ operator can be used to assign a new variable to a data frame, as is done by placing $ after the data frame (i.e., the chapter 1, table 1 object) but before the (new) variable that is being created (i.e., Difference) using the assignment operator (i.e., <-) as shown below. The assignment operator assigns what is on the right (whether it is the result of a calculation as below or another object) to the object on the left.

chapter_1_table_1$Difference <- chapter_1_table_1$treat - chapter_1_table_1$control
chapter_1_table_1

##    treat control week Difference
## 1     28      32    1         -4
## 2     31      25    1          6
## 3     25      15    1         10
## 4     23      25    1         -2
## 5     28      16    1         12
## 6     26      30    2         -4
## 7     36      24    2         12
## 8     23      13    2         10
## 9     23      25    2         -2
## 10    24      16    2          8

There are many ways to summarize data in R. Using only “base R” (i.e., the functions that come with R and not any add-on packages) we first extract the particular subsets of interest (e.g., week 1 for the treatment group) and then apply a particular function to the subsets of interest. Using subsets of the data is important here because of the way the data are structured (by Condition and by Week), and recalling that this is a paired samples design there is a linkage among the scores that cannot be ignored. First, for illustration, we extract the week 1 scores for the Treatment group. We do this with square brackets to subset the data frame into the identified rows and the identified column (variable) of interest. The specific set of rows (i.e., what is to the left of the comma below) is when week from the data set is equal to 1: week==1. Note that we obtain TRUE or FALSE values for the week variable from the data chapter_1_table_1 and then R will subset the data to those in which the row values is TRUE (thus ignoring the rows in which week does not equal 1). To the right of the comma but still within the square brackets is the variable(s) of interest, here only treat, which we encase in double quotes. Thus, we have subsetted the data for Week=1 for the Treatment group as follows:

chapter_1_table_1[chapter_1_table_1$week==1, "treat"]

## [1] 28 31 25 23 28

Now, taking the subsetting ideas above and applying a few functions, we can obtain some descriptive statistics for Week=1 for the Treatment group (other descriptive statistics can be requested as well).

sum(chapter_1_table_1[chapter_1_table_1$week==1, "treat"])

## [1] 135

mean(chapter_1_table_1[chapter_1_table_1$week==1, "treat"])

## [1] 27

sd(chapter_1_table_1[chapter_1_table_1$week==1, "treat"])

## [1] 3.082207

summary(chapter_1_table_1[chapter_1_table_1$week==1, "treat"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      23      25      28      27      28      31

We also obtain some descriptive statistics for Week=1 for the Control group,

sum(chapter_1_table_1[chapter_1_table_1$week==1, "control"])

## [1] 113

mean(chapter_1_table_1[chapter_1_table_1$week==1, "control"])

## [1] 22.6

sd(chapter_1_table_1[chapter_1_table_1$week==1, "control"])

## [1] 7.092249

summary(chapter_1_table_1[chapter_1_table_1$week==1, "control"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    15.0    16.0    25.0    22.6    25.0    32.0

We also obtain some descriptive statistics for Week=2 for the Treatment group,

sum(chapter_1_table_1[chapter_1_table_1$week==2, "treat"])

## [1] 132

mean(chapter_1_table_1[chapter_1_table_1$week==2, "treat"])

## [1] 26.4

sd(chapter_1_table_1[chapter_1_table_1$week==2, "treat"])

## [1] 5.504544

summary(chapter_1_table_1[chapter_1_table_1$week==2, "treat"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    23.0    23.0    24.0    26.4    26.0    36.0

We also obtain some descriptive statistics for Week=2 for the Control group,

sum(chapter_1_table_1[chapter_1_table_1$week==2, "control"])

## [1] 108

mean(chapter_1_table_1[chapter_1_table_1$week==2, "control"])

## [1] 21.6

sd(chapter_1_table_1[chapter_1_table_1$week==2, "control"])

## [1] 6.94982

summary(chapter_1_table_1[chapter_1_table_1$week==2, "control"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    13.0    16.0    24.0    21.6    25.0    30.0

We emphasize that the method above is only one way of obtaining descriptive statistics for group week by week.

Further, suppose one would like to obtain descriptive statistics for the difference between the treatment and control groups (i.e., the Difference variable). Again, these are values that we consider by week.

sum(chapter_1_table_1[chapter_1_table_1$week==1, "Difference"])

## [1] 22

mean(chapter_1_table_1[chapter_1_table_1$week==1, "Difference"])

## [1] 4.4

sd(chapter_1_table_1[chapter_1_table_1$week==1, "Difference"])

## [1] 7.127412

summary(chapter_1_table_1[chapter_1_table_1$week==1, "Difference"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -4.0    -2.0     6.0     4.4    10.0    12.0

sum(chapter_1_table_1[chapter_1_table_1$week==2, "Difference"])

## [1] 24

mean(chapter_1_table_1[chapter_1_table_1$week==2, "Difference"])

## [1] 4.8

sd(chapter_1_table_1[chapter_1_table_1$week==2, "Difference"])

## [1] 7.293833

summary(chapter_1_table_1[chapter_1_table_1$week==2, "Difference"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -4.0    -2.0     8.0     4.8    10.0    12.0

If we want to calculate, for example, the sum and mean, of the scores across week for each of the variables (Treatment, Control, & Difference), we can leave blank the value to the left of the comma within the square brackets. Leaving the rows (or columns) blank returns all rows (or columns). Keep in mind that the code below ignores Week.

sum(chapter_1_table_1[ , "treat"])

## [1] 267

sum(chapter_1_table_1[ , "control"])

## [1] 221

sum(chapter_1_table_1[ , "Difference"])

## [1] 46

mean(chapter_1_table_1[ , "treat"])

## [1] 26.7

mean(chapter_1_table_1[ , "control"])

## [1] 22.1

mean(chapter_1_table_1[ , "Difference"])

## [1] 4.6

Chapter 1 discusses counting rules. There are multiple ways of addressing such questions within R, from specialized packages, use of the factorial function (which is factorial()) with definitional formulas, and use of the choose() and combn() functions, among others. Recall that combinations are unordered choices (whereas permutations are ordered choices). For combinations, R has a function that allows a user to easily count and enumerate all combinations. Recalling Chapter 1, the number of ways in which 2 objects can be chosen from a set of 8 [in which order does not matter, meaning that (A, B) is the same as (B, A)] can be addressed as follows

choose(n=8, k=2)

## [1] 28

If one wants to know the number of ways that “eight things can be taken four at a time,” the choose() function would be used as

choose(n=8, k=4)

## [1] 70

To actually enumerate (list) the various ways in which one can choose 2 from 8, the combn() function can be used, which returns a matrix of the various ways in which the objects can be selected (i.e., it is an enumerated listing of the varoius ways). Note that the number of rows in the returned matrix will be m with the number of columns being the number of possible ways (which is equivalent in number to the above use of the choose() function).

combn(x=8, m=2)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    1    1    1    1    1    1    1    2    2     2     2     2     2
## [2,]    2    3    4    5    6    7    8    3    4     5     6     7     8
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]     3     3     3     3     3     4     4     4     4     5     5
## [2,]     4     5     6     7     8     5     6     7     8     6     7
##      [,25] [,26] [,27] [,28]
## [1,]     5     6     6     7
## [2,]     8     7     8     8

Furthermore, the factorial function can be used to address questions such as “how many ways can one choose 2 cups out of eight, where order does not matter.”

factorial(8)/(factorial(2)*factorial(6))

## [1] 28

Further, note that one can compute the number of permutations of some number things taken some amout at a time, where order does matter (i.e., allowing the same cups but with different orders). For example, in the book we discuss the number of ways of choosing 4 cups from 8 cups in a particular order. This can be computed as follows (do note, however, that these are not the distinct sets of cups).

factorial(8)/(factorial(4))

## [1] 1680

Notice that, directly above, there is only one term in the denominator [i.e., factorial(8-4)]. Thus, each set of four particular cups would appear 4 × 3 × 2 × 1, or 24, times in a listing of the 1680 orderings. Thus, one can divide the number of ways in which order matters, here 1680, by (8-4)!=4*3*2*1 to find the number of distinct orders (or, simply find the permutation of 8 things taken 4 at a time). This will yield a smaller answer because it will consider cups selected as number 1, 2, 3, and 4 the same as cups selected as number 4, 3, 2, and 1.

factorial(8)/(factorial(4)*factorial(8-4))

## [1] 70

Unlike the choose() function for finding the number of combinations, R does not have a base function for dealing with permutations directly. However, see the permutations function in the gtools package).

In the context of the paired differences as discussed, for the 10 pairs in which we are interested in the number of possible combinations of the signed differences that could have occurred, we compute this in R using the “hat” symbol that denotes an exponent, with 2 being the number (for positive or negative sign) for which the exponent 10 (because we have 10 pairs) is used

2^10

## [1] 1024

Of course, considering only the first week with the 5 pairs, we have:

2^5

## [1] 32

Given the above background and returning to the data from Table 1.1, we walk, step-by-step, through a procedure that can be used for a permutation test as discussed. First, it is useful to know about the expand.grid() function. This function provides all combinations of the supplied vectors (or factors). For us, consider that the value of -4 observed for pair 1 could have been 4 if the (paired) scores had been sampled in opposite fashion and in reality there was no difference between the treatment and control. Similarly, the observed value of 6 could have been -6, et cetera. By supplying vectors of the positive and negative values of those observed in separate vectors, expand.grid will supply all combinations of the scores.

Possible.Assignments <- expand.grid(c(4, -4), c(6, -6), c(10, -10), c(2, -2), c(12, -12))
Possible.Assignments

##    Var1 Var2 Var3 Var4 Var5
## 1     4    6   10    2   12
## 2    -4    6   10    2   12
## 3     4   -6   10    2   12
## 4    -4   -6   10    2   12
## 5     4    6  -10    2   12
## 6    -4    6  -10    2   12
## 7     4   -6  -10    2   12
## 8    -4   -6  -10    2   12
## 9     4    6   10   -2   12
## 10   -4    6   10   -2   12
## 11    4   -6   10   -2   12
## 12   -4   -6   10   -2   12
## 13    4    6  -10   -2   12
## 14   -4    6  -10   -2   12
## 15    4   -6  -10   -2   12
## 16   -4   -6  -10   -2   12
## 17    4    6   10    2  -12
## 18   -4    6   10    2  -12
## 19    4   -6   10    2  -12
## 20   -4   -6   10    2  -12
## 21    4    6  -10    2  -12
## 22   -4    6  -10    2  -12
## 23    4   -6  -10    2  -12
## 24   -4   -6  -10    2  -12
## 25    4    6   10   -2  -12
## 26   -4    6   10   -2  -12
## 27    4   -6   10   -2  -12
## 28   -4   -6   10   -2  -12
## 29    4    6  -10   -2  -12
## 30   -4    6  -10   -2  -12
## 31    4   -6  -10   -2  -12
## 32   -4   -6  -10   -2  -12

Now, we want the sum of the various combinations of differences based on all possible assignments. We use the apply() function, which takes a function (in this case sum()) and applies the function row-by-row (when MARGIN=1) to the specified object (in this case Possible.Assignments). Then, we use the sort() function (to match the book we specify decreasing=TRUE). Finally, we identify the combinations that lead to a sum that is greater than or equal to 22, and then find the proportion of such cases.

Sums <- apply(Possible.Assignments, 1, sum)
Sums

##  [1]  34  26  22  14  14   6   2  -6  30  22  18  10  10   2  -2 -10  10
## [18]   2  -2 -10 -10 -18 -22 -30   6  -2  -6 -14 -14 -22 -26 -34

sort(Sums, decreasing=TRUE)

##  [1]  34  30  26  22  22  18  14  14  10  10  10   6   6   2   2   2  -2
## [18]  -2  -2  -6  -6 -10 -10 -10 -14 -14 -18 -22 -22 -26 -30 -34

# See which are greater than or equal to 22 (the original)
Sums >= 22

##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# The mean of TRUE/FALSE data is the proportion of TRUE.
# Note that TRUE is treated numerically as 1 and FALSE is treated numerically as 0).
mean(Sums >= 22)

## [1] 0.15625

Or, more succintly

# Observed sum of the differences.
Value <- sum(chapter_1_table_1[chapter_1_table_1$week==1, "Difference"])  

# Select the observed and the opposite sign differences.  
Diffs <- cbind(chapter_1_table_1[chapter_1_table_1$week==1, "Difference"], 
               -chapter_1_table_1[chapter_1_table_1$week==1, "Difference"]) 

# All combinations of differences.
Possible.Assignments <- expand.grid(lapply(split(Diffs, row(Diffs)), matrix, ncol=2))  

# First, find the sum of the combinations. Then, find the proportion greater than the 
# observed value.
mean(apply(Possible.Assignments, 1, sum) >= Value)

## [1] 0.15625

Had we wanted to perform a t-test, there are two ways in which we can procede First, we already have the differences. A single sample t-test on the difference scores is equivalant to a paired samples t-test. The t.test() function in R performs, based on the specifications, a single sample t-test, a paired-samples t-test, or an independent-samples t-test (with or without assuming homogeneity of variance). If only a single variable is specified, a single-sample t-test is performed (with the null value set to 0, which can be modified using mu=X, where X is the null value of interest). For a single sample t-test we can thus use the differences for week=1 and use the following for a two-sided p-value (we show a single-sided p-value momentarily):

t.test(chapter_1_table_1[chapter_1_table_1$week==1, "Difference"])

## 
##  One Sample t-test
## 
## data:  chapter_1_table_1[chapter_1_table_1$week == 1, "Difference"]
## t = 1.3804, df = 4, p-value = 0.2396
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -4.449851 13.249851
## sample estimates:
## mean of x 
##       4.4

Alternatively, we could have specified the Treatment and Control scores and identified the data as a paired design:

t.test(chapter_1_table_1[chapter_1_table_1$week==1, "treat"], 
       chapter_1_table_1[chapter_1_table_1$week==1, "control"], paired=TRUE)

## 
##  Paired t-test
## 
## data:  chapter_1_table_1[chapter_1_table_1$week == 1, "treat"] and chapter_1_table_1[chapter_1_table_1$week == 1, "control"]
## t = 1.3804, df = 4, p-value = 0.2396
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.449851 13.249851
## sample estimates:
## mean of the differences 
##                     4.4

Doing the above across the two weeks (i.e., using the 10 differences), we could do the following:

t.test(chapter_1_table_1[, "treat"], chapter_1_table_1[, "control"], paired=TRUE)

## 
##  Paired t-test
## 
## data:  chapter_1_table_1[, "treat"] and chapter_1_table_1[, "control"]
## t = 2.1386, df = 9, p-value = 0.06116
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2658294  9.4658294
## sample estimates:
## mean of the differences 
##                     4.6

Or, using the $ operator to extract a variable (as we are not subsetting rows) could be done as follows:

t.test(chapter_1_table_1$treat, chapter_1_table_1$control, paired=TRUE)

## 
##  Paired t-test
## 
## data:  chapter_1_table_1$treat and chapter_1_table_1$control
## t = 2.1386, df = 9, p-value = 0.06116
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2658294  9.4658294
## sample estimates:
## mean of the differences 
##                     4.6

However, recalling from Chapter 1 that a single-sided (one-sided) test is performed, we need the p-value that corresponds to the alternative hypothesis of interest. The three options for the alternative hypothesis can be specified using the alternative argument (with two.sided (the default), less, or greater), depending on the specified alternative hypothesis of interest. Here we use a greater than alternative hypothesis (to complement the less than or equal to null hypothesis)

t.test(chapter_1_table_1$treat, chapter_1_table_1$control, paired=TRUE, alternative="greater")

## 
##  Paired t-test
## 
## data:  chapter_1_table_1$treat and chapter_1_table_1$control
## t = 2.1386, df = 9, p-value = 0.03058
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.6570313       Inf
## sample estimates:
## mean of the differences 
##                     4.6

Chapter 1 also discusses Fisher’s Exact test. The R function for implementing Fisher’s Exact test is fisher.test(). The R help files demonstrates the use of the function for the tea problem as is discussed (this is a famous data set). We modify the code given in the helpfile for our purposes here. First, note that a matrix is created that summarizes the results. One dimension (rows) identifies the guesses, whereas the other dimension (columns) identifies the actual classification. Use of the ‘matrix()’ function can make inputting tabular data easier.

TeaTasting <- matrix(c(3, 1, 1, 3),
       nrow = 2,
       dimnames = list(Guess  = c("Milk-first", "Tea-first"),
                       Actual = c("Milk-first", "Tea-first")))

# An alternative way of entering the data could be 
rbind(MF=c(3, 1), TF=c(1, 3))

##    [,1] [,2]
## MF    3    1
## TF    1    3

# Note that this is one interpretation of the type of test. Here the estimate is a 
# conditional maximum likelihood estimate and differs from the sample odds ration 
# (see the help file [?fisher.test] for how R addresses this 2-by-2 table, but 
# do note that there are other approaches). 
fisher.test(TeaTasting, alternative = "greater")

## 
##  Fisher's Exact Test for Count Data
## 
## data:  TeaTasting
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.3135693       Inf
## sample estimates:
## odds ratio 
##   6.408309