Abstract

We investigate Tukey’s One Degree of Freedom test for non-additivity (interaction) when there is only one entry per cell in the 2-way anova design. Following the logic and presentation by Milliken and Johnson (1989), we solve the problem for the sorghum dataset using R and then simplify and re-solve the problem using the R package dae (Brien, 2016).

Introduction

We investigate Tukey’s One Degree of Freedom test for non-additivity (interaction) when there is only one entry per cell in the 2-way analysis of variance (anova) design. We draw largely upon two sources: (1) Analysis of Messy Data Volume 2 Nonreplicated Experiments (Milliken and Johnson, 1989, pp 2-12); and (2) an unauthored PDF from the Univeristy of New Brunswick, “Notes on Tukey’s One Degree of Freedom Test for Interaction” found on the Web: http://www.math.unb.ca/~rolf/Courses/07w/3373/notes-tukey1df.pdf. Source (1) (Milliken and Johnson) solves the problem for the growth rate of sorghum plants using SAS. Source (2) solves the problem using Minitab. Following the general logic and presentations found in sources (1) and (2), we analyze and solve the problem for the same sorghum dataset using R. We then show how to simplify the analysis and re-solve the problem using the dae package (Brien, 2016) in R.

The sorghum data

Following the presentation given by Milliken and Johnson (1989, pp 2-3), suppose an investigator has 20 growth chambers and wishes to study the effects of five temperature levels in combination with four humidity levels on the growth rate of sorghum plants. The investigator places 10 sorghum plants of the same species in each of the 20 growth chambers and assigns the Temperature \(\times\) Humidity combinations randomly to the 20 chambers. After the sorghum plants had been growing for four weeks, the average (mean) height of the ten plants for each chamber was recorded. The sorghum data are shown in Table 1.

Table 1: Mean Height of Sorghum Plants

Humidity (%)
Temperature (F\(^{\circ}\)) 20% 40% 60% 80%
50 F\(^{\circ}\) 12.3 19.6 25.7 30.4
60 F\(^{\circ}\) 13.7 16.9 27.0 31.5
70 F\(^{\circ}\) 17.8 20.0 26.3 35.9
80 F\(^{\circ}\) 12.1 17.4 36.9 43.4
90 F\(^{\circ}\) 6.9 18.8 35.0 53.0

General analysis

How should we set up a statistical model for this 2-way design with one entry per cell? There are two factors “T” (Temperature) and “H” (Humidity), which correspond to the 5 rows and 4 columns in Table 1.

Let \(y_{ij}\) (height) be the single observation corresponding to \(ij^{th}\) cell (row \(i\) and column \(j\)) in the table. A means model for experiments such as this is given by

\[ y_{ij} = \mu_{ij} + \epsilon_{ij}, \hspace{5mm} i = 1,...,5, \hspace{5mm} j = 1,...,4 \]

where

\[ \epsilon_{ij} \sim \mbox{i.i.d.} \hspace{5mm} N(0, \sigma^{2}) \]

Note that it is the residuals (\(\epsilon_{ij}\)), not the observations (\(y_{ij}\)), which are assumed to be independently and identically distributed normal random variates. In this model there are 20 parameters and 20 observations. The best estimate of the \(\mu_{ij}\) is \(\hat{\mu}{_{ij}} = y_{ij}\), i = 1,…,5, j = 1,…,4. However, in this model, there is no estimate of \(\sigma^{2}\) available unless the investigator is able to make some assumptions about the \(\mu_{ij}\)’s. One assumption that could be made is that the two sets of treatments do not interact. This is equivalent to assuming that \(\mu_{ij} - \mu_{i^{\prime}j} - \mu_{ij^{\prime}} + \mu_{i^{\prime}j^{\prime}} = 0\) \(\forall\) \(i, j, i^{\prime}\) and \(j^{\prime}\). If there is no interaction between the levels of the treatments, the best estimate of \(\sigma^{2}\) is

\[ \hat{\sigma}^{2} = \sum_{i = 1}^{5} \sum_{j = 1}^{4} (y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})^{2} / (5-1)(4-1) \]

where the \((5-1)(4-1)\) in the denominator represents the degrees of freedom for the interaction.

Changing the notation slightly, let \(Y_{ij}\) be the (single) Height measurement in cell (\(i, j\)), \(T_{i}\) is the \(i^{th}\) level of Temperature (\(T\)), \(H_{j}\) is the \(j^{th}\) level of Humidity (\(H\)), and \((TH)_{ij}\) is the interaction between level \(i\) of \(T\) and level \(j\) of \(H\). An appropriate model for this would be

\[ Y_{ij} = \mu + T_{i} + H_{j} + (TH)_{ij} + E_{ij} \]

For this experiment, we would consume 1 df used for the grand mean, 4 df for Temperature, 3 df for Humidity and 4 \(\times\) 3 = 12 df for the interaction of Temperature and Humidity. This sums to 1 + 4 + 3 + 12 = 20 df for the model. In other words, the model is fully saturated. If we tried to fit this model with anova we would have no degrees of freedom left for the estimate for the error term, \(E_{ij}\).

John Tukey proposed an elegant solution to this problem (1949). It is popularly known as Tukey’s One Degree of Freedom Test for Non-Additivity. One way to model this is:

\[ Y_{ij} = \mu + T_{i} + H_{j} + \lambda \cdot T_{i} \cdot H_{j} + E_{ij} \]

where \(\lambda\) is the scalar multiple of interaction between the row and column main effects. A test for interaction in the model is made by testing \(H_{0}: \lambda = 0\) vs \(H_{0}: \lambda \ne 0\).

Specific analysis in R

How do we perform Tukey’s 1df test in R? Begin with a graph. Figure 1 shows a simple interaction plot of the sorghum data.

Temperature by Humidity Interaction Plot of `sorghum` data.

Temperature by Humidity Interaction Plot of sorghum data.

If there is no interaction between the levels of one factor and those of another, the segments of the trace lines should be statistically parallel to each other. The presence of interaction would suggest that the segments of the trace lines are not parallel to each other.

The idea of interaction for the sorghum data would be reflected, graphically, by the non-parallelism of the trace lines for each humidity level across temperature levels. Is this the case? We cannot simply eyeball the graph to determine whether interaction is present. We must conduct a formal test; but how do we do it in R?

Tukey’s solution is to first fit an additive model, described here as

\[ Y_{ij} = \mu + T_{i} + H_{j} + E_{ij} \]

The next step is to use the estimates of \(T_{i}\) and \(H_{j}\) to create a new variable equal to the product of these estimates, \(\hat{T}_i \hat{H}_j\). Now, fit a second model,

\[ Y_{ij} = \mu + T_{i} + H_{j} + \lambda \cdot \hat{T}_i \hat{H}_j + E_{ij} \]

In this model, \(\hat{T}_i \hat{H}_j\) is a covariate and \(\lambda\) is the coefficient we wish to estimate.

So, let’s work out the solution in R.

Displayed is the entire sorghum data frame, which has been tidied from the matrix of values in Table 1.

##    Temp Humid Height
## 1    50    20   12.3
## 2    50    40   19.6
## 3    50    60   25.7
## 4    50    80   30.4
## 5    60    20   13.7
## 6    60    40   16.9
## 7    60    60   27.0
## 8    60    80   31.5
## 9    70    20   17.8
## 10   70    40   20.0
## 11   70    60   26.3
## 12   70    80   35.9
## 13   80    20   12.1
## 14   80    40   17.4
## 15   80    60   36.9
## 16   80    80   43.4
## 17   90    20    6.9
## 18   90    40   18.8
## 19   90    60   35.0
## 20   90    80   53.0

Temp and Humid are factor variables and Height is the criterion measure. Here is a summary of the fit for the additive model.

## 
## Call:
## lm(formula = Height ~ Temp + Humid, data = sorghum)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.055 -3.241  0.345  3.051 10.765 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    9.530      3.654   2.608 0.022867
## Temp60         0.275      4.085   0.067 0.947433
## Temp70         3.000      4.085   0.734 0.476790
## Temp80         5.450      4.085   1.334 0.206897
## Temp90         6.425      4.085   1.573 0.141720
## Humid40        5.980      3.654   1.637 0.127616
## Humid60       17.620      3.654   4.823 0.000417
## Humid80       26.280      3.654   7.193  1.1e-05
## 
## Residual standard error: 5.777 on 12 degrees of freedom
## Multiple R-squared:  0.8467, Adjusted R-squared:  0.7572 
## F-statistic: 9.465 on 7 and 12 DF,  p-value: 0.0004562

We now calculate the marginal temperature (Temp) and humidity (Humid) means. These are, respectively:

##   T_50   T_60   T_70   T_80   T_90 
## 22.000 22.275 25.000 27.450 28.425
##  H_20  H_40  H_60  H_80 
## 12.56 18.54 30.18 38.84

In order to get the \(\hat{T}_i\) and \(\hat{H}_j\) values, we must subtract the grand mean, \(\bar{Y}_{ij}\), from the marginal temperature and humidity values. The grand mean, GM, for these data is 25.03. The resulting \(\hat{T}_i\) and \(\hat{H}_j\) values are thus:

## [1] -3.030 -2.755 -0.030  2.420  3.395
## [1] -12.47  -6.49   5.15  13.81

Now that the intermediate values have been computed, we are ready to construct an augmented data frame for the next fit. For completeness, we display this entire new, augmented data frame (sorghum.aug).

##    Temp Humid Height  T_hat  H_hat    lambda
## 1    50    20   12.3 -3.030 -12.47  37.78410
## 2    50    40   19.6 -3.030  -6.49  19.66470
## 3    50    60   25.7 -3.030   5.15 -15.60450
## 4    50    80   30.4 -3.030  13.81 -41.84430
## 5    60    20   13.7 -2.755 -12.47  34.35485
## 6    60    40   16.9 -2.755  -6.49  17.87995
## 7    60    60   27.0 -2.755   5.15 -14.18825
## 8    60    80   31.5 -2.755  13.81 -38.04655
## 9    70    20   17.8 -0.030 -12.47   0.37410
## 10   70    40   20.0 -0.030  -6.49   0.19470
## 11   70    60   26.3 -0.030   5.15  -0.15450
## 12   70    80   35.9 -0.030  13.81  -0.41430
## 13   80    20   12.1  2.420 -12.47 -30.17740
## 14   80    40   17.4  2.420  -6.49 -15.70580
## 15   80    60   36.9  2.420   5.15  12.46300
## 16   80    80   43.4  2.420  13.81  33.42020
## 17   90    20    6.9  3.395 -12.47 -42.33565
## 18   90    40   18.8  3.395  -6.49 -22.03355
## 19   90    60   35.0  3.395   5.15  17.48425
## 20   90    80   53.0  3.395  13.81  46.88495

The last column (lambda) is the product of the two previous columns. We have named it lambda so that it will be reported in the output as the estimate for \(\lambda\), \(\hat{\lambda}\), which we seek.

It is instructive to compare the summary of the fitted interaction model to the previous additive model. Here is the summary of the model…

## 
## Call:
## lm(formula = Height ~ Temp + Humid + lambda, data = sorghum.aug)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8279 -1.7335  0.2861  1.4360  5.2166 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  9.53000    2.01625   4.727 0.000623
## Temp60       0.27500    2.25424   0.122 0.905105
## Temp70       3.00000    2.25424   1.331 0.210173
## Temp80       5.45000    2.25424   2.418 0.034148
## Temp90       6.42500    2.25424   2.850 0.015791
## Humid40      5.98000    2.01625   2.966 0.012839
## Humid60     17.62000    2.01625   8.739 2.79e-06
## Humid80     26.28000    2.01625  13.034 4.95e-08
## lambda       0.14273    0.02678   5.329 0.000241
## 
## Residual standard error: 3.188 on 11 degrees of freedom
## Multiple R-squared:  0.9572, Adjusted R-squared:  0.9261 
## F-statistic: 30.74 on 8 and 11 DF,  p-value: 1.84e-06

All the coefficients for Temperature and Humidity are the same as before. The lambda coefficient, \(\hat{\lambda}\), is 0.1427297 with standard error 0.0267819 and \(p = 0.000241\), highly significant. This means that the Temperature \(\times\) Humidity profiles in Figure 1 are not statistically parallel.

The anova summary table for this model is shown below; it provides the same information about \(\hat{\lambda}\) as shown in the summary.

## Analysis of Variance Table
## 
## Response: Height
##           Df  Sum Sq Mean Sq F value    Pr(>F)
## Temp       4  136.62   34.15  3.3606 0.0498445
## Humid      3 2074.30  691.43 68.0331 2.183e-07
## lambda     1  288.65  288.65 28.4017 0.0002413
## Residuals 11  111.79   10.16

Alternative solution with package dae

Now that we understand the logic behind Tukey’s elegant model, we may simplify the process with use of the dae package (Brien, 2016). First, call in the dae library.

## load package `dae`
library(dae)

Next, invoke the aov function to fit the additive model while specifying the Error term as Humid nested within Temp, i.e., Error(Temp/Humid). (Note: The expression Temp/Humid actually computes the interaction sum of squares, Temp:Humid. Entering Error(Temp:Humid) in the equation code also “works”. However, entering Error(Temp/Humid) provides a cleaner output and avoids an error message indicating singularity in the fitted model.)

sorghum.add.Error.aov<-aov(Height~Temp+Humid+Error(Temp/Humid),data=sorghum)
## display the aov object
sorghum.add.Error.aov
## 
## Call:
## aov(formula = Height ~ Temp + Humid + Error(Temp/Humid), data = sorghum)
## 
## Grand Mean: 25.03
## 
## Stratum 1: Temp
## 
## Terms:
##                    Temp
## Sum of Squares  136.617
## Deg. of Freedom       4
## 
## Estimated effects may be unbalanced
## 
## Stratum 2: Temp:Humid
## 
## Terms:
##                    Humid Residuals
## Sum of Squares  2074.298   400.447
## Deg. of Freedom        3        12
## 
## Residual standard error: 5.776728
## Estimated effects may be unbalanced

Next fit the aov object with the tukey.1df function from the dae package…

tukey.1df.sorghum.add.Error.aov <- tukey.1df( aov.obj = sorghum.add.Error.aov, data = sorghum, error.term = 'Temp:Humid' )

and display the result.

##
tukey.1df.sorghum.add.Error.aov <- tukey.1df(aov.obj=sorghum.add.Error.aov,data=sorghum,error.term='Temp:Humid')
## display the tukey.1df object
tukey.1df.sorghum.add.Error.aov
## $Tukey.SS
## [1] 288.652
## 
## $Tukey.F
## [1] 28.40174
## 
## $Tukey.p
## [1] 0.0002413186
## 
## $Devn.SS
## [1] 111.795

The tukey.1df function lists: (1) Tukey.SS — the sum of squares associated with the estimate for \(\lambda\), \(\hat{\lambda}\), labeled as lambda; (2) Tukey.F — the \(F\)-statistic associated with the estimate for lambda; (3) Tukey.p — the \(p\)-value associated with the estimate for lambda; and (4) Devn.SS — the deviation or Residuals sum of squares obtained in our earlier analysis.

Conclusion

We reviewed the strategy for applying Tukey’s One Degree of Freedom test for non-additivity (interaction) using R. We followed closely the logic, discussion and presentations by: (1) Milliken and Johnson in Analysis of Messy Data Volume 2 Nonreplicated Experiments (1989), pp 2-12; and (2) an unauthored PDF from the University of New Brunswick “Notes on Tukey’s One Degree of Freedom Test for Interaction” found on the Web: http://www.math.unb.ca/~rolf/Courses/07w/3373/notes-tukey1df.pdf. We first analyzed and solved the problem for the sorghum dataset using R and then followed up with a simplification and re-solution using the dae package (Brien, 2016) in R.

References

Brien, Chris (2016). dae: Functions Useful in the Design and ANOVA of Experiments. R package version 2.7-20. https://CRAN.R-project.org/package=dae

http://www.math.unb.ca/~rolf/Courses/07w/3373/notes-tukey1df.pdf

Milliken GA and Johnson DE. (1989) Analysis of Messy Data Volume 2: Nonreplicated Experiments, van Nostrand Reinhold, New York.

Tukey JW. (1949) One degree of freedom for non-additivity, Biometrics, 5:232–242.