Statistical models in R

Christophe Lalanne
October 22, 2013

Synopsis

The analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic. Ronald Fisher (1890-1962)

design of experiments • split-apply-combine • one-way and two-way ANOVA • interaction

Lectures: OpenIntro Statistics, 4.2, 5.2, 5.5.

Design of experiments

Maximize precision while minimizing the number of trials.

Implementation of an organized set of experimental units in order to characterize the effect of certain treatments, or combinations of treatments, on one or more response variables.

Experimental design also takes into account one or more nuisance factors: sources of unwanted variation are organized so that they can be assumed to affect all treatments equivalently, which makes comparisons between treatments possible.

Some examples

  • Parallel (independent) groups
  • Completely randomized design
  • Incomplete block design (e.g., Latin square)
  • Split-plot design
  • Repeated measures, including cross-over trials

Randomization (random allocation of units to treatments, which distinguishes experimental from quasi-experimental designs), factorial arrangement of treatments, and blocking (grouping of similar units based on known but irrelevant characteristics) are key components of experimental design (Montgomery, 2012).

Describing relationships between variables

R relies on a 'formula' to describe the relation between one or multiple response variables and one or more explanatory variables, according to Wilkinson & Rogers's notation (Wilkinson & Rogers, 1973; Chambers & Hastie, 1992).

In the case of ANOVA and regression, a single response variable is put on the left of the ~ operator, and the explanatory variables appear on its right-hand side (RHS):

RHS     Variable type     Meaning                     Equiv. to
x       numeric           simple linear regression    1 + x
x + 0   numeric           idem, without intercept     x - 1
a + b   factor (numeric)  two crossed main effects
a * b   factor            idem, plus interaction      1 + a + b + a:b
a / b   factor            nested relationship         1 + a + b %in% a
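These equivalences can be checked by inspecting the design matrix R builds from a formula. A minimal sketch, using made-up factors (the data frame dd and its columns are hypothetical):

set.seed(42)
dd <- data.frame(y = rnorm(12),
                 a = gl(2, 6, labels=c("a1", "a2")),
                 b = gl(3, 2, length=12))
colnames(model.matrix(y ~ a * b, data=dd))  # 1 + a + b + a:b
colnames(model.matrix(y ~ a / b, data=dd))  # 1 + a + b %in% a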

R's formula and data analysis

Most of the time, the same formula can be used to perform several 'data steps' (tidying and summarizing data, plotting, or reporting), and it is also the core element of many statistical models in R (linear and generalized linear models, decision trees, partitioning methods, etc.).
The use of formulae also means working with a well-arranged data frame, for which reshaping operations (melting/casting) are essential tools.

d <- data.frame(x1 = rnorm(n=5, mean=10, sd=1.5), 
                x2 = rnorm(n=5, mean=12, sd=1.5))
d[1:3,]  # same as head(d, n=3)
      x1    x2
1  9.511 13.76
2 10.829 12.93
3  8.988 11.83

R's formula and data analysis (con't)

t.test(d$x1, d$x2, var.equal=TRUE)$p.value
[1] 0.001021
library(reshape2)
dm <- melt(d)     # switch from wide to long format
head(dm)
  variable  value
1       x1  9.511
2       x1 10.829
3       x1  8.988
4       x1 10.322
5       x1 10.466
6       x2 13.761
t.test(value ~ variable, data=dm, var.equal=TRUE)$p.value
[1] 0.001021
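The same formula can also drive other 'data steps'; for instance, reusing dm from above, plotting and summarizing:

boxplot(value ~ variable, data=dm)           # graphical comparison of the two groups
aggregate(value ~ variable, data=dm, mean)   # group means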

Note: R's formulae (and data frames, de facto) have been ported to Python (e.g., the patsy and pandas libraries) and Julia (e.g., DataFrames.jl).

The Split-Apply-Combine strategy

“(…) break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.” (Wickham, 2011)
See the plyr package (we won't use it, though).

[Figure: the split-apply-combine strategy]

Split-Apply-Combine (con't)

Here is a working example:

x <- rnorm(n=15)          # 15 random gaussian variates
grp <- gl(n=3, k=5, labels=letters[1:3])
spl <- split(x, grp)      # split x by levels of grp
apl <- lapply(spl, mean)  # apply mean() to each split 
cbn <- rbind(x=apl)       # combine the means
cbn
  a       b       c       
x -0.1088 -0.7705 -0.08854

A shorter version (other options include by(), tapply(), and ave()):

aggregate(x ~ grp, FUN=mean)
  grp        x
1   a -0.10883
2   b -0.77052
3   c -0.08854
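For comparison, tapply(), one of the alternatives mentioned above, returns the same group means as a named vector:

tapply(x, grp, mean)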

One-way ANOVA

Let \( y_{ij} \) be the \( j\text{th} \) observation in group \( i \) (factor \( A \), with \( a \) levels). An effect model can be written as

\[ y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \]

where \( \mu \) stands for the overall (grand) mean, \( \alpha_i \) is the effect of group or treatment \( i \) (\( i=1,\dots,a \)), and \( \varepsilon_{ij}\sim \mathcal{N}(0,\sigma^2) \) reflects random error. The following restriction is usually considered: \( \sum_{i=1}^a\alpha_i=0 \).

The null hypothesis reads: \( H_0:\alpha_1=\alpha_2=\dots=\alpha_a \). It can be tested with an F-test with \( a-1 \) and \( N-a \) degrees of freedom.
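Explicitly, the test statistic is the ratio of the between-group to the within-group Mean Squares,

\[ F = \frac{SS_A/(a-1)}{SS_E/(N-a)} = \frac{MS_A}{MS_E}, \]

which follows an \( F(a-1,\,N-a) \) distribution under \( H_0 \).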

The big picture

Each observation can be seen as a deviation from its group mean, \( y_{ij}=\bar y_i+\varepsilon_{ij} \). Then, the whole variability can be expressed as follows:

\[ \underbrace{(y_{ij}-\bar y)}_{\text{total}} = \underbrace{(\bar y_{i}-\bar y)}_{\text{group}} + \underbrace{(y_{ij}-\bar y_{i})}_{\text{residuals}}. \]

[Figure: decomposition of the total variability into group and residual parts]
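This decomposition carries over to the sums of squares. A minimal numerical check, reusing x and grp from the split-apply-combine example:

grp.means <- ave(x, grp)                   # group mean repeated for each observation
ss.total <- sum((x - mean(x))^2)
ss.group <- sum((grp.means - mean(x))^2)
ss.resid <- sum((x - grp.means)^2)
all.equal(ss.total, ss.group + ss.resid)   # TRUE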

Assumptions, caveats, etc.

  • This is an omnibus test for which the alternative hypothesis reads \( \exists i,j\mid \alpha_i\neq\alpha_j,\: i, j=1,\dots,a\, (i\neq j) \). If the result is significant, it doesn't tell us which pairs of means really differ.
  • Besides independence of observations, this model assumes that variances are equal in each population (which is hard to assess with formal tests) and that residuals are approximately normally distributed.
  • As always, a statistically significant result does not necessarily mean an interesting result from a practical point of view: We need to provide a summary of raw or standardized effects.

Different scenarios

[Figure: different ANOVA scenarios]

Application

Effect of different sugars on length of pea sections grown in tissue culture with auxin present. (Sokal & Rohlf, 1995)

   ctrl X2G X2F X1G1F X2S
1    75  57  58    58  62
2    67  58  61    59  66
3    70  60  56    58  65
4    75  59  58    61  63
5    65  62  57    57  64
6    71  60  56    56  62
7    67  60  61    58  65
8    67  57  60    57  65
9    76  59  57    57  62
10   68  61  58    59  67

Application (con't)

  • Response variable: length (in ocular units) of pea sections.
  • Explanatory variable: treatment (control, 2% glucose, 2% fructose, 1% glucose + 1% fructose, 2% sucrose).
head(peas, n=3)
  ctrl X2G X2F X1G1F X2S
1   75  57  58    58  62
2   67  58  61    59  66
3   70  60  56    58  65
peas.melted <- melt(peas, value.name="length", 
                    variable.name="treatment")
head(peas.melted, n=3)
  treatment length
1      ctrl     75
2      ctrl     67
3      ctrl     70

Application (con't)

f <- function(x) c(mean=mean(x), sd=sd(x))
aggregate(length ~ treatment, data=peas.melted, f)
  treatment length.mean length.sd
1      ctrl      70.100     3.985
2       X2G      59.300     1.636
3       X2F      58.200     1.874
4     X1G1F      58.000     1.414
5       X2S      64.100     1.792


Application (con't)

mod <- aov(length ~ treatment, data=peas.melted)
summary(mod)
            Df Sum Sq Mean Sq F value  Pr(>F)
treatment    4   1077   269.3    49.4 6.7e-16
Residuals   45    246     5.5                
model.tables(mod)
Tables of effects

 treatment 
treatment
 ctrl   X2G   X2F X1G1F   X2S 
 8.16 -2.64 -3.74 -3.94  2.16 
aggregate(length ~ treatment, peas.melted, mean)$length - mean(peas.melted$length)
[1]  8.16 -2.64 -3.74 -3.94  2.16
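Since the F-test is an omnibus test, a significant result is usually followed by pairwise comparisons with a multiplicity correction; a minimal sketch using Tukey's HSD on the fitted model:

TukeyHSD(mod)   # all pairwise differences, with family-wise adjusted CIs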

Application (con't)

Data = model fit + residual

d <- cbind.data.frame(peas.melted, 
                      fit = fitted(mod), 
                      residual = resid(mod))
res <- aggregate(.  ~ treatment, data=d, mean)
res$effect <- res$fit - mean(peas.melted$length)
res
  treatment length  fit   residual effect
1      ctrl   70.1 70.1 -1.232e-15   8.16
2       X2G   59.3 59.3  5.995e-16  -2.64
3       X2F   58.2 58.2 -6.665e-17  -3.74
4     X1G1F   58.0 58.0  1.999e-16  -3.94
5       X2S   64.1 64.1  3.108e-16   2.16

Two-way ANOVA

Let \( y_{ijk} \) be the \( k\text{th} \) observation for level \( i \) of factor \( A \) (\( i=1,\dots,a \)) and level \( j \) of factor \( B \) (\( j=1,\dots,b \)). The full model can be written as follows:

\[ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \]

where \( \mu \) is the overall mean, \( \alpha_i \) (resp. \( \beta_j \)) is the deviation of the \( a \) (resp. \( b \)) group means from the overall mean, \( \gamma_{ij} \) is the deviation of the \( A\times B \) cell means from the additive part of the model (\( \mu+\alpha_i+\beta_j \)), and \( \varepsilon_{ijk}\sim {\cal N}(0,\sigma^2) \) is the residual term.

The \( \alpha_i \) and \( \beta_j \) are called main effects, and \( \gamma_{ij} \) is the interaction effect.

Test of null hypothesis

The null hypotheses associated with the full factorial design are given below:

  • \( H_0^A:\, \alpha_1=\alpha_2=\dots=\alpha_a \) (\( a-1 \) dof): no effect of A
  • \( H_0^B:\, \beta_1=\beta_2=\dots=\beta_b \) (\( b-1 \) dof): no effect of B
  • \( H_0^{AB}:\, \gamma_{11}=\gamma_{12}=\dots=\gamma_{ab} \) (\( (a-1)(b-1) \) dof): no interaction between A and B

The ratio of the Mean Square for each term to the residual Mean Square can be tested with Fisher-Snedecor F-tests.
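A minimal sketch of fitting such a model in R, on a hypothetical balanced layout (dd, a, b, and y are made up):

set.seed(101)
dd <- expand.grid(a = gl(2, 1), b = gl(3, 1), rep = 1:5)
dd$y <- rnorm(nrow(dd), mean=10)
summary(aov(y ~ a * b, data=dd))   # F-tests for A, B, and the A:B interaction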

Interaction between factors

[Figure: interaction between factors]

References

Chambers J and Hastie T (1992). Statistical Models in S. Wadsworth & Brooks. ISBN: 0534167649.

Montgomery D (2012). Design and Analysis of Experiments, 8th edition. John Wiley & Sons.

Sokal R and Rohlf F (1995). Biometry, 3rd edition. WH Freeman and Company.

Wickham H (2011). “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software, 40(1).

Wilkinson G and Rogers C (1973). “Symbolic description of factorial models for analysis of variance.” Applied Statistics, 22, pp. 392-399.