2. Engaging the stats-for-psycholinguistics literature: Jaeger (2008)

Read Jaeger 2008, “Categorical Data Analysis” carefully, working through each step of the analysis. Then, respond to the following questions. You may also find it helpful to read Arnon 2010 for context.

  1. Characterize in a precise manner, with formal discussion as needed, what the point of contention is in the initial sections, where Jaeger contrasts ANOVA with ordinary logit models. What exactly is it that he recommends against, and how precisely does his recommended solution differ? (Keep in mind here that ANOVA is a special case of linear regression, and that logistic regression is just linear regression where the object of inference is a binomial parameter transformed into logit space.)

Categorical outcomes are traditionally analyzed using ANOVA over proportions/percentages. Take binary outcomes (A vs. B): if the proportion of A is p, then the proportion of B is 1 - p. The traditional approach uses ANOVA to infer/predict p, which essentially assumes that p is a linear combination of predictors plus homoscedastic noise. Jaeger argues against this approach for two main reasons. First, p is by definition bounded between 0 and 1, but ANOVA, which is essentially a special case of linear regression, can produce predicted probabilities outside [0, 1], which make no sense. Second, if we assume the outcome is binomially distributed, the variance of an observed proportion is p(1 - p)/n and thus depends on p itself, so conditions with different underlying probabilities have different variances, violating ANOVA's assumption of homogeneous variances.

Jaeger instead recommends logistic regression, which transforms probabilities into logit space (log-odds) and performs linear regression there. This overcomes both problems. First, the logit maps (0, 1) onto the whole real line, so every fitted value in logit space corresponds to a legal probability. Second, the logit transformation captures the fact that a given difference in probability around 0.5 is a smaller effect than the same difference close to 0 or 1.
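
As a quick illustration (my addition, not from the paper), base R's qlogis() computes the logit, and equal probability differences correspond to much larger log-odds differences near the boundary than around 0.5:

# logit = log(p / (1 - p)); maps (0, 1) onto the whole real line
diff(qlogis(c(0.50, 0.55)))  # ~0.20 logits
## [1] 0.2006707
diff(qlogis(c(0.90, 0.95)))  # ~0.75 logits
## [1] 0.7472144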

  2. How does the introduction of random effects into logit models address the lingering concern that might have motivated the use of traditional ANOVA? What kind of mixed-effects models do the two kinds of ANOVA he is arguing against (non-transformation and arcsine-transformed) correspond to?

Traditionally, researchers used ANOVA rather than ordinary logit models because the latter do not model random subject or item effects. Mixed logit models, which are a type of Generalized Linear Mixed Model, address this lingering concern by representing the logit of the outcome probability as a linear combination of fixed effects and random effects (by subject, by item, etc.). In these terms, the two ANOVAs Jaeger argues against correspond to Gaussian linear mixed models with an identity link: the untransformed ANOVA to one fit to raw proportions, and the arcsine-transformed ANOVA to one fit to arcsine-square-root-transformed proportions; the mixed logit model differs in using a binomial error distribution and the logit link.

  3. What is the non-circular argument that the interaction is actually spurious - i.e., why does Jaeger conclude from the lack of a significant interaction in a mixed logit model that the ANOVAs are defective analyses, on the ground that they find a significant interaction? (N.B. “The mixed-effects logit model is a better model” does not count as an argument here, since this is what Jaeger is trying to demonstrate!)

The argument goes through the data themselves rather than through model preference. First, the cell standard errors (Table 2) differ substantially across conditions, as expected when the underlying probabilities differ and some conditions sit near ceiling: the variance of a binomial proportion shrinks as p approaches 1. This directly violates ANOVA's homogeneity-of-variance assumption, so the ANOVA p-values are unreliable regardless of what any logit model says. Second, on the bounded probability scale a constant underlying effect necessarily produces smaller differences near ceiling, so the "interaction" the ANOVAs detect - a smaller NPType effect for subject extractions, where accuracy is near ceiling - is exactly the pattern that main effects alone predict once this compression is taken into account. It therefore cannot be taken as evidence of a real interaction.
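
A small numeric sketch (my addition; the baselines are loosely matched to the object- and subject-extraction cell means in Jaeger's Table 2) makes the ceiling compression concrete:

# the same 1-logit NPType effect, added to two different baselines,
# yields unequal differences on the probability scale
baseline <- qlogis(c(object = 0.69, subject = 0.90))
round(plogis(baseline + 1) - plogis(baseline), 3)
##  object subject 
##   0.168   0.061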

  4. Using the data set that Jaeger analyzes (from Inbal Arnon’s experiments on acquisition of Hebrew relative clauses, linked here), replicate Jaeger’s full statistical analysis, showing your R code. Make sure to include all results in the paper that make use of the Arnon data, including the argument that non-transformed and arcsine-transformed ANOVAs yield significant interactions, the various mixed-effects logit models that he constructs, and the comparisons between models that Jaeger suggests motivate a reduced random-effects structure.

To load the data into R, use

d <- data.frame(read.delim("inbal.tab"))

You’ll want to restrict attention to the comprehension studies, and remove null data so that the ANOVAs will run. (As Jaeger mentions, the ability to handle missing and unbalanced data is another key advantage of mixed-effects models over ANOVA.)

d <- subset(d, modality == 1) # comprehension study only
d <- subset(d, Correct != "#NULL!" & !is.na(Extraction) & !is.na(NPType))

Finally, you’ll need to recode some numerically coded categorical variables as factors with informative names. Here’s a key:

d$NPType <- ifelse(d$NPType==1, "lexical", "pronoun")
d$NPType <- as.factor(d$NPType)
d$Extraction <- ifelse(d$Extraction==1, "subject", "object") 
d$Extraction <- as.factor(d$Extraction)
d$Correct <- ifelse(d$Correct==1, "correct", "incorrect")
d$Correct <- factor(d$Correct, levels=c("incorrect", "correct"))

d$child <- as.factor(d$child)
d$itemby4 <- as.factor(d$itemby4)

CorrectN is the numeric encoding of Correct: 1 <=> correct, 0 <=> incorrect.

d$CorrectN <- as.numeric(d$Correct) - 1  # factor representation starts from 1
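# optional sanity check (my addition): CorrectN should line up with Correct
with(d, table(Correct, CorrectN))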
tapply(d$CorrectN, list(factor(d$Extraction, levels=c("subject", "object")), d$NPType), mean)
##           lexical   pronoun
## subject 0.8956044 0.9565217
## object  0.6890244 0.8433735
# standard error of the mean of a vector
StandardError <- function(v){
  return(sd(v)/sqrt(length(v)))
}
tapply(d$CorrectN, list(factor(d$Extraction, levels=c("subject", "object")), d$NPType), StandardError)
##            lexical    pronoun
## subject 0.02272793 0.01507502
## object  0.03625657 0.02829441

The results are almost the same as Table 2 in the paper, except that the mean of the LexicalNP-SubjectRC condition is 0.896 here instead of the reported 0.897; either that was a typo or the datasets differ slightly.

I will first replicate the logistic regressions and their mixed-effects extensions; the ANOVA results come afterwards.

The Design package has since been replaced by the rms package on CRAN, so I will use the latter instead.

library(rms)
lrm(Correct ~ Extraction * NPType, data = d)
## 
## Logistic Regression Model
## 
## lrm(formula = Correct ~ Extraction * NPType, data = d)
## 
##                       Model Likelihood     Discrimination    Rank Discrim.    
##                          Ratio Test            Indexes          Indexes       
## Obs           696    LR chi2      51.98    R2       0.126    C       0.707    
##  incorrect    104    d.f.             3    g        0.920    Dxy     0.414    
##  correct      592    Pr(> chi2) <0.0001    gr       2.509    gamma   0.535    
## max |deriv| 3e-10                          gp       0.105    tau-a   0.105    
##                                            Brier    0.117                     
## 
##                                     Coef   S.E.   Wald Z Pr(>|Z|)
## Intercept                           0.7956 0.1687 4.72   <0.0001 
## Extraction=subject                  1.3537 0.2953 4.58   <0.0001 
## NPType=pronoun                      0.8880 0.2721 3.26   0.0011  
## Extraction=subject * NPType=pronoun 0.0537 0.5133 0.10   0.9166

The result is almost identical to Jaeger’s, except that some SEs are slightly different, and the p-value for the NPType main effect is 0.0011, not <0.001 as reported by Jaeger.
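
As a cross-check (my addition, not in the paper), the same fixed-effects model can be fit with base R's glm(), whose coefficients should agree with lrm()'s up to rounding:

round(coef(summary(glm(Correct ~ Extraction * NPType, data = d, family = binomial))), 4)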

library(lme4)

Below is the model specification from (15) in Jaeger’s paper.

glm.inter <- glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction * NPType | child), 
                   data = d , family = "binomial")
## Warning in (function (fn, par, lower = rep.int(-Inf, n), upper =
## rep.int(Inf, : failure to converge in 10000 evaluations
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.258724
## (tol = 0.001, component 9)
summary(glm.inter)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Correct ~ 1 + Extraction * NPType + (1 + Extraction * NPType |  
##     child)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##    540.2    603.9   -256.1    512.2      682 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -6.7940  0.1060  0.2632  0.4735  1.0044 
## 
## Random effects:
##  Groups Name                            Variance Std.Dev. Corr          
##  child  (Intercept)                     0.25295  0.5029                 
##         Extractionsubject               0.67634  0.8224   0.68          
##         NPTypepronoun                   0.12657  0.3558   0.95 0.60     
##         Extractionsubject:NPTypepronoun 0.09069  0.3011   0.61 0.43 0.82
## Number of obs: 696, groups:  child, 24
## 
## Fixed effects:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       0.8360     0.2053   4.073 4.65e-05 ***
## Extractionsubject                 1.8271     0.5046   3.621 0.000294 ***
## NPTypepronoun                     1.0804     0.3323   3.252 0.001148 ** 
## Extractionsubject:NPTypepronoun   0.4647     1.0983   0.423 0.672210    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Extrct NPTypp
## Extrctnsbjc -0.189              
## NPTypepronn -0.346  0.230       
## Extrctn:NPT  0.173 -0.394 -0.297

The results are similar to those reported in Tables 5 and 6 of the paper, but not identical. For example, while all other coefficients are close to those reported, the coefficient for the interaction term is 0.4647 here, as opposed to 0.59; the random subject effects and correlations also differ somewhat. This might be due to differences in the datasets, or to differences across versions of lme4: these models have no analytic solution, only approximate optimization, and the default approximation and optimizer have changed over the package's history.
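
The convergence warnings above are unsurprising: a maximal random-effects structure is a lot to estimate from 24 children. One common remedy (a sketch; the refit may still differ slightly from Jaeger's numbers) is to switch optimizers and raise the evaluation budget via glmerControl():

glm.inter.b <- glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction * NPType | child),
                     data = d, family = "binomial",
                     control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e5)))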

The intermediate model keeps random slopes for both main effects but drops the random slope for their interaction:

glm.nointer <- glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction + NPType | child), 
                   data = d , family = "binomial")
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.00768967
## (tol = 0.001, component 6)

Next, the model specification from (16) in Jaeger’s paper is as follows:

glm.e <- glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction | child), 
               data = d , family = "binomial")
summary(glm.e)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Correct ~ 1 + Extraction * NPType + (1 + Extraction | child)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##    527.6    559.4   -256.8    513.6      689 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -6.0604  0.1170  0.2792  0.4447  1.0804 
## 
## Random effects:
##  Groups Name              Variance Std.Dev. Corr
##  child  (Intercept)       0.3946   0.6282       
##         Extractionsubject 0.7617   0.8727   0.64
## Number of obs: 696, groups:  child, 24
## 
## Fixed effects:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      0.85958    0.22115   3.887 0.000102 ***
## Extractionsubject                1.91571    0.48400   3.958 7.55e-05 ***
## NPTypepronoun                    0.95645    0.28288   3.381 0.000722 ***
## Extractionsubject:NPTypepronoun  0.09542    0.53940   0.177 0.859589    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Extrct NPTypp
## Extrctnsbjc -0.160              
## NPTypepronn -0.468  0.212       
## Extrctn:NPT  0.245 -0.314 -0.524

The results are close to those reported in Table 7 and 8 in the paper. Again, there are some differences, e.g. SE estimates, but in general they are comparable.

anova(glm.inter, glm.nointer, glm.e)
## Data: d
## Models:
## glm.e: Correct ~ 1 + Extraction * NPType + (1 + Extraction | child)
## glm.nointer: Correct ~ 1 + Extraction * NPType + (1 + Extraction + NPType | 
## glm.nointer:     child)
## glm.inter: Correct ~ 1 + Extraction * NPType + (1 + Extraction * NPType | 
## glm.inter:     child)
##             Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
## glm.e        7 527.55 559.37 -256.78   513.55                         
## glm.nointer 10 532.32 577.77 -256.16   512.32 1.2348      3     0.7447
## glm.inter   14 540.24 603.87 -256.12   512.24 0.0804      4     0.9992

Neither likelihood-ratio test approaches significance: adding a random slope for NPType (glm.nointer), and then also for the interaction (glm.inter), does not improve the fit over the model with only a random Extraction slope, and glm.e has the lowest AIC and BIC. This is the comparison Jaeger uses to motivate the reduced random-effects structure in (16).

The model specification from (17) in Jaeger’s paper is as follows:

summary(glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction | child) + (1 | itemby4), 
        data = d , family = "binomial"))
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Correct ~ 1 + Extraction * NPType + (1 + Extraction | child) +  
##     (1 | itemby4)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##    527.9    564.3   -256.0    511.9      688 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -6.3939  0.1164  0.2602  0.4404  1.1441 
## 
## Random effects:
##  Groups  Name              Variance Std.Dev. Corr
##  child   (Intercept)       0.42052  0.6485       
##          Extractionsubject 0.76766  0.8762   0.62
##  itemby4 (Intercept)       0.08672  0.2945       
## Number of obs: 696, groups:  child, 24; itemby4, 8
## 
## Fixed effects:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      0.85050    0.24997   3.402 0.000668 ***
## Extractionsubject                1.96117    0.48873   4.013    6e-05 ***
## NPTypepronoun                    0.99162    0.29144   3.402 0.000668 ***
## Extractionsubject:NPTypepronoun  0.06943    0.54516   0.127 0.898661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Extrct NPTypp
## Extrctnsbjc -0.155              
## NPTypepronn -0.429  0.230       
## Extrctn:NPT  0.230 -0.322 -0.533

The results are close to those reported in Table 9 and 10 in the paper. Again, not perfect replication, but comparable.

Jaeger argues that further elaboration of the random-effects structure is unnecessary, because the additional random slopes are highly correlated with the random intercept (rs > 0.8) and hence capture little independent variance. The two richer by-item models below bear this out: each adds parameters without improving AIC or BIC.

summary(glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction | child) + (1 + Extraction | itemby4), 
        data = d, family = "binomial"))
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Correct ~ 1 + Extraction * NPType + (1 + Extraction | child) +  
##     (1 + Extraction | itemby4)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##    530.1    575.5   -255.0    510.1      686 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.3149  0.1154  0.2648  0.4321  1.2667 
## 
## Random effects:
##  Groups  Name              Variance Std.Dev. Corr 
##  child   (Intercept)       0.4477   0.6691        
##          Extractionsubject 0.7890   0.8882   0.54 
##  itemby4 (Intercept)       0.2055   0.4533        
##          Extractionsubject 0.3437   0.5862   -0.90
## Number of obs: 696, groups:  child, 24; itemby4, 8
## 
## Fixed effects:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     0.834049   0.283775   2.939 0.003291 ** 
## Extractionsubject               1.968000   0.539628   3.647 0.000265 ***
## NPTypepronoun                   1.057549   0.307077   3.444 0.000573 ***
## Extractionsubject:NPTypepronoun 0.003758   0.554393   0.007 0.994591    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Extrct NPTypp
## Extrctnsbjc -0.339              
## NPTypepronn -0.394  0.206       
## Extrctn:NPT  0.218 -0.284 -0.554
summary(glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction | child) + (1 + Extraction * NPType | itemby4), 
        data = d, family = "binomial"))
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Correct ~ 1 + Extraction * NPType + (1 + Extraction | child) +  
##     (1 + Extraction * NPType | itemby4)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##    539.3    616.6   -252.7    505.3      679 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.4403  0.0914  0.2330  0.4317  1.1896 
## 
## Random effects:
##  Groups  Name                            Variance Std.Dev. Corr             
##  child   (Intercept)                     0.4550   0.6746                    
##          Extractionsubject               0.8468   0.9202    0.55            
##  itemby4 (Intercept)                     0.1113   0.3336                    
##          Extractionsubject               0.1404   0.3746   -0.50            
##          NPTypepronoun                   0.1517   0.3895    0.95 -0.76      
##          Extractionsubject:NPTypepronoun 2.8540   1.6894   -0.72 -0.24 -0.45
## Number of obs: 696, groups:  child, 24; itemby4, 8
## 
## Fixed effects:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       0.8433     0.2593   3.252 0.001145 ** 
## Extractionsubject                 2.0157     0.5257   3.834 0.000126 ***
## NPTypepronoun                     1.1208     0.3638   3.081 0.002065 ** 
## Extractionsubject:NPTypepronoun   0.4186     0.9802   0.427 0.669354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Extrct NPTypp
## Extrctnsbjc -0.198              
## NPTypepronn -0.176  0.098       
## Extrctn:NPT -0.072 -0.196 -0.419
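
To put numbers on this, one can compare the three by-item random-effects structures with likelihood-ratio tests, just as with the by-child structures above (a sketch; the fits above were not stored, so they are refit here under assumed names):

m.item0 <- glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction | child) + (1 | itemby4),
                 data = d, family = "binomial")
m.item1 <- glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction | child) + (1 + Extraction | itemby4),
                 data = d, family = "binomial")
m.item2 <- glmer(Correct ~ 1 + Extraction * NPType + (1 + Extraction | child) + (1 + Extraction * NPType | itemby4),
                 data = d, family = "binomial")
anova(m.item0, m.item1, m.item2)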

For the ANOVAs, conduct the analyses using the aov() function; bonus points if you can also figure out how to do the same thing using a regression with lmer() or glmer(). (Note: for this question you’ll have to work out how to include random effects structure in an ANOVA in R, using the aov() function. This has a different syntax from lmer(). This is something that we didn’t cover in class, but using your Googling skills to hack it out is good practice navigating the R jungle.)

First, Jaeger mentions that he discarded two items in the ANOVA analyses. Aggregating the data by item and condition, and tabulating the result (below), shows that items 3 and 5 are the ones systematically missing.

d.item <- aggregate(CorrectN ~ itemby4 + Extraction + NPType, data=d, mean)
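
Each complete item should contribute four condition cells (this check is my addition; any equivalent tabulation works):

table(d.item$itemby4)  # items 3 and 5 contribute fewer (or no) cells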

Hence we discard items 3 and 5.

df <- d[d$itemby4!=3 & d$itemby4!=5,]
df.subj <- aggregate(CorrectN ~ child + Extraction + NPType, data=df, mean)
df.item <- aggregate(CorrectN ~ itemby4 + Extraction + NPType, data=df, mean)

Now run the by-subjects (F1) and by-items (F2) ANOVAs, respectively:

summary(aov(CorrectN ~ 1 + Extraction * NPType + Error(child / (Extraction * NPType)), data=df.subj))
## 
## Error: child
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 23  1.169 0.05084               
## 
## Error: child:Extraction
##            Df Sum Sq Mean Sq F value   Pr(>F)    
## Extraction  1 0.5056  0.5056   24.24 5.64e-05 ***
## Residuals  23 0.4797  0.0209                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: child:NPType
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## NPType     1 0.3000 0.30001   16.14 0.000538 ***
## Residuals 23 0.4275 0.01859                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: child:Extraction:NPType
##                   Df  Sum Sq Mean Sq F value  Pr(>F)   
## Extraction:NPType  1 0.09584 0.09584    9.75 0.00479 **
## Residuals         23 0.22610 0.00983                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(CorrectN ~ 1 + Extraction * NPType + Error(itemby4 / (Extraction * NPType)), data=df.item))
## 
## Error: itemby4
##           Df  Sum Sq Mean Sq F value Pr(>F)
## Residuals  5 0.05633 0.01127               
## 
## Error: itemby4:Extraction
##            Df  Sum Sq Mean Sq F value Pr(>F)  
## Extraction  1 0.12832 0.12832   10.21 0.0241 *
## Residuals   5 0.06285 0.01257                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: itemby4:NPType
##           Df  Sum Sq Mean Sq F value  Pr(>F)   
## NPType     1 0.07176 0.07176   19.67 0.00679 **
## Residuals  5 0.01824 0.00365                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: itemby4:Extraction:NPType
##                   Df  Sum Sq  Mean Sq F value Pr(>F)  
## Extraction:NPType  1 0.02569 0.025689   12.59 0.0164 *
## Residuals          5 0.01020 0.002041                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And for the combined analysis, the statistic is \(\min F' = F_1 \times F_2 / (F_1 + F_2)\). For example, for the interaction term in Table 3, the combined F-value is

9.7 * 12.6 / (9.7 + 12.6)
## [1] 5.480717
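
A one-line helper (my addition) makes the min-F′ computation reusable for the other rows of Table 3:

# min-F': conservative combination of by-subjects (F1) and by-items (F2) tests
minF <- function(F1, F2) F1 * F2 / (F1 + F2)
minF(9.7, 12.6)  # same value as above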

Next, the by-subjects ANOVA after the arcsine-square-root transformation:

trans.subj <- df.subj 
trans.subj$CorrectN <- asin(sqrt(df.subj$CorrectN))
summary(aov(CorrectN ~ 1 + Extraction * NPType + Error(child / (Extraction * NPType)), data=trans.subj))
## 
## Error: child
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 23  3.151   0.137               
## 
## Error: child:Extraction
##            Df Sum Sq Mean Sq F value   Pr(>F)    
## Extraction  1  1.682   1.682   28.48 2.03e-05 ***
## Residuals  23  1.358   0.059                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: child:NPType
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## NPType     1 0.8416  0.8416   17.28 0.000381 ***
## Residuals 23 1.1202  0.0487                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: child:Extraction:NPType
##                   Df Sum Sq Mean Sq F value  Pr(>F)   
## Extraction:NPType  1 0.2575 0.25752   8.513 0.00775 **
## Residuals         23 0.6958 0.03025                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The spurious interaction persists.

For the bonus question, the same ANOVAs can be reproduced by applying anova() to the corresponding linear mixed-effects models fit with lmer():

anova(lmer(CorrectN ~ Extraction*NPType + (1|child) + (1|Extraction:child) + (1|NPType:child), data=df.subj)  )
## Analysis of Variance Table
##                   Df   Sum Sq  Mean Sq F value
## Extraction         1 0.238286 0.238286 24.2397
## NPType             1 0.158677 0.158677 16.1414
## Extraction:NPType  1 0.095845 0.095845  9.7498
anova(lmer(CorrectN ~ Extraction*NPType + (1|itemby4) + (1|Extraction:itemby4) + (1|NPType:itemby4), data=df.item)  )
## Analysis of Variance Table
##                   Df   Sum Sq  Mean Sq F value
## Extraction         1 0.023515 0.023515  11.298
## NPType             1 0.042304 0.042304  20.325
## Extraction:NPType  1 0.025689 0.025689  12.342
anova(lmer(CorrectN ~ Extraction*NPType + (1|child) + (1|Extraction:child) + (1|NPType:child), data=trans.subj)  )
## Analysis of Variance Table
##                   Df  Sum Sq Mean Sq F value
## Extraction         1 0.86174 0.86174 28.4851
## NPType             1 0.52280 0.52280 17.2814
## Extraction:NPType  1 0.25752 0.25752  8.5125

The by-subjects F-values from lmer() match the aov() results exactly (24.24, 16.14, and 9.75 untransformed; 8.51 for the transformed interaction), while the by-items F-values differ slightly (e.g. 12.34 here vs. 12.59 from aov() for the interaction); with only six items, the lmer() variance-component estimates need not reproduce the aov() error strata exactly.