Mixed effects models tutorial – linear regression

Set the path to your working directory so that, if we save any output along the way, you’ll know where to find it. You can see your current working directory with getwd().

setwd("/Users/titlis/cogsci/teaching/_2016/mem_tutorial/")

Load the languageR package. If it’s not yet installed you’ll get an error saying “Error in library(languageR) : there is no package called ‘languageR’”. To install the package, first type and execute install.packages("languageR"). (This generalizes to any package, using the name of the package instead of “languageR”.)

library(languageR)
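The install-then-load step can be wrapped in a small helper so a script runs whether or not the package is already installed. This is a minimal sketch; the name load_or_install is made up for illustration and is not part of languageR or base R.

```r
# Hypothetical helper (not from any package): load a package,
# installing it first only if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() take the name as a string
  library(pkg, character.only = TRUE)
}
```

With this defined, load_or_install("languageR") would replace the separate install.packages() and library() calls.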

This also makes available the lexical decision time dataset (lexdec) from Baayen et al. (2006), which we will be modeling extensively. To see two different summaries of the dataset and its first few lines:

summary(lexdec)

##     Subject           RT            Trial     Sex      NativeLanguage
##  A1     :  79   Min.   :5.829   Min.   : 23   F:1106   English:948   
##  A2     :  79   1st Qu.:6.215   1st Qu.: 64   M: 553   Other  :711   
##  A3     :  79   Median :6.346   Median :106                          
##  C      :  79   Mean   :6.385   Mean   :105                          
##  D      :  79   3rd Qu.:6.502   3rd Qu.:146                          
##  I      :  79   Max.   :7.587   Max.   :185                          
##  (Other):1185                                                        
##       Correct        PrevType      PrevCorrect          Word     
##  correct  :1594   nonword:855   correct  :1542   almond   :  21  
##  incorrect:  65   word   :804   incorrect: 117   ant      :  21  
##                                                  apple    :  21  
##                                                  apricot  :  21  
##                                                  asparagus:  21  
##                                                  avocado  :  21  
##                                                  (Other)  :1533  
##    Frequency       FamilySize      SynsetCount         Length      
##  Min.   :1.792   Min.   :0.0000   Min.   :0.6931   Min.   : 3.000  
##  1st Qu.:3.951   1st Qu.:0.0000   1st Qu.:1.0986   1st Qu.: 5.000  
##  Median :4.754   Median :0.0000   Median :1.0986   Median : 6.000  
##  Mean   :4.751   Mean   :0.7028   Mean   :1.3154   Mean   : 5.911  
##  3rd Qu.:5.652   3rd Qu.:1.0986   3rd Qu.:1.6094   3rd Qu.: 7.000  
##  Max.   :7.772   Max.   :3.3322   Max.   :2.3026   Max.   :10.000  
##                                                                    
##     Class      FreqSingular      FreqPlural     DerivEntropy   
##  animal:924   Min.   :   4.0   Min.   :  0.0   Min.   :0.0000  
##  plant :735   1st Qu.:  23.0   1st Qu.: 19.0   1st Qu.:0.0000  
##               Median :  69.0   Median : 49.0   Median :0.0370  
##               Mean   : 132.1   Mean   :109.7   Mean   :0.3856  
##               3rd Qu.: 146.0   3rd Qu.:132.0   3rd Qu.:0.6845  
##               Max.   :1518.0   Max.   :854.0   Max.   :2.2641  
##                                                                
##     Complex         rInfl             meanRT         SubjFreq    
##  complex: 210   Min.   :-1.3437   Min.   :6.245   Min.   :2.000  
##  simplex:1449   1st Qu.:-0.3023   1st Qu.:6.322   1st Qu.:3.160  
##                 Median : 0.1900   Median :6.364   Median :3.880  
##                 Mean   : 0.2845   Mean   :6.379   Mean   :3.911  
##                 3rd Qu.: 0.6385   3rd Qu.:6.420   3rd Qu.:4.680  
##                 Max.   : 4.4427   Max.   :6.621   Max.   :6.040  
##                                                                  
##     meanSize       meanWeight          BNCw               BNCc        
##  Min.   :1.323   Min.   :0.8244   Min.   : 0.02229   Min.   : 0.0000  
##  1st Qu.:1.890   1st Qu.:1.4590   1st Qu.: 1.64921   1st Qu.: 0.1625  
##  Median :3.099   Median :2.7558   Median : 3.32071   Median : 0.6500  
##  Mean   :2.891   Mean   :2.5516   Mean   : 7.37800   Mean   : 5.0351  
##  3rd Qu.:3.711   3rd Qu.:3.4178   3rd Qu.: 7.10943   3rd Qu.: 2.9248  
##  Max.   :4.819   Max.   :4.7138   Max.   :79.17324   Max.   :83.1949  
##                                                                       
##       BNCd           BNCcRatio         BNCdRatio     
##  Min.   :  0.000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:  1.188   1st Qu.:0.09673   1st Qu.:0.5551  
##  Median :  3.800   Median :0.27341   Median :0.9349  
##  Mean   : 12.995   Mean   :0.45834   Mean   :1.5428  
##  3rd Qu.: 10.451   3rd Qu.:0.55550   3rd Qu.:2.1315  
##  Max.   :241.561   Max.   :8.29545   Max.   :6.3458  
##

str(lexdec)

## 'data.frame':    1659 obs. of  28 variables:
##  $ Subject       : Factor w/ 21 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ RT            : num  6.34 6.31 6.35 6.19 6.03 ...
##  $ Trial         : int  23 27 29 30 32 33 34 38 41 42 ...
##  $ Sex           : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NativeLanguage: Factor w/ 2 levels "English","Other": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Correct       : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PrevType      : Factor w/ 2 levels "nonword","word": 2 1 1 2 1 2 2 1 1 2 ...
##  $ PrevCorrect   : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Word          : Factor w/ 79 levels "almond","ant",..: 55 47 20 58 25 12 71 69 62 1 ...
##  $ Frequency     : num  4.86 4.61 5 4.73 7.67 ...
##  $ FamilySize    : num  1.386 1.099 0.693 0 3.135 ...
##  $ SynsetCount   : num  0.693 1.946 1.609 1.099 2.079 ...
##  $ Length        : int  3 4 6 4 3 10 10 8 6 6 ...
##  $ Class         : Factor w/ 2 levels "animal","plant": 1 1 2 2 1 2 2 1 2 2 ...
##  $ FreqSingular  : int  54 69 83 44 1233 26 50 63 11 24 ...
##  $ FreqPlural    : int  74 30 49 68 828 31 65 47 9 42 ...
##  $ DerivEntropy  : num  0.791 0.697 0.475 0 1.213 ...
##  $ Complex       : Factor w/ 2 levels "complex","simplex": 2 2 2 2 2 1 1 2 2 2 ...
##  $ rInfl         : num  -0.31 0.815 0.519 -0.427 0.398 ...
##  $ meanRT        : num  6.36 6.42 6.34 6.34 6.3 ...
##  $ SubjFreq      : num  3.12 2.4 3.88 4.52 6.04 3.28 5.04 2.8 3.12 3.72 ...
##  $ meanSize      : num  3.48 3 1.63 1.99 4.64 ...
##  $ meanWeight    : num  3.18 2.61 1.21 1.61 4.52 ...
##  $ BNCw          : num  12.06 5.74 5.72 2.05 74.84 ...
##  $ BNCc          : num  0 4.06 3.25 1.46 50.86 ...
##  $ BNCd          : num  6.18 2.85 12.59 7.36 241.56 ...
##  $ BNCcRatio     : num  0 0.708 0.568 0.713 0.68 ...
##  $ BNCdRatio     : num  0.512 0.497 2.202 3.591 3.228 ...

head(lexdec)

##   Subject       RT Trial Sex NativeLanguage Correct PrevType PrevCorrect
## 1      A1 6.340359    23   F        English correct     word     correct
## 2      A1 6.308098    27   F        English correct  nonword     correct
## 3      A1 6.349139    29   F        English correct  nonword     correct
## 4      A1 6.186209    30   F        English correct     word     correct
## 5      A1 6.025866    32   F        English correct  nonword     correct
## 6      A1 6.180017    33   F        English correct     word     correct
##         Word Frequency FamilySize SynsetCount Length  Class FreqSingular
## 1        owl  4.859812  1.3862944   0.6931472      3 animal           54
## 2       mole  4.605170  1.0986123   1.9459101      4 animal           69
## 3     cherry  4.997212  0.6931472   1.6094379      6  plant           83
## 4       pear  4.727388  0.0000000   1.0986123      4  plant           44
## 5        dog  7.667626  3.1354942   2.0794415      3 animal         1233
## 6 blackberry  4.060443  0.6931472   1.3862944     10  plant           26
##   FreqPlural DerivEntropy Complex      rInfl meanRT SubjFreq meanSize
## 1         74       0.7912 simplex -0.3101549 6.3582     3.12   3.4758
## 2         30       0.6968 simplex  0.8145080 6.4150     2.40   2.9999
## 3         49       0.4754 simplex  0.5187938 6.3426     3.88   1.6278
## 4         68       0.0000 simplex -0.4274440 6.3353     4.52   1.9908
## 5        828       1.2129 simplex  0.3977961 6.2956     6.04   4.6429
## 6         31       0.3492 complex -0.1698990 6.3959     3.28   1.5831
##   meanWeight      BNCw      BNCc       BNCd BNCcRatio BNCdRatio
## 1     3.1806 12.057065  0.000000   6.175602  0.000000  0.512198
## 2     2.6112  5.738806  4.062251   2.850278  0.707856  0.496667
## 3     1.2081  5.716520  3.249801  12.588727  0.568493  2.202166
## 4     1.6114  2.050370  1.462410   7.363218  0.713242  3.591166
## 5     4.5167 74.838494 50.859385 241.561040  0.679589  3.227765
## 6     1.1365  1.270338  0.162490   1.187616  0.127911  0.934882

To get information about the dataset provided by the authors:

?lexdec

We are interested in modeling response times (coded in the data frame as ‘RT’). The first step is always to understand some basic things about your data.

How many data points are there?

nrow(lexdec)

## [1] 1659

How many unique participants are there?

length(levels(lexdec$Subject))

## [1] 21
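One caveat when counting participants this way: levels() reports every level a factor was defined with, including levels that no longer occur after subsetting. The toy factor below (made-up values, not the lexdec data) sketches the difference.

```r
# Toy factor with three participants (illustrative values only)
subj <- factor(c("A1", "A1", "A2", "C"))

# Drop participant "C"; the level "C" is kept even though no rows use it
kept <- subj[subj != "C"]

n_levels  <- length(levels(kept))    # counts the unused level too
n_present <- length(unique(kept))    # counts only participants with data
n_dropped <- nlevels(droplevels(kept))  # same, after removing unused levels
```

On the full lexdec data frame both approaches should agree; they can diverge once you exclude participants.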

What are the mean, minimum, maximum, and standard deviation of the response times?

mean(lexdec$RT)

## [1] 6.38509

min(lexdec$RT)

## [1] 5.828946

max(lexdec$RT)

## [1] 7.587311

sd(lexdec$RT)

## [1] 0.2415091
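These values are on a log scale, which is hard to read as response times. As a rough sketch, assuming RT is the natural log of the response time in milliseconds, exp() converts back. The six values below are copied from the head(lexdec) output above so the snippet runs without the package.

```r
# First six log RTs, copied from the head(lexdec) output
log_rt <- c(6.340359, 6.308098, 6.349139, 6.186209, 6.025866, 6.180017)

# Assumption: RT is the natural log of milliseconds
rt_ms <- exp(log_rt)
round(rt_ms)

# The four descriptive statistics collected in one named vector
rt_stats <- c(mean = mean(log_rt), min = min(log_rt),
              max = max(log_rt), sd = sd(log_rt))
rt_stats
```

On the full dataset, replacing log_rt with lexdec$RT gives the same four numbers as the separate calls.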

Let’s start with a simple model – see slides. We start by asking whether frequency has a linear effect on log RTs:

m = lm(RT ~ Frequency, data=lexdec)
summary(m)

## 
## Call:
## lm(formula = RT ~ Frequency, data = lexdec)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55407 -0.16153 -0.03494  0.11699  1.08768 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.588778   0.022296 295.515   <2e-16 ***
## Frequency   -0.042872   0.004533  -9.459   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2353 on 1657 degrees of freedom
## Multiple R-squared:  0.05123,    Adjusted R-squared:  0.05066 
## F-statistic: 89.47 on 1 and 1657 DF,  p-value: < 2.2e-16
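To see what the two coefficients mean, it can help to compute a prediction by hand. The sketch below plugs the estimates printed above into the regression equation; predict_log_rt is a made-up name for illustration.

```r
# Estimates copied from the summary(m) output
b0 <- 6.588778   # intercept: predicted log RT at Frequency = 0
b1 <- -0.042872  # change in log RT per unit increase in Frequency

# Hypothetical helper: predicted log RT for a given Frequency
predict_log_rt <- function(freq) b0 + b1 * freq

# At the mean Frequency (4.751, from summary(lexdec)) the prediction
# should land very close to the mean RT (6.385)
pred_at_mean <- predict_log_rt(4.751)
pred_at_mean
```

With the fitted model object, predict(m, newdata = data.frame(Frequency = 4.751)) does the same calculation.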

Extend the simple model to include an additional predictor for morphological family size. If you don’t remember the name of the family size column in the data frame, use the names command.

names(lexdec)

##  [1] "Subject"        "RT"             "Trial"          "Sex"           
##  [5] "NativeLanguage" "Correct"        "PrevType"       "PrevCorrect"   
##  [9] "Word"           "Frequency"      "FamilySize"     "SynsetCount"   
## [13] "Length"         "Class"          "FreqSingular"   "FreqPlural"    
## [17] "DerivEntropy"   "Complex"        "rInfl"          "meanRT"        
## [21] "SubjFreq"       "meanSize"       "meanWeight"     "BNCw"          
## [25] "BNCc"           "BNCd"           "BNCcRatio"      "BNCdRatio"

m = lm(RT ~ Frequency + FamilySize, data=lexdec)
summary(m)

## 
## Call:
## lm(formula = RT ~ Frequency + FamilySize, data = lexdec)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55100 -0.16080 -0.03438  0.12044  1.09688 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.563853   0.026826 244.685  < 2e-16 ***
## Frequency   -0.035310   0.006407  -5.511 4.13e-08 ***
## FamilySize  -0.015655   0.009380  -1.669   0.0953 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2352 on 1656 degrees of freedom
## Multiple R-squared:  0.05282,    Adjusted R-squared:  0.05168 
## F-statistic: 46.17 on 2 and 1656 DF,  p-value: < 2.2e-16
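Another way to ask whether the added predictor helps is to compare the two nested models directly with anova(). The sketch below uses simulated toy data (not lexdec; the variable names and effect sizes are invented) to show that, when the models differ by one term, the F statistic equals the square of that term's t value, so both tests give the same p-value.

```r
# Simulated toy data; values are invented for illustration
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 6.5 - 0.04 * toy$x1 - 0.02 * toy$x2 + rnorm(n, sd = 0.2)

m_small <- lm(y ~ x1, data = toy)       # one predictor
m_big   <- lm(y ~ x1 + x2, data = toy)  # adds a second predictor

# F test for the nested model comparison
cmp <- anova(m_small, m_big)
f_stat <- cmp$F[2]

# t value of the added predictor in the larger model
t_stat <- coef(summary(m_big))["x2", "t value"]

all.equal(f_stat, t_stat^2)
```

The comparison becomes more useful when the larger model adds several terms at once, where no single t value summarizes the difference.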

Extend the model to include a predictor for participants’ native language (English vs other). By default R dummy-codes (treatment-codes) categorical predictors: it orders the levels of the predictor alphabetically and assigns 0 to the first level (the reference level) and 1 to the second. If you’re not sure how a predictor is coded (or if you want to change the default coding), you can use the contrasts() function.

m = lm(RT ~ Frequency + FamilySize + NativeLanguage, data=lexdec)
summary(m)

## 
## Call:
## lm(formula = RT ~ Frequency + FamilySize + NativeLanguage, data = lexdec)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64004 -0.14772 -0.02987  0.11492  1.05382 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.497073   0.025784 251.977  < 2e-16 ***
## Frequency           -0.035310   0.006054  -5.832 6.56e-09 ***
## FamilySize          -0.015655   0.008863  -1.766   0.0775 .  
## NativeLanguageOther  0.155821   0.011025  14.133  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2222 on 1655 degrees of freedom
## Multiple R-squared:  0.1548, Adjusted R-squared:  0.1533 
## F-statistic: 101.1 on 3 and 1655 DF,  p-value: < 2.2e-16

contrasts(lexdec$NativeLanguage)

##         Other
## English     0
## Other       1
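Changing the coding might look like the sketch below, which uses a small stand-in factor rather than lexdec itself. relevel() switches the reference level, and assigning contr.sum() switches to sum coding.

```r
# Stand-in factor with the same two levels as NativeLanguage
lang <- factor(c("English", "Other", "English", "Other"))

# Default treatment coding: English = 0 (reference), Other = 1
default_coding <- contrasts(lang)

# Make "Other" the reference level instead
lang_releveled <- relevel(lang, ref = "Other")
releveled_coding <- contrasts(lang_releveled)

# Sum coding: English = 1, Other = -1
lang_sum <- lang
contrasts(lang_sum) <- contr.sum(2)
sum_coding <- contrasts(lang_sum)
```

The coding changes what the intercept and the other coefficients mean, so check contrasts() before reading the model output.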

Extend the model to include the interaction between frequency and native language.

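# Two equivalent ways to specify the same model: Frequency*NativeLanguage
# expands to Frequency + NativeLanguage + Frequency:NativeLanguage.
m = lm(RT ~ Frequency + FamilySize + NativeLanguage + Frequency:NativeLanguage, data=lexdec)
m = lm(RT ~ FamilySize + Frequency*NativeLanguage, data=lexdec)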
summary(m)

## 
## Call:
## lm(formula = RT ~ FamilySize + Frequency * NativeLanguage, data = lexdec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6693 -0.1492 -0.0280  0.1163  1.0679 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.441135   0.031140 206.847  < 2e-16 ***
## FamilySize                    -0.015655   0.008839  -1.771 0.076726 .  
## Frequency                     -0.023536   0.007079  -3.325 0.000905 ***
## NativeLanguageOther            0.286343   0.042432   6.748 2.06e-11 ***
## Frequency:NativeLanguageOther -0.027472   0.008626  -3.185 0.001475 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2216 on 1654 degrees of freedom
## Multiple R-squared:   0.16,  Adjusted R-squared:  0.1579 
## F-statistic: 78.75 on 4 and 1654 DF,  p-value: < 2.2e-16
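One way to convince yourself that the `*` and `:` formulas describe the same model is to compare the terms R expands them into. This short check needs no data, since terms() works on the formula alone.

```r
# The two formulas used in the lm() calls
f_long  <- RT ~ Frequency + FamilySize + NativeLanguage + Frequency:NativeLanguage
f_short <- RT ~ FamilySize + Frequency * NativeLanguage

# Expand each formula into its individual model terms
terms_long  <- attr(terms(f_long), "term.labels")
terms_short <- attr(terms(f_short), "term.labels")

# Same set of terms; only the order differs
same_terms <- setequal(terms_long, terms_short)
same_terms
```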

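Simple effects analysis of the interaction: removing the Frequency main effect from the formula makes R estimate a separate frequency slope for each native language group. This is a reparameterization of the same model, so the fit is unchanged. The slope is more negative for non-native speakers (-0.051) than for native English speakers (-0.024).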

m = lm(RT ~ FamilySize + Frequency*NativeLanguage - Frequency, data=lexdec)
summary(m)

## 
## Call:
## lm(formula = RT ~ FamilySize + Frequency * NativeLanguage - Frequency, 
##     data = lexdec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6693 -0.1492 -0.0280  0.1163  1.0679 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.441135   0.031140 206.847  < 2e-16 ***
## FamilySize                      -0.015655   0.008839  -1.771 0.076726 .  
## NativeLanguageOther              0.286343   0.042432   6.748 2.06e-11 ***
## Frequency:NativeLanguageEnglish -0.023536   0.007079  -3.325 0.000905 ***
## Frequency:NativeLanguageOther   -0.051008   0.007794  -6.545 7.94e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2216 on 1654 degrees of freedom
## Multiple R-squared:   0.16,  Adjusted R-squared:  0.1579 
## F-statistic: 78.75 on 4 and 1654 DF,  p-value: < 2.2e-16
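The per-group slopes in this output can also be recovered by hand from the interaction model: the slope for the reference group is the Frequency coefficient, and the slope for the other group is that coefficient plus the interaction term. A quick check with the printed estimates:

```r
# Estimates copied from the interaction model output
b_frequency   <- -0.023536  # Frequency slope for the reference group (English)
b_interaction <- -0.027472  # Frequency:NativeLanguageOther

slope_english <- b_frequency
slope_other   <- b_frequency + b_interaction  # matches Frequency:NativeLanguageOther

c(English = slope_english, Other = slope_other)
```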

jdegen

Sep 16, 2016