Set the path to your working directory. In case we save any output along the way, you’ll know where to find it. You can see your working directory with getwd().
setwd("/Users/titlis/cogsci/teaching/_2016/mem_tutorial/")
Load the languageR package. If it’s not yet installed you’ll get an error saying “Error in library(languageR) : there is no package called ‘languageR’”. To install the package, first type and execute install.packages("languageR"). (This generalizes to any package, using the name of the package instead of “languageR”.)
library(languageR)
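The install-then-load step can be wrapped in a small convenience function. This is only a sketch: `load_or_install` is a hypothetical helper name, not something provided by base R or languageR, and `install.packages()` may still ask you to pick a CRAN mirror.

```r
# Hypothetical helper (not part of base R or languageR):
# install a package only if it is missing, then attach it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# Usage (commented out so that nothing gets installed unintentionally):
# load_or_install("languageR")
```

The `character.only = TRUE` argument is needed because the package name is passed as a string rather than typed directly into `library()`.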
Loading the package also makes available the lexical decision time dataset lexdec from Baayen et al. (2006), which we will be modeling extensively. To see two different summaries of the dataset and its first few lines:
summary(lexdec)
## Subject RT Trial Sex NativeLanguage
## A1 : 79 Min. :5.829 Min. : 23 F:1106 English:948
## A2 : 79 1st Qu.:6.215 1st Qu.: 64 M: 553 Other :711
## A3 : 79 Median :6.346 Median :106
## C : 79 Mean :6.385 Mean :105
## D : 79 3rd Qu.:6.502 3rd Qu.:146
## I : 79 Max. :7.587 Max. :185
## (Other):1185
## Correct PrevType PrevCorrect Word
## correct :1594 nonword:855 correct :1542 almond : 21
## incorrect: 65 word :804 incorrect: 117 ant : 21
## apple : 21
## apricot : 21
## asparagus: 21
## avocado : 21
## (Other) :1533
## Frequency FamilySize SynsetCount Length
## Min. :1.792 Min. :0.0000 Min. :0.6931 Min. : 3.000
## 1st Qu.:3.951 1st Qu.:0.0000 1st Qu.:1.0986 1st Qu.: 5.000
## Median :4.754 Median :0.0000 Median :1.0986 Median : 6.000
## Mean :4.751 Mean :0.7028 Mean :1.3154 Mean : 5.911
## 3rd Qu.:5.652 3rd Qu.:1.0986 3rd Qu.:1.6094 3rd Qu.: 7.000
## Max. :7.772 Max. :3.3322 Max. :2.3026 Max. :10.000
##
## Class FreqSingular FreqPlural DerivEntropy
## animal:924 Min. : 4.0 Min. : 0.0 Min. :0.0000
## plant :735 1st Qu.: 23.0 1st Qu.: 19.0 1st Qu.:0.0000
## Median : 69.0 Median : 49.0 Median :0.0370
## Mean : 132.1 Mean :109.7 Mean :0.3856
## 3rd Qu.: 146.0 3rd Qu.:132.0 3rd Qu.:0.6845
## Max. :1518.0 Max. :854.0 Max. :2.2641
##
## Complex rInfl meanRT SubjFreq
## complex: 210 Min. :-1.3437 Min. :6.245 Min. :2.000
## simplex:1449 1st Qu.:-0.3023 1st Qu.:6.322 1st Qu.:3.160
## Median : 0.1900 Median :6.364 Median :3.880
## Mean : 0.2845 Mean :6.379 Mean :3.911
## 3rd Qu.: 0.6385 3rd Qu.:6.420 3rd Qu.:4.680
## Max. : 4.4427 Max. :6.621 Max. :6.040
##
## meanSize meanWeight BNCw BNCc
## Min. :1.323 Min. :0.8244 Min. : 0.02229 Min. : 0.0000
## 1st Qu.:1.890 1st Qu.:1.4590 1st Qu.: 1.64921 1st Qu.: 0.1625
## Median :3.099 Median :2.7558 Median : 3.32071 Median : 0.6500
## Mean :2.891 Mean :2.5516 Mean : 7.37800 Mean : 5.0351
## 3rd Qu.:3.711 3rd Qu.:3.4178 3rd Qu.: 7.10943 3rd Qu.: 2.9248
## Max. :4.819 Max. :4.7138 Max. :79.17324 Max. :83.1949
##
## BNCd BNCcRatio BNCdRatio
## Min. : 0.000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 1.188 1st Qu.:0.09673 1st Qu.:0.5551
## Median : 3.800 Median :0.27341 Median :0.9349
## Mean : 12.995 Mean :0.45834 Mean :1.5428
## 3rd Qu.: 10.451 3rd Qu.:0.55550 3rd Qu.:2.1315
## Max. :241.561 Max. :8.29545 Max. :6.3458
##
str(lexdec)
## 'data.frame': 1659 obs. of 28 variables:
## $ Subject : Factor w/ 21 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ RT : num 6.34 6.31 6.35 6.19 6.03 ...
## $ Trial : int 23 27 29 30 32 33 34 38 41 42 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ NativeLanguage: Factor w/ 2 levels "English","Other": 1 1 1 1 1 1 1 1 1 1 ...
## $ Correct : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
## $ PrevType : Factor w/ 2 levels "nonword","word": 2 1 1 2 1 2 2 1 1 2 ...
## $ PrevCorrect : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
## $ Word : Factor w/ 79 levels "almond","ant",..: 55 47 20 58 25 12 71 69 62 1 ...
## $ Frequency : num 4.86 4.61 5 4.73 7.67 ...
## $ FamilySize : num 1.386 1.099 0.693 0 3.135 ...
## $ SynsetCount : num 0.693 1.946 1.609 1.099 2.079 ...
## $ Length : int 3 4 6 4 3 10 10 8 6 6 ...
## $ Class : Factor w/ 2 levels "animal","plant": 1 1 2 2 1 2 2 1 2 2 ...
## $ FreqSingular : int 54 69 83 44 1233 26 50 63 11 24 ...
## $ FreqPlural : int 74 30 49 68 828 31 65 47 9 42 ...
## $ DerivEntropy : num 0.791 0.697 0.475 0 1.213 ...
## $ Complex : Factor w/ 2 levels "complex","simplex": 2 2 2 2 2 1 1 2 2 2 ...
## $ rInfl : num -0.31 0.815 0.519 -0.427 0.398 ...
## $ meanRT : num 6.36 6.42 6.34 6.34 6.3 ...
## $ SubjFreq : num 3.12 2.4 3.88 4.52 6.04 3.28 5.04 2.8 3.12 3.72 ...
## $ meanSize : num 3.48 3 1.63 1.99 4.64 ...
## $ meanWeight : num 3.18 2.61 1.21 1.61 4.52 ...
## $ BNCw : num 12.06 5.74 5.72 2.05 74.84 ...
## $ BNCc : num 0 4.06 3.25 1.46 50.86 ...
## $ BNCd : num 6.18 2.85 12.59 7.36 241.56 ...
## $ BNCcRatio : num 0 0.708 0.568 0.713 0.68 ...
## $ BNCdRatio : num 0.512 0.497 2.202 3.591 3.228 ...
head(lexdec)
## Subject RT Trial Sex NativeLanguage Correct PrevType PrevCorrect
## 1 A1 6.340359 23 F English correct word correct
## 2 A1 6.308098 27 F English correct nonword correct
## 3 A1 6.349139 29 F English correct nonword correct
## 4 A1 6.186209 30 F English correct word correct
## 5 A1 6.025866 32 F English correct nonword correct
## 6 A1 6.180017 33 F English correct word correct
## Word Frequency FamilySize SynsetCount Length Class FreqSingular
## 1 owl 4.859812 1.3862944 0.6931472 3 animal 54
## 2 mole 4.605170 1.0986123 1.9459101 4 animal 69
## 3 cherry 4.997212 0.6931472 1.6094379 6 plant 83
## 4 pear 4.727388 0.0000000 1.0986123 4 plant 44
## 5 dog 7.667626 3.1354942 2.0794415 3 animal 1233
## 6 blackberry 4.060443 0.6931472 1.3862944 10 plant 26
## FreqPlural DerivEntropy Complex rInfl meanRT SubjFreq meanSize
## 1 74 0.7912 simplex -0.3101549 6.3582 3.12 3.4758
## 2 30 0.6968 simplex 0.8145080 6.4150 2.40 2.9999
## 3 49 0.4754 simplex 0.5187938 6.3426 3.88 1.6278
## 4 68 0.0000 simplex -0.4274440 6.3353 4.52 1.9908
## 5 828 1.2129 simplex 0.3977961 6.2956 6.04 4.6429
## 6 31 0.3492 complex -0.1698990 6.3959 3.28 1.5831
## meanWeight BNCw BNCc BNCd BNCcRatio BNCdRatio
## 1 3.1806 12.057065 0.000000 6.175602 0.000000 0.512198
## 2 2.6112 5.738806 4.062251 2.850278 0.707856 0.496667
## 3 1.2081 5.716520 3.249801 12.588727 0.568493 2.202166
## 4 1.6114 2.050370 1.462410 7.363218 0.713242 3.591166
## 5 4.5167 74.838494 50.859385 241.561040 0.679589 3.227765
## 6 1.1365 1.270338 0.162490 1.187616 0.127911 0.934882
To get information about the dataset provided by the authors:
?lexdec
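We are interested in modeling response times (coded in the data frame as ‘RT’, on a log scale). The first step is always to understand some basic things about your data: how many observations and participants there are, and what the range and spread of the dependent variable look like.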
nrow(lexdec)
## [1] 1659
length(levels(lexdec$Subject))
## [1] 21
mean(lexdec$RT)
## [1] 6.38509
min(lexdec$RT)
## [1] 5.828946
max(lexdec$RT)
## [1] 7.587311
sd(lexdec$RT)
## [1] 0.2415091
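Because the RT values are on a log scale, it can help to translate them back into raw response times. The sketch below does this for the numbers printed above. It assumes that RT holds natural-log response times in milliseconds, which is worth confirming on the ?lexdec help page. Note that exponentiating the mean of the logs gives the geometric mean, not the arithmetic mean, of the raw RTs.

```r
# Values copied from the output above (log scale)
log_rt <- c(mean = 6.38509, min = 5.828946, max = 7.587311)

# Back-transform; assumes natural logs of RTs in ms (an assumption, see ?lexdec)
rt_ms <- exp(log_rt)
print(round(rt_ms))
```

With the data loaded, the same thing could be done directly with exp(mean(lexdec$RT)).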
Let’s start with a simple model – see slides. We start by asking whether frequency has a linear effect on log RTs:
m = lm(RT ~ Frequency, data=lexdec)
summary(m)
##
## Call:
## lm(formula = RT ~ Frequency, data = lexdec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55407 -0.16153 -0.03494 0.11699 1.08768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.588778 0.022296 295.515 <2e-16 ***
## Frequency -0.042872 0.004533 -9.459 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2353 on 1657 degrees of freedom
## Multiple R-squared: 0.05123, Adjusted R-squared: 0.05066
## F-statistic: 89.47 on 1 and 1657 DF, p-value: < 2.2e-16
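To see what these coefficients mean, we can compute the model's predictions by hand. The sketch below plugs the estimates from the table above into the regression equation, intercept + slope * Frequency, for the lowest and highest Frequency values in the dataset (taken from the summary output). The variable names are made up for illustration.

```r
# Estimates copied from the summary(m) output above
intercept  <- 6.588778
slope_freq <- -0.042872

# Minimum and maximum of Frequency, from summary(lexdec)
freq <- c(low = 1.792, high = 7.772)

# Predicted log RT = intercept + slope * Frequency
pred_log_rt <- intercept + slope_freq * freq
print(pred_log_rt)

# With the fitted model object, the equivalent call would be:
# predict(m, newdata = data.frame(Frequency = freq))
```

The negative slope means that predicted log RT decreases as Frequency increases: more frequent words are responded to faster.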
Extend the simple model to include an additional predictor for morphological family size. If you don’t remember the name of the family size column in the data frame, use the names() function.
names(lexdec)
## [1] "Subject" "RT" "Trial" "Sex"
## [5] "NativeLanguage" "Correct" "PrevType" "PrevCorrect"
## [9] "Word" "Frequency" "FamilySize" "SynsetCount"
## [13] "Length" "Class" "FreqSingular" "FreqPlural"
## [17] "DerivEntropy" "Complex" "rInfl" "meanRT"
## [21] "SubjFreq" "meanSize" "meanWeight" "BNCw"
## [25] "BNCc" "BNCd" "BNCcRatio" "BNCdRatio"
m = lm(RT ~ Frequency + FamilySize, data=lexdec)
summary(m)
##
## Call:
## lm(formula = RT ~ Frequency + FamilySize, data = lexdec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55100 -0.16080 -0.03438 0.12044 1.09688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.563853 0.026826 244.685 < 2e-16 ***
## Frequency -0.035310 0.006407 -5.511 4.13e-08 ***
## FamilySize -0.015655 0.009380 -1.669 0.0953 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2352 on 1656 degrees of freedom
## Multiple R-squared: 0.05282, Adjusted R-squared: 0.05168
## F-statistic: 46.17 on 2 and 1656 DF, p-value: < 2.2e-16
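The columns of the coefficient table are related to each other: the t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability of that t value given the residual degrees of freedom. As a sketch, here is the calculation redone by hand with the (rounded) numbers from the table above, so small rounding differences are to be expected.

```r
# Estimates and standard errors copied from the summary(m) output above
estimate  <- c(Frequency = -0.035310, FamilySize = -0.015655)
std_error <- c(Frequency =  0.006407, FamilySize =  0.009380)

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value, using the residual degrees of freedom from the output
df_resid <- 1656
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(round(t_value, 3))
print(signif(p_value, 3))
```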
Extend the model to include a predictor for participants’ native language (English vs other). By default R dummy-codes categorical predictors: it assigns 0 and 1 to the levels of the predictor in alphabetical order, so the alphabetically first level (here English) is the reference level and is absorbed into the intercept. If you’re not sure how a predictor is coded (or if you want to change the default coding), you can use the contrasts() function.
m = lm(RT ~ Frequency + FamilySize + NativeLanguage, data=lexdec)
summary(m)
##
## Call:
## lm(formula = RT ~ Frequency + FamilySize + NativeLanguage, data = lexdec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64004 -0.14772 -0.02987 0.11492 1.05382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.497073 0.025784 251.977 < 2e-16 ***
## Frequency -0.035310 0.006054 -5.832 6.56e-09 ***
## FamilySize -0.015655 0.008863 -1.766 0.0775 .
## NativeLanguageOther 0.155821 0.011025 14.133 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2222 on 1655 degrees of freedom
## Multiple R-squared: 0.1548, Adjusted R-squared: 0.1533
## F-statistic: 101.1 on 3 and 1655 DF, p-value: < 2.2e-16
contrasts(lexdec$NativeLanguage)
## Other
## English 0
## Other 1
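To illustrate how the coding can be inspected and changed, here is a small sketch on a toy factor (the vector `lang` is made up for illustration and merely mimics the NativeLanguage column). It shows the default dummy coding, how relevel() switches the reference level, and how sum coding can be assigned via contr.sum().

```r
# Toy factor mimicking lexdec$NativeLanguage (illustrative data only)
lang <- factor(c("English", "Other", "Other", "English"))

# Default dummy (treatment) coding: English = 0 (reference), Other = 1
print(contrasts(lang))

# Make "Other" the reference level instead
lang_releveled <- relevel(lang, ref = "Other")
print(contrasts(lang_releveled))

# Switch to sum coding: English = 1, Other = -1
lang_sum <- lang
contrasts(lang_sum) <- contr.sum(2)
print(contrasts(lang_sum))
```

Changing the coding does not change the model's fit, only what the intercept and the coefficient for the factor mean.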
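Extend the model to include the interaction between frequency and native language. The two specifications below are equivalent: Frequency*NativeLanguage expands to both main effects plus their interaction.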
m = lm(RT ~ Frequency + FamilySize + NativeLanguage + Frequency:NativeLanguage, data=lexdec)
m = lm(RT ~ FamilySize + Frequency*NativeLanguage, data=lexdec)
summary(m)
##
## Call:
## lm(formula = RT ~ FamilySize + Frequency * NativeLanguage, data = lexdec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6693 -0.1492 -0.0280 0.1163 1.0679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.441135 0.031140 206.847 < 2e-16 ***
## FamilySize -0.015655 0.008839 -1.771 0.076726 .
## Frequency -0.023536 0.007079 -3.325 0.000905 ***
## NativeLanguageOther 0.286343 0.042432 6.748 2.06e-11 ***
## Frequency:NativeLanguageOther -0.027472 0.008626 -3.185 0.001475 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2216 on 1654 degrees of freedom
## Multiple R-squared: 0.16, Adjusted R-squared: 0.1579
## F-statistic: 78.75 on 4 and 1654 DF, p-value: < 2.2e-16
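With dummy coding, the coefficients of the interaction model combine as follows: the Frequency coefficient is the slope for the reference group (English), and the interaction coefficient is how much the slope differs for the other group. Likewise, NativeLanguageOther is the group difference at Frequency = 0, not at an average frequency. The sketch below works through this arithmetic with the estimates from the table above; the variable names are made up for illustration.

```r
# Estimates copied from the summary(m) output above
b_freq        <- -0.023536   # Frequency slope for English (reference level)
b_other       <-  0.286343   # Other vs. English at Frequency = 0
b_interaction <- -0.027472   # change in Frequency slope for Other

# Frequency slope within each group
slope_english <- b_freq
slope_other   <- b_freq + b_interaction
print(c(English = slope_english, Other = slope_other))

# Group difference (Other minus English) at the mean Frequency of 4.751
mean_freq <- 4.751
diff_at_mean <- b_other + b_interaction * mean_freq
print(diff_at_mean)
```

The group difference at the mean frequency comes out close to the NativeLanguageOther estimate from the model without the interaction, which is why that coefficient appeared to change so much once the interaction was added.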
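Simple effects analysis of the interaction: subtracting the Frequency main effect from the formula makes R estimate a separate frequency slope for each native-language group, instead of one reference slope plus a difference. The model fit is unchanged. The slope is more negative for non-native speakers (-0.051) than for native English speakers (-0.024).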
m = lm(RT ~ FamilySize + Frequency*NativeLanguage - Frequency, data=lexdec)
summary(m)
##
## Call:
## lm(formula = RT ~ FamilySize + Frequency * NativeLanguage - Frequency,
## data = lexdec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6693 -0.1492 -0.0280 0.1163 1.0679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.441135 0.031140 206.847 < 2e-16 ***
## FamilySize -0.015655 0.008839 -1.771 0.076726 .
## NativeLanguageOther 0.286343 0.042432 6.748 2.06e-11 ***
## Frequency:NativeLanguageEnglish -0.023536 0.007079 -3.325 0.000905 ***
## Frequency:NativeLanguageOther -0.051008 0.007794 -6.545 7.94e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2216 on 1654 degrees of freedom
## Multiple R-squared: 0.16, Adjusted R-squared: 0.1579
## F-statistic: 78.75 on 4 and 1654 DF, p-value: < 2.2e-16
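The interaction model and the simple effects model are two parameterizations of the same fit. The sketch below checks this on simulated data (the data frame `d`, its column names, and the true coefficients are all invented for illustration, and do not come from lexdec): the fitted values agree, and the group-specific slope from the simple effects model equals the sum of the main effect and interaction from the interaction model.

```r
set.seed(1)
n <- 200

# Simulated data with a group-specific frequency slope (illustrative only)
d <- data.frame(
  freq  = runif(n, min = 2, max = 8),
  group = factor(rep(c("English", "Other"), each = n / 2))
)
is_other <- as.numeric(d$group == "Other")
d$rt <- 6.4 - 0.02 * d$freq + is_other * (0.3 - 0.03 * d$freq) +
  rnorm(n, sd = 0.2)

# Interaction parameterization vs. simple effects parameterization
m_interaction <- lm(rt ~ freq * group, data = d)
m_simple      <- lm(rt ~ freq * group - freq, data = d)

# Same fitted values, so same residuals and R-squared
same_fit <- isTRUE(all.equal(fitted(m_interaction), fitted(m_simple)))
print(same_fit)

# Slope for "Other": main effect + interaction vs. direct estimate
slope_other_from_interaction <- unname(
  coef(m_interaction)["freq"] + coef(m_interaction)["freq:groupOther"]
)
slope_other_from_simple <- unname(coef(m_simple)["freq:groupOther"])
print(c(slope_other_from_interaction, slope_other_from_simple))
```

The practical advantage of the simple effects version is that each slope comes with its own standard error and test against zero.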