The english dataset in the languageR library contains “mean visual lexical decision latencies and word naming latencies to 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various predictor variables.” Use ?english and read the detailed information about this dataset.
library(languageR)
head(english)
## RTlexdec RTnaming Familiarity Word AgeSubject WordCategory
## 1 6.543754 6.145044 2.37 doe young N
## 2 6.397596 6.246882 4.43 whore young N
## 3 6.304942 6.143756 5.60 stress young N
## 4 6.424221 6.131878 3.87 pork young N
## 5 6.450597 6.198479 3.93 plug young N
## 6 6.531970 6.167726 3.27 prop young N
## WrittenFrequency WrittenSpokenFrequencyRatio FamilySize
## 1 3.912023 1.0216512 1.386294
## 2 4.521789 0.3504830 1.386294
## 3 6.505784 2.0893560 1.609438
## 4 5.017280 -0.5263339 1.945910
## 5 4.890349 -1.0445451 2.197225
## 6 4.770685 0.9248014 1.386294
## DerivationalEntropy InflectionalEntropy NumberSimplexSynsets
## 1 0.14144 0.02114 0.6931472
## 2 0.42706 0.94198 1.0986123
## 3 0.06197 1.44339 2.4849066
## 4 0.43035 0.00000 1.0986123
## 5 0.35920 1.75393 2.4849066
## 6 0.06268 1.74730 1.6094379
## NumberComplexSynsets LengthInLetters Ncount MeanBigramFrequency
## 1 0.000000 3 8 7.036333
## 2 0.000000 5 5 9.537878
## 3 1.945910 6 0 9.883931
## 4 2.639057 4 8 8.309180
## 5 2.484907 4 3 7.943717
## 6 1.386294 4 9 8.349620
## FrequencyInitialDiphone ConspelV ConspelN ConphonV ConphonN ConfriendsV
## 1 12.02268 10 3.737670 41 8.837826 8
## 2 12.59780 20 7.870930 38 9.775825 20
## 3 13.30069 10 6.693324 13 7.040536 10
## 4 12.07807 5 6.677083 6 3.828641 4
## 5 11.92678 17 4.762174 17 4.762174 17
## 6 12.19724 19 6.234411 21 6.249975 19
## ConfriendsN ConffV ConffN ConfbV ConfbN NounFrequency
## 1 3.295837 0.6931472 2.708050 3.496508 8.833900 49
## 2 7.870930 0.0000000 0.000000 2.944439 9.614738 142
## 3 6.693324 0.0000000 0.000000 1.386294 5.817111 565
## 4 3.526361 0.6931472 6.634633 1.098612 2.564949 150
## 5 4.762174 0.0000000 0.000000 0.000000 0.000000 170
## 6 6.234411 0.0000000 0.000000 1.098612 2.197225 125
## VerbFrequency CV Obstruent Frication Voice
## 1 0 C obst burst voiced
## 2 0 C obst frication voiceless
## 3 473 C obst frication voiceless
## 4 0 C obst burst voiceless
## 5 120 C obst burst voiceless
## 6 280 C obst burst voiceless
## FrequencyInitialDiphoneWord FrequencyInitialDiphoneSyllable
## 1 10.129308 10.409763
## 2 9.054388 9.148252
## 3 12.422026 13.127395
## 4 10.048151 11.003649
## 5 11.796336 12.163092
## 6 11.991567 12.436772
## CorrectLexdec
## 1 27
## 2 30
## 3 30
## 4 30
## 5 26
## 6 28
english[1,]
## RTlexdec RTnaming Familiarity Word AgeSubject WordCategory
## 1 6.543754 6.145044 2.37 doe young N
## WrittenFrequency WrittenSpokenFrequencyRatio FamilySize
## 1 3.912023 1.021651 1.386294
## DerivationalEntropy InflectionalEntropy NumberSimplexSynsets
## 1 0.14144 0.02114 0.6931472
## NumberComplexSynsets LengthInLetters Ncount MeanBigramFrequency
## 1 0 3 8 7.036333
## FrequencyInitialDiphone ConspelV ConspelN ConphonV ConphonN ConfriendsV
## 1 12.02268 10 3.73767 41 8.837826 8
## ConfriendsN ConffV ConffN ConfbV ConfbN NounFrequency
## 1 3.295837 0.6931472 2.70805 3.496508 8.8339 49
## VerbFrequency CV Obstruent Frication Voice FrequencyInitialDiphoneWord
## 1 0 C obst burst voiced 10.12931
## FrequencyInitialDiphoneSyllable CorrectLexdec
## 1 10.40976 27
The first row, indexed by english[1,], tells us how long, in log milliseconds, participants in the “young” category took on average to decide whether “doe” (word[1,]$Word) is a word. The average was \(e^{6.5437536} = 694.89\) ms. The dataset also includes lots of interesting additional information, for example that english[1,]$WrittenFrequency - the frequency of “doe” in the CELEX lexical database - is \(e^{3.912023} = 50\) words per million.
cor(english$RTlexdec, english$WrittenFrequency)
## [1] -0.434815
density()) and use curve(dnorm(x, mean = ..., sd= ...), add=TRUE) to overlay a normal approximation computed using the sample mean and standard deviation.plot(density(english$RTlexdec), main="density of log lexical decision time", col="black")
curve(dnorm(x, mean=mean(english$RTlexdec), sd=sd(english$RTlexdec)), add=TRUE, lty=2, col="black")
legend("topright",
legend=c("density", "normal approximation"),
lty=c(1,2),
col=c("black", "black"))
We can see that although the normal approximation is rather close to the density plot, the density seems to be bimodal, which is not captured by the normal approximation.
The variable representing the age of the participants from whom the observations are gathered comes in two values, old and young. We can see this by typing levels(english$AgeSubject). We’ll divide the data into two parts, old and young.
levels(english$AgeSubject)
## [1] "old" "young"
old = subset(english, AgeSubject == 'old')
young = subset(english, AgeSubject == 'young')
plot(density(english$RTlexdec), main="densities of log lexical decision time ", col="black", ylim=c(0,4))
curve(dnorm(x, mean=mean(english$RTlexdec), sd=sd(english$RTlexdec)), add=TRUE, lty=2, col="black")
lines(density(old$RTlexdec), main="density of log lexical decision time", col="blue")
curve(dnorm(x, mean=mean(old$RTlexdec), sd=sd(old$RTlexdec)), add=TRUE, lty=2, col="blue")
lines(density(young$RTlexdec), col="orange")
curve(dnorm(x, mean=mean(young$RTlexdec), sd=sd(young$RTlexdec)), add=TRUE, lty=2, col="orange")
legend("topright",
legend=c("density:all", "normal approx:all",
"density:old", "normal approx:old",
"density:young", "normal approx:young"),
lty=c(1,2,1,2,1,2),
col=c("black", "black", "blue", "blue","red", "red"))
We can see that for each subset, the density seems to be right-skewed, in contrast with its symmetric normal approximation (whose mode is to the right of that of the density curve). But at least the density of each subset has only one mode, and we can see that although the normal approximation of each subset has its mode to the right of that of the density curve, the model of the normal approximation falls between the two modes in the density of the entire dataset, making the normal approximation appear to fit the density relatively closely.
The order of words is the same in old and young. (Try typing old$Word == young$Word to verify this.) So, we can find the difference between log mean lexical decision time in the “old” and “young” groups by simply subtracting one from the other:
rt.diff = old$RTlexdec - young$RTlexdec
rt.diff? Plot its kernel density and overlap a normal approximation. How good is the normal approximation?mean(rt.diff)
## [1] 0.2217215
sd(rt.diff)
## [1] 0.091446
plot(density(rt.diff), lty=1, main="difference in logRT between old and young")
curve(dnorm(x, mean=mean(rt.diff), sd=sd(rt.diff)), add=TRUE, lty=2)
legend("topright",
legend=c("density:all", "normal approx:all"),
lty=c(1,2),
col=c("black", "black"))
We can see that the fit of the normal approximation is quite good, although the density seems slightly more centered to the mean than the normal approximation.
resample.meandiff <- function(i){
return(mean(sample(rt.diff, size=length(rt.diff), replace=TRUE)))
}
bootstrap.diff.samples <- sapply(1:1000, resample.meandiff)
bootstrap.ci <- quantile(bootstrap.diff.samples, prob=c(.025, .975))
bootstrap.ci
## 2.5% 97.5%
## 0.2178804 0.2253753
and (b) a \(t\)-test. (Use ?t.test to get information on the latter.)
t.test(rt.diff)$conf.int
## [1] 0.2179692 0.2254737
## attr(,"conf.level")
## [1] 0.95
We can see that the two confidence intervals are very close.
In general, when the distribution is not approximately normal, bootstrapping, being a non-parametric method, should be preferred.
t.test(rt.diff)
##
## One Sample t-test
##
## data: rt.diff
## t = 115.8753, df = 2283, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.2179692 0.2254737
## sample estimates:
## mean of x
## 0.2217215
We can see that for the \(t\)-test, the \(p\)-value is less than 2.2e-16, which is very small. In fact, it is so small that it will display 0 when we directly access it.
t.test(rt.diff)$p.value
## [1] 0
This means that it is statistically significant that the difference is not \(0\). However, the \(p\)-value says nothing about how much (i.e., the effect size) the difference is different from \(0\). We can see that the difference is estimated to be
t.test(rt.diff)$estimate
## mean of x
## 0.2217215
Whether a difference of such a size is big enough to be of any interest (“significant” in common usage) is independent of the \(p\)-value.
The \(p\)-value is the smallest (non-negative) number such that the \((1-p)\)-confidence interval of the mean difference constructed from the observed data does not contain \(0\) (the null hypothesis).
The method to construct a Frequentist \((1-\alpha)\)-confidence interval, by definition, ensures that if we were to sample again and again from the population data of the same size and apply the method to construct confidence intervals, \((1-\alpha)\) of the time the CI will contain the true population mean difference, which under the null hypothesis is \(0\). This means that, under the null hypothesis that the true population mean difference is \(0\), that the \((1-\alpha)\)-confidence interval constructed from the observed data does not contain \(0\) should happen only \(\alpha\) of the time. Therefore if we reject the null hypothesis on the basis that the \((1-\alpha)\)-confidence interval does not contain \(0\), only \(\alpha\) of the time will we be wrong (i.e., make a type I error).
Now, \(p\)-value is by definition the smallest \(\alpha\) for which the null hypothesis will be rejected, and we know from above that the null hypothesis is rejected when the \((1-\alpha)\)-confidence interval constructed from the observed data does not contain \(0\). Therefore, \(p\)-value is the smallest number such that the \((1-p)\)-confidence interval of the mean difference constructed from the observed data does not contain \(0\).
Calculating the \(p\)-value from bootstrapping is tricky, since we have to somehow use the empirical distribution to simulate the population under the null hypothesis that the mean difference is 0. One way is to first subtract each of the observed data by the sample mean, then do bootstrapping, and then calculate the \(p\)-value as the probability of observed data or more extreme ones. However, since we substracted a constant from each of the observed data, by linear transformation the above is equivalent to the probability of observing mean difference more extreme than twice the sample mean (under the null hypothesis that the population mean is the sample mean). We can calculate it as below
mean(bootstrap.diff.samples<=0 | bootstrap.diff.samples>=2*mean(rt.diff))
## [1] 0
Alternatively, we can use the conclusion that \(p\)-value is the smallest (non-negative) number such that the \((1-p)\)-confidence interval of the mean difference constructed from the observed data does not contain \(0\). We can see below that when \(p=0\), the \(100\%\) bootstrap confidence interval (i.e. the whole range) still does not contain \(0\).
quantile(bootstrap.diff.samples, c(0,1))
## 0% 100%
## 0.2135012 0.2290387
The problem with this alternative method is that the bootstrap confidence interval is not really a confidence interval in the strict sense (e.g., a strict \(100\%\) CI should contain all possible values, which clearly is not the case in the above example), so the estimated \(p\)-value will not be very accurate. (I think the first method has similar problems.)