The english dataset in the languageR library contains “mean visual lexical decision latencies and word naming latencies to 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various predictor variables.” Use ?english and read the detailed information about this dataset.

library(languageR)
head(english)
##   RTlexdec RTnaming Familiarity   Word AgeSubject WordCategory
## 1 6.543754 6.145044        2.37    doe      young            N
## 2 6.397596 6.246882        4.43  whore      young            N
## 3 6.304942 6.143756        5.60 stress      young            N
## 4 6.424221 6.131878        3.87   pork      young            N
## 5 6.450597 6.198479        3.93   plug      young            N
## 6 6.531970 6.167726        3.27   prop      young            N
##   WrittenFrequency WrittenSpokenFrequencyRatio FamilySize
## 1         3.912023                   1.0216512   1.386294
## 2         4.521789                   0.3504830   1.386294
## 3         6.505784                   2.0893560   1.609438
## 4         5.017280                  -0.5263339   1.945910
## 5         4.890349                  -1.0445451   2.197225
## 6         4.770685                   0.9248014   1.386294
##   DerivationalEntropy InflectionalEntropy NumberSimplexSynsets
## 1             0.14144             0.02114            0.6931472
## 2             0.42706             0.94198            1.0986123
## 3             0.06197             1.44339            2.4849066
## 4             0.43035             0.00000            1.0986123
## 5             0.35920             1.75393            2.4849066
## 6             0.06268             1.74730            1.6094379
##   NumberComplexSynsets LengthInLetters Ncount MeanBigramFrequency
## 1             0.000000               3      8            7.036333
## 2             0.000000               5      5            9.537878
## 3             1.945910               6      0            9.883931
## 4             2.639057               4      8            8.309180
## 5             2.484907               4      3            7.943717
## 6             1.386294               4      9            8.349620
##   FrequencyInitialDiphone ConspelV ConspelN ConphonV ConphonN ConfriendsV
## 1                12.02268       10 3.737670       41 8.837826           8
## 2                12.59780       20 7.870930       38 9.775825          20
## 3                13.30069       10 6.693324       13 7.040536          10
## 4                12.07807        5 6.677083        6 3.828641           4
## 5                11.92678       17 4.762174       17 4.762174          17
## 6                12.19724       19 6.234411       21 6.249975          19
##   ConfriendsN    ConffV   ConffN   ConfbV   ConfbN NounFrequency
## 1    3.295837 0.6931472 2.708050 3.496508 8.833900            49
## 2    7.870930 0.0000000 0.000000 2.944439 9.614738           142
## 3    6.693324 0.0000000 0.000000 1.386294 5.817111           565
## 4    3.526361 0.6931472 6.634633 1.098612 2.564949           150
## 5    4.762174 0.0000000 0.000000 0.000000 0.000000           170
## 6    6.234411 0.0000000 0.000000 1.098612 2.197225           125
##   VerbFrequency CV Obstruent Frication     Voice
## 1             0  C      obst     burst    voiced
## 2             0  C      obst frication voiceless
## 3           473  C      obst frication voiceless
## 4             0  C      obst     burst voiceless
## 5           120  C      obst     burst voiceless
## 6           280  C      obst     burst voiceless
##   FrequencyInitialDiphoneWord FrequencyInitialDiphoneSyllable
## 1                   10.129308                       10.409763
## 2                    9.054388                        9.148252
## 3                   12.422026                       13.127395
## 4                   10.048151                       11.003649
## 5                   11.796336                       12.163092
## 6                   11.991567                       12.436772
##   CorrectLexdec
## 1            27
## 2            30
## 3            30
## 4            30
## 5            26
## 6            28
english[1,]
##   RTlexdec RTnaming Familiarity Word AgeSubject WordCategory
## 1 6.543754 6.145044        2.37  doe      young            N
##   WrittenFrequency WrittenSpokenFrequencyRatio FamilySize
## 1         3.912023                    1.021651   1.386294
##   DerivationalEntropy InflectionalEntropy NumberSimplexSynsets
## 1             0.14144             0.02114            0.6931472
##   NumberComplexSynsets LengthInLetters Ncount MeanBigramFrequency
## 1                    0               3      8            7.036333
##   FrequencyInitialDiphone ConspelV ConspelN ConphonV ConphonN ConfriendsV
## 1                12.02268       10  3.73767       41 8.837826           8
##   ConfriendsN    ConffV  ConffN   ConfbV ConfbN NounFrequency
## 1    3.295837 0.6931472 2.70805 3.496508 8.8339            49
##   VerbFrequency CV Obstruent Frication  Voice FrequencyInitialDiphoneWord
## 1             0  C      obst     burst voiced                    10.12931
##   FrequencyInitialDiphoneSyllable CorrectLexdec
## 1                        10.40976            27

The first row, indexed by english[1,], tells us how long, in log milliseconds, participants in the “young” category took on average to decide whether “doe” (word[1,]$Word) is a word. The average was \(e^{6.5437536} = 694.89\) ms. The dataset also includes lots of interesting additional information, for example that english[1,]$WrittenFrequency - the frequency of “doe” in the CELEX lexical database - is \(e^{3.912023} = 50\) words per million.

  1. What is the correlation between log lexical decision time and log frequency-per-million-in-CELEX?
cor(english$RTlexdec, english$WrittenFrequency)
## [1] -0.434815
  1. Is log lexical decision time approximately normally distributed? To help decide, make a density plot (density()) and use curve(dnorm(x, mean = ..., sd= ...), add=TRUE) to overlay a normal approximation computed using the sample mean and standard deviation.
plot(density(english$RTlexdec), main="density of log lexical decision time", col="black")
curve(dnorm(x, mean=mean(english$RTlexdec), sd=sd(english$RTlexdec)), add=TRUE, lty=2, col="black")
legend("topright", 
       legend=c("density", "normal approximation"),
       lty=c(1,2),
       col=c("black", "black"))

We can see that although the normal approximation is rather close to the density plot, the density seems to be bimodal, which is not captured by the normal approximation.

The variable representing the age of the participants from whom the observations are gathered comes in two values, old and young. We can see this by typing levels(english$AgeSubject). We’ll divide the data into two parts, old and young.

levels(english$AgeSubject)
## [1] "old"   "young"
old = subset(english, AgeSubject == 'old')
young = subset(english, AgeSubject == 'young')
  1. Are log lexical decision times approximately normally distributed for each subset? Impressionistically, how does the approximation compare for each subset to that of the combined data set?
plot(density(english$RTlexdec), main="densities of log lexical decision time ", col="black", ylim=c(0,4))
curve(dnorm(x, mean=mean(english$RTlexdec), sd=sd(english$RTlexdec)), add=TRUE, lty=2, col="black")
lines(density(old$RTlexdec), main="density of log lexical decision time", col="blue")
curve(dnorm(x, mean=mean(old$RTlexdec), sd=sd(old$RTlexdec)), add=TRUE, lty=2, col="blue")
lines(density(young$RTlexdec),  col="orange")
curve(dnorm(x, mean=mean(young$RTlexdec), sd=sd(young$RTlexdec)), add=TRUE, lty=2, col="orange")

legend("topright", 
       legend=c("density:all", "normal approx:all", 
                "density:old", "normal approx:old", 
                "density:young", "normal approx:young"),
       lty=c(1,2,1,2,1,2),
       col=c("black", "black", "blue", "blue","red", "red"))

We can see that for each subset, the density seems to be right-skewed, in contrast with its symmetric normal approximation (whose mode is to the right of that of the density curve). But at least the density of each subset has only one mode, and we can see that although the normal approximation of each subset has its mode to the right of that of the density curve, the model of the normal approximation falls between the two modes in the density of the entire dataset, making the normal approximation appear to fit the density relatively closely.

The order of words is the same in old and young. (Try typing old$Word == young$Word to verify this.) So, we can find the difference between log mean lexical decision time in the “old” and “young” groups by simply subtracting one from the other:

rt.diff = old$RTlexdec - young$RTlexdec
  1. What are the sample mean and standard deviation of rt.diff? Plot its kernel density and overlap a normal approximation. How good is the normal approximation?
mean(rt.diff)
## [1] 0.2217215
sd(rt.diff)
## [1] 0.091446
plot(density(rt.diff), lty=1, main="difference in logRT between old and young")
curve(dnorm(x, mean=mean(rt.diff), sd=sd(rt.diff)), add=TRUE, lty=2)
legend("topright", 
       legend=c("density:all", "normal approx:all"),
       lty=c(1,2),
       col=c("black", "black"))

We can see that the fit of the normal approximation is quite good, although the density seems slightly more centered to the mean than the normal approximation.

  1. Compute a 95% frequentist confidence interval for the difference in mean RTs between the two groups using (a) bootstrapping,
resample.meandiff <- function(i){
  return(mean(sample(rt.diff, size=length(rt.diff), replace=TRUE)))
}

bootstrap.diff.samples <- sapply(1:1000, resample.meandiff)
bootstrap.ci <- quantile(bootstrap.diff.samples, prob=c(.025, .975))
bootstrap.ci
##      2.5%     97.5% 
## 0.2178804 0.2253753

and (b) a \(t\)-test. (Use ?t.test to get information on the latter.)

t.test(rt.diff)$conf.int
## [1] 0.2179692 0.2254737
## attr(,"conf.level")
## [1] 0.95
  1. How do the two confidence intervals compare? Suppose that the distribution were not approximately normal - which should you prefer?

We can see that the two confidence intervals are very close.

In general, when the distribution is not approximately normal, bootstrapping, being a non-parametric method, should be preferred.

  1. What is the \(p\)-value of the difference? Is the difference “statistically significant”? In plain English, what does this mean, in terms of the confidence intervals and the meaning of frequentist CIs?
t.test(rt.diff)
## 
##  One Sample t-test
## 
## data:  rt.diff
## t = 115.8753, df = 2283, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.2179692 0.2254737
## sample estimates:
## mean of x 
## 0.2217215

We can see that for the \(t\)-test, the \(p\)-value is less than 2.2e-16, which is very small. In fact, it is so small that it will display 0 when we directly access it.

t.test(rt.diff)$p.value
## [1] 0

This means that it is statistically significant that the difference is not \(0\). However, the \(p\)-value says nothing about how much (i.e., the effect size) the difference is different from \(0\). We can see that the difference is estimated to be

t.test(rt.diff)$estimate
## mean of x 
## 0.2217215

Whether a difference of such a size is big enough to be of any interest (“significant” in common usage) is independent of the \(p\)-value.

The \(p\)-value is the smallest (non-negative) number such that the \((1-p)\)-confidence interval of the mean difference constructed from the observed data does not contain \(0\) (the null hypothesis).

The method to construct a Frequentist \((1-\alpha)\)-confidence interval, by definition, ensures that if we were to sample again and again from the population data of the same size and apply the method to construct confidence intervals, \((1-\alpha)\) of the time the CI will contain the true population mean difference, which under the null hypothesis is \(0\). This means that, under the null hypothesis that the true population mean difference is \(0\), that the \((1-\alpha)\)-confidence interval constructed from the observed data does not contain \(0\) should happen only \(\alpha\) of the time. Therefore if we reject the null hypothesis on the basis that the \((1-\alpha)\)-confidence interval does not contain \(0\), only \(\alpha\) of the time will we be wrong (i.e., make a type I error).

Now, \(p\)-value is by definition the smallest \(\alpha\) for which the null hypothesis will be rejected, and we know from above that the null hypothesis is rejected when the \((1-\alpha)\)-confidence interval constructed from the observed data does not contain \(0\). Therefore, \(p\)-value is the smallest number such that the \((1-p)\)-confidence interval of the mean difference constructed from the observed data does not contain \(0\).

Calculating the \(p\)-value from bootstrapping is tricky, since we have to somehow use the empirical distribution to simulate the population under the null hypothesis that the mean difference is 0. One way is to first subtract each of the observed data by the sample mean, then do bootstrapping, and then calculate the \(p\)-value as the probability of observed data or more extreme ones. However, since we substracted a constant from each of the observed data, by linear transformation the above is equivalent to the probability of observing mean difference more extreme than twice the sample mean (under the null hypothesis that the population mean is the sample mean). We can calculate it as below

mean(bootstrap.diff.samples<=0 | bootstrap.diff.samples>=2*mean(rt.diff))
## [1] 0

Alternatively, we can use the conclusion that \(p\)-value is the smallest (non-negative) number such that the \((1-p)\)-confidence interval of the mean difference constructed from the observed data does not contain \(0\). We can see below that when \(p=0\), the \(100\%\) bootstrap confidence interval (i.e. the whole range) still does not contain \(0\).

quantile(bootstrap.diff.samples, c(0,1))
##        0%      100% 
## 0.2135012 0.2290387

The problem with this alternative method is that the bootstrap confidence interval is not really a confidence interval in the strict sense (e.g., a strict \(100\%\) CI should contain all possible values, which clearly is not the case in the above example), so the estimated \(p\)-value will not be very accurate. (I think the first method has similar problems.)