Analyzing Reaction Times

Back to our favorite data set: lexical decision times.

To refresh your memory from last time, here is our description: You will read in some data from a lexical decision time task for English. You can read about it here, on pages 30-31, for the data set called “english.” (This documentation is the standard documentation for R packages, so it is good to get familiar with reading it.)

That data is saved in a csv called “english.csv” Below, I read in the data for you, as a data frame called d.

Here is how a lexical decision task works. In a lexical decision task, a subject is presented with a word (like “dog”) or a plausible non-word (like “florp”) and asked to judge as quickly as possible whether or not what they saw is a word.

How fast people can make a decision reflects something about the psychological response of the subject to the word in question. In this part of the problem set, you will find out what sorts of things are predictive of lexical decision time.

Below, we read in the data set and make a column called “RT,” which is the average time in milliseconds that it took participants to respond to a given word.

d = read_csv("../pset1/english.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   AgeSubject = col_character(),
##   WordCategory = col_character(),
##   CV = col_character(),
##   Obstruent = col_character(),
##   Frication = col_character(),
##   Voice = col_character()
## )

## See spec(...) for full column specifications.

d$RT = exp(d$RTlexdec)
d$WrittenCount = exp(d$WrittenFrequency)
head(d)

## # A tibble: 6 x 38
##   RTlexdec RTnaming Familiarity Word  AgeSubject WordCategory WrittenFrequency
##      <dbl>    <dbl>       <dbl> <chr> <chr>      <chr>                   <dbl>
## 1     6.54     6.15        2.37 doe   young      N                        3.91
## 2     6.30     6.14        5.6  stre… young      N                        6.51
## 3     6.42     6.13        3.87 pork  young      N                        5.02
## 4     6.45     6.20        3.93 plug  young      N                        4.89
## 5     6.53     6.17        3.27 prop  young      N                        4.77
## 6     6.37     6.12        3.73 dawn  young      N                        6.38
## # … with 31 more variables: WrittenSpokenFrequencyRatio <dbl>,
## #   FamilySize <dbl>, DerivationalEntropy <dbl>, InflectionalEntropy <dbl>,
## #   NumberSimplexSynsets <dbl>, NumberComplexSynsets <dbl>,
## #   LengthInLetters <dbl>, Ncount <dbl>, MeanBigramFrequency <dbl>,
## #   FrequencyInitialDiphone <dbl>, ConspelV <dbl>, ConspelN <dbl>,
## #   ConphonV <dbl>, ConphonN <dbl>, ConfriendsV <dbl>, ConfriendsN <dbl>,
## #   ConffV <dbl>, ConffN <dbl>, ConfbV <dbl>, ConfbN <dbl>,
## #   NounFrequency <dbl>, VerbFrequency <dbl>, CV <chr>, Obstruent <chr>,
## #   Frication <chr>, Voice <chr>, FrequencyInitialDiphoneWord <dbl>,
## #   FrequencyInitialDiphoneSyllable <dbl>, CorrectLexdec <dbl>, RT <dbl>,
## #   WrittenCount <dbl>

1. Means and medians

Remember: the RT column tells you the average lexical decision time in milliseconds.

Report the mean, median, max, and min for RT.

Mean is 708.1572, median is 699.59, maximum is 1323.2, minimum is 495.38

mean(d$RT) # Mean

## [1] 708.1572

median(d$RT) # Median

## [1] 699.59

max(d$RT) # Maximum

## [1] 1323.2

min(d$RT) # Minimum

## [1] 495.38

2 WrittenCount Statistics

Report the mean, median, max, and min for the column WrittenCount. This is the number of times that a particular word occurs in a corpus. For instance, if the value is 10, that means the word occurred 10 times occurred in the corpus.

Is the mean or median higher?

The mean is 975.4 and the median is 126. The mean is higher than the median.

mean(d$WrittenCount) # Mean

## [1] 975.4088

median(d$WrittenCount) # Median

## [1] 126

max(d$WrittenCount) # Maximum

## [1] 85533

min(d$WrittenCount) # Minimum

## [1] 1

3 Written Count Mean

Make a histogram (with breaks=200) of the variable WrittenCount. Describe its shape.

The bars at the very left (close to 0) are very high. Then, the height of the bars decrease significantly as the value of WrittenCount increases. There’s a really long tail. As WrittenCount increase more, most bars have a height that’s close to 0.

hist(d$WrittenCount, breaks=200)

Data with this shape will always have the mean higher than the median. Using your Answer to #2, discuss why this might be the case.

The median is the halfway point of the data. 50% of the data is higher than the median, and 50% of the data is lower than the median. In this data, most words have a WrittenCount of 1. This means that a large proportion of the data has a very low value. This lowers the median. On the other hand, there are words with extremely high WrittenCount, but the number of such words is very low. These data points are rare but have high values for WrittenCount. This highers the mean. The mean is like a balance point, and it is easily affected by extreme values. Thus, the mean is very high, much higher than the median.

4 Familiarity

What you did in Questions 2-3 was characterize the distribution of the variable WrittenCount. Now do the same for Familarity by reporting the max, min, median, and mean and by making a histogram (set breaks=30).

mean(d$Familiarity) # Mean

## [1] 3.795443

median(d$Familiarity) # Median

## [1] 3.7

max(d$Familiarity) # Maximum

## [1] 6.97

min(d$Familiarity) # Minimum

## [1] 1.1

hist(d$Familiarity, breaks=30)

How do the mean and median compare? Are the mean and the median closer to each other or farther apart than the mean and median for WrittenCount?

The mean for Familiarity is 3.795 and the median is 3.7. They are very close to each other.

5 Is WrittenCount normally distributed?

5a.

First, get the mean and standard deviation (using the function sd()) for the variable WrittenCount. Let’s call them WrittenCount.mean and WrittenCount.sd, respectively. Using that mean and standard deviation, generate a sample of 4000 numbers from a normal distribution that has mean WrittenCount.mean and standard deviation WrittenCount.sd.

Make a histogram of this sample. Then, sample again and make a histogram again. Do this 4 times, so that you have 4 histograms. (HINT: you might have to use print() around your plot to make it appear.)

WrittenCount.mean = mean(d$WrittenCount)
WrittenCount.sd = sd(d$WrittenCount)

s1 <- rnorm(4000, mean = WrittenCount.mean, sd = WrittenCount.sd)
s2 <- rnorm(4000, mean = WrittenCount.mean, sd = WrittenCount.sd)
s3 <- rnorm(4000, mean = WrittenCount.mean, sd = WrittenCount.sd)
s4 <- rnorm(4000, mean = WrittenCount.mean, sd = WrittenCount.sd)

print(hist(s1))

## $breaks
##  [1] -16000 -14000 -12000 -10000  -8000  -6000  -4000  -2000      0   2000
## [11]   4000   6000   8000  10000  12000  14000  16000
## 
## $counts
##  [1]   1   2   6  29 104 277 479 721 796 734 444 267 103  26   9   2
## 
## $density
##  [1] 1.2500e-07 2.5000e-07 7.5000e-07 3.6250e-06 1.3000e-05 3.4625e-05
##  [7] 5.9875e-05 9.0125e-05 9.9500e-05 9.1750e-05 5.5500e-05 3.3375e-05
## [13] 1.2875e-05 3.2500e-06 1.1250e-06 2.5000e-07
## 
## $mids
##  [1] -15000 -13000 -11000  -9000  -7000  -5000  -3000  -1000   1000   3000
## [11]   5000   7000   9000  11000  13000  15000
## 
## $xname
## [1] "s1"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

print(hist(s2))

## $breaks
##  [1] -18000 -16000 -14000 -12000 -10000  -8000  -6000  -4000  -2000      0
## [11]   2000   4000   6000   8000  10000  12000  14000
## 
## $counts
##  [1]   1   0   1   8  37 108 269 500 772 772 681 462 240 109  31   9
## 
## $density
##  [1] 1.2500e-07 0.0000e+00 1.2500e-07 1.0000e-06 4.6250e-06 1.3500e-05
##  [7] 3.3625e-05 6.2500e-05 9.6500e-05 9.6500e-05 8.5125e-05 5.7750e-05
## [13] 3.0000e-05 1.3625e-05 3.8750e-06 1.1250e-06
## 
## $mids
##  [1] -17000 -15000 -13000 -11000  -9000  -7000  -5000  -3000  -1000   1000
## [11]   3000   5000   7000   9000  11000  13000
## 
## $xname
## [1] "s2"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

print(hist(s3))

## $breaks
##  [1] -14000 -12000 -10000  -8000  -6000  -4000  -2000      0   2000   4000
## [11]   6000   8000  10000  12000  14000  16000
## 
## $counts
##  [1]   2   4  40 104 266 492 725 802 681 491 244 103  32  10   4
## 
## $density
##  [1] 2.5000e-07 5.0000e-07 5.0000e-06 1.3000e-05 3.3250e-05 6.1500e-05
##  [7] 9.0625e-05 1.0025e-04 8.5125e-05 6.1375e-05 3.0500e-05 1.2875e-05
## [13] 4.0000e-06 1.2500e-06 5.0000e-07
## 
## $mids
##  [1] -13000 -11000  -9000  -7000  -5000  -3000  -1000   1000   3000   5000
## [11]   7000   9000  11000  13000  15000
## 
## $xname
## [1] "s3"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

print(hist(s4))

## $breaks
##  [1] -12000 -10000  -8000  -6000  -4000  -2000      0   2000   4000   6000
## [11]   8000  10000  12000  14000  16000  18000
## 
## $counts
##  [1]  13  21 124 271 480 737 784 686 490 250  87  42  13   1   1
## 
## $density
##  [1] 1.6250e-06 2.6250e-06 1.5500e-05 3.3875e-05 6.0000e-05 9.2125e-05
##  [7] 9.8000e-05 8.5750e-05 6.1250e-05 3.1250e-05 1.0875e-05 5.2500e-06
## [13] 1.6250e-06 1.2500e-07 1.2500e-07
## 
## $mids
##  [1] -11000  -9000  -7000  -5000  -3000  -1000   1000   3000   5000   7000
## [11]   9000  11000  13000  15000  17000
## 
## $xname
## [1] "s4"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

5b.

Now make a histogram of WrittenCount itself.

hist(d$WrittenCount)

Compare the histogram in 5b to the histograms in part 5a. Remember: you generated the normal samples in 5a using the same mean and the same standard deviation as WrittenCount. Why do the plots look different?

The plots look very different. The normal samples in 5a are approximately normally distributed. However, the Written Count is not normally distributed. Even though the plots are made with the same mean and standard deviation, they don’t have share the same/similar distribution.

6 Is Familarity normally distributed?

Repeat all parts of 5a and 5b, but using Familiarity instead of WrittenCount.

3 points Extra Credit: Turn the code in 5a and 5b into a function (or a couple functions) that tak “WrittenCount” as an argument. Then, for Q6 and Q7, you can just call those function(s).

# Function defined below
whether_normal_distribution <- function(values) {
  values.mean = mean(values)
  values.sd = sd(values)

  s1 <- rnorm(4000, mean = values.mean, sd = values.sd)
  s2 <- rnorm(4000, mean = values.mean, sd = values.sd)
  s3 <- rnorm(4000, mean = values.mean, sd = values.sd)
  s4 <- rnorm(4000, mean = values.mean, sd = values.sd)

  print(hist(s1))
  print(hist(s2))
  print(hist(s3))
  print(hist(s4))
  hist(values)
}

# For question 6
whether_normal_distribution(d$Familiarity)

## $breaks
##  [1] -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0
## [16]  6.5  7.0  7.5  8.0
## 
## $counts
##  [1]   1   2  10  27  61 168 284 443 646 664 632 482 316 148  76  32   7   1
## 
## $density
##  [1] 0.0005 0.0010 0.0050 0.0135 0.0305 0.0840 0.1420 0.2215 0.3230 0.3320
## [11] 0.3160 0.2410 0.1580 0.0740 0.0380 0.0160 0.0035 0.0005
## 
## $mids
##  [1] -0.75 -0.25  0.25  0.75  1.25  1.75  2.25  2.75  3.25  3.75  4.25  4.75
## [13]  5.25  5.75  6.25  6.75  7.25  7.75
## 
## $xname
## [1] "s1"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1] -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5
## [16]  7.0  7.5  8.0
## 
## $counts
##  [1]   1   5  23  76 153 275 449 611 692 640 483 310 180  72  24   4   2
## 
## $density
##  [1] 0.0005 0.0025 0.0115 0.0380 0.0765 0.1375 0.2245 0.3055 0.3460 0.3200
## [11] 0.2415 0.1550 0.0900 0.0360 0.0120 0.0020 0.0010
## 
## $mids
##  [1] -0.25  0.25  0.75  1.25  1.75  2.25  2.75  3.25  3.75  4.25  4.75  5.25
## [13]  5.75  6.25  6.75  7.25  7.75
## 
## $xname
## [1] "s2"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1] -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0
## [16]  6.5  7.0  7.5  8.0
## 
## $counts
##  [1]   3   1   3  22  69 149 298 453 632 686 614 498 306 169  66  24   5   2
## 
## $density
##  [1] 0.0015 0.0005 0.0015 0.0110 0.0345 0.0745 0.1490 0.2265 0.3160 0.3430
## [11] 0.3070 0.2490 0.1530 0.0845 0.0330 0.0120 0.0025 0.0010
## 
## $mids
##  [1] -0.75 -0.25  0.25  0.75  1.25  1.75  2.25  2.75  3.25  3.75  4.25  4.75
## [13]  5.25  5.75  6.25  6.75  7.25  7.75
## 
## $xname
## [1] "s3"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## 
## $counts
##  [1]   5  34  66 146 283 461 670 677 615 489 281 167  74  22   6   4
## 
## $density
##  [1] 0.0025 0.0170 0.0330 0.0730 0.1415 0.2305 0.3350 0.3385 0.3075 0.2445
## [11] 0.1405 0.0835 0.0370 0.0110 0.0030 0.0020
## 
## $mids
##  [1] 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25 5.75 6.25 6.75 7.25
## [16] 7.75
## 
## $xname
## [1] "s4"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Compare Questions 5 and 6. Is Familarity closer to the normal distribution than WrittenCount?

Yes, Familiarity is closer to the normal distribution than WrittenCount.

7 Is log(WrittenCount) normally distributed?

Here we go again: Repeat all parts of 5a and 5b using log(WrittenCount) instead of WrittenCount. (Or, you can use your function.)

d$log.WrittenCount = log(d$WrittenCount)

whether_normal_distribution(d$log.WrittenCount)

## $breaks
##  [1] -3 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12 13
## 
## $counts
##  [1]   2   0  16  51 145 347 604 824 817 611 372 148  52   9   1   1
## 
## $density
##  [1] 0.00050 0.00000 0.00400 0.01275 0.03625 0.08675 0.15100 0.20600 0.20425
## [10] 0.15275 0.09300 0.03700 0.01300 0.00225 0.00025 0.00025
## 
## $mids
##  [1] -2.5 -1.5 -0.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5
## [16] 12.5
## 
## $xname
## [1] "s1"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1] -1  0  1  2  3  4  5  6  7  8  9 10 11 12 13
## 
## $counts
##  [1]   8  60 137 341 614 863 775 633 365 147  37  16   3   1
## 
## $density
##  [1] 0.00200 0.01500 0.03425 0.08525 0.15350 0.21575 0.19375 0.15825 0.09125
## [10] 0.03675 0.00925 0.00400 0.00075 0.00025
## 
## $mids
##  [1] -0.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5
## 
## $xname
## [1] "s2"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1] -3 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12
## 
## $counts
##  [1]   1   2  10  53 143 345 591 828 830 612 361 158  54   9   3
## 
## $density
##  [1] 0.00025 0.00050 0.00250 0.01325 0.03575 0.08625 0.14775 0.20700 0.20750
## [10] 0.15300 0.09025 0.03950 0.01350 0.00225 0.00075
## 
## $mids
##  [1] -2.5 -1.5 -0.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5
## 
## $xname
## [1] "s3"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1] -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12
## 
## $counts
##  [1]   1   9  43 129 351 615 830 854 614 350 144  48  10   2
## 
## $density
##  [1] 0.00025 0.00225 0.01075 0.03225 0.08775 0.15375 0.20750 0.21350 0.15350
## [10] 0.08750 0.03600 0.01200 0.00250 0.00050
## 
## $mids
##  [1] -1.5 -0.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5
## 
## $xname
## [1] "s4"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Is log(WrittenCount) normally distributed?

Yes, log(WrittenCount) is approximately normally distributed.

8 If we sample the mean of WrittenCount, is that normally distributed?

Remember the sample() function? That lets you take a sample from your data. Fill in the function below. There is a for loop here, which means that what is inside the curly brackets is done repeatedly.

8a.

a = NULL
for (i in 1:1000) {
  written.count.sample = sample(d$WrittenCount, 1000)
  # TODO: replace the NULL to the left, call the function sample() to get a sample 
                         # of 1000 data points from d$WrittenCount, 
  mean.sample = mean(written.count.sample) #= # TODO: take the mean of written.count.sample
  a = append(a, mean.sample)
}
head(a)

## [1]  959.246 1130.523 1105.910  939.783 1196.877  956.866

a contains 1000 numbers; these numbers are all sample means of 1000 data points sampled from d$WrittenCount.

8b.

Make a histogram of the variable a from 8a.

hist(a)

Why does this look normal, even though d$WrittenCount itself is not normal? (HINT: you might google “sample mean” and “central limit theorem.”)

The sample mean from a group of observations is an estimate of the population mean. According to central limit theorem, when you take sufficiently large random samples from the population, the distribution of the sample means will be approximately normally distributed. Thus, when we sample 1000 data points from the column of Written Count many times, the sample means (stored in vector a) would be approximately normally distributed, as in the histogram above.

9 Practicing ggplot(): make two histograms next to each other, one for young and one for old.

Now we are interested in using the kinds of statistics we’ve been practicing above to ask some questions about our data.

The first question is: are the RTS of young participants actually faster than the RTs of older participants? Or is that just due to chance?

First, using ggplot(), geom_histogram(), and facet_grid(), make 2 histograms, side by side, that show the distribution of young participants and old participants.

EXTRA CREDIT (3 pts): The histograms, side by side, aren’t really the best way to visualize this. Try to come up with a better way to make a visualization that answers the question: are young and old RTs really systematiaclly different?

young <- filter(d, d$AgeSubject == 'young'); young

## # A tibble: 2,283 x 39
##    RTlexdec RTnaming Familiarity Word  AgeSubject WordCategory WrittenFrequency
##       <dbl>    <dbl>       <dbl> <chr> <chr>      <chr>                   <dbl>
##  1     6.54     6.15        2.37 doe   young      N                        3.91
##  2     6.30     6.14        5.6  stre… young      N                        6.51
##  3     6.42     6.13        3.87 pork  young      N                        5.02
##  4     6.45     6.20        3.93 plug  young      N                        4.89
##  5     6.53     6.17        3.27 prop  young      N                        4.77
##  6     6.37     6.12        3.73 dawn  young      N                        6.38
##  7     6.27     6.10        5.67 dog   young      N                        7.16
##  8     6.61     6.12        3.1  arc   young      N                        4.89
##  9     6.28     6.18        4.43 skirt young      N                        5.93
## 10     6.61     6.20        3.27 spree young      N                        3.40
## # … with 2,273 more rows, and 32 more variables:
## #   WrittenSpokenFrequencyRatio <dbl>, FamilySize <dbl>,
## #   DerivationalEntropy <dbl>, InflectionalEntropy <dbl>,
## #   NumberSimplexSynsets <dbl>, NumberComplexSynsets <dbl>,
## #   LengthInLetters <dbl>, Ncount <dbl>, MeanBigramFrequency <dbl>,
## #   FrequencyInitialDiphone <dbl>, ConspelV <dbl>, ConspelN <dbl>,
## #   ConphonV <dbl>, ConphonN <dbl>, ConfriendsV <dbl>, ConfriendsN <dbl>,
## #   ConffV <dbl>, ConffN <dbl>, ConfbV <dbl>, ConfbN <dbl>,
## #   NounFrequency <dbl>, VerbFrequency <dbl>, CV <chr>, Obstruent <chr>,
## #   Frication <chr>, Voice <chr>, FrequencyInitialDiphoneWord <dbl>,
## #   FrequencyInitialDiphoneSyllable <dbl>, CorrectLexdec <dbl>, RT <dbl>,
## #   WrittenCount <dbl>, log.WrittenCount <dbl>

old <- filter(d, d$AgeSubject == 'old'); old

## # A tibble: 2,284 x 39
##    RTlexdec RTnaming Familiarity Word  AgeSubject WordCategory WrittenFrequency
##       <dbl>    <dbl>       <dbl> <chr> <chr>      <chr>                   <dbl>
##  1     6.66     6.42        2.37 doe   old        N                        3.91
##  2     6.68     6.64        4.43 whore old        N                        4.52
##  3     6.54     6.49        5.6  stre… old        N                        6.51
##  4     6.55     6.40        3.87 pork  old        N                        5.02
##  5     6.64     6.42        3.93 plug  old        N                        4.89
##  6     6.76     6.53        3.27 prop  old        N                        4.77
##  7     6.60     6.47        3.73 dawn  old        N                        6.38
##  8     6.57     6.42        5.67 dog   old        N                        7.16
##  9     6.82     6.47        3.1  arc   old        N                        4.89
## 10     6.66     6.55        4.43 skirt old        N                        5.93
## # … with 2,274 more rows, and 32 more variables:
## #   WrittenSpokenFrequencyRatio <dbl>, FamilySize <dbl>,
## #   DerivationalEntropy <dbl>, InflectionalEntropy <dbl>,
## #   NumberSimplexSynsets <dbl>, NumberComplexSynsets <dbl>,
## #   LengthInLetters <dbl>, Ncount <dbl>, MeanBigramFrequency <dbl>,
## #   FrequencyInitialDiphone <dbl>, ConspelV <dbl>, ConspelN <dbl>,
## #   ConphonV <dbl>, ConphonN <dbl>, ConfriendsV <dbl>, ConfriendsN <dbl>,
## #   ConffV <dbl>, ConffN <dbl>, ConfbV <dbl>, ConfbN <dbl>,
## #   NounFrequency <dbl>, VerbFrequency <dbl>, CV <chr>, Obstruent <chr>,
## #   Frication <chr>, Voice <chr>, FrequencyInitialDiphoneWord <dbl>,
## #   FrequencyInitialDiphoneSyllable <dbl>, CorrectLexdec <dbl>, RT <dbl>,
## #   WrittenCount <dbl>, log.WrittenCount <dbl>

y = tibble(RT=young$RT, Age='young')
o = tibble(RT=old$RT, Age='old')

both = bind_rows(y, o); both

## # A tibble: 4,567 x 2
##       RT Age  
##    <dbl> <chr>
##  1  695. young
##  2  547. young
##  3  617. young
##  4  633. young
##  5  687. young
##  6  584. young
##  7  527. young
##  8  741. young
##  9  536. young
## 10  740. young
## # … with 4,557 more rows

# Side by side
side.by.side = ggplot(both, aes(x=RT, fill=Age)) +
  geom_histogram(alpha=.6)  +
  facet_grid(. ~ Age)
print(side.by.side)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Extra Credit: Better way to visualize 
better.way = ggplot(both, aes(x=RT, fill=Age)) +
  geom_histogram(alpha=0.5, position="identity")
print(better.way)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

10 Quantitative comparison of Reaction Times for young and old participants [This que]

To figure out if there is really a difference, we need to compare to something. So let’s compare to random chance.

As a first step, let’s get a data frame that has the following columns: word, RT for young people on that word, RT for older people on that word.

I’ll do this for you, but pay attention to the code so you can follow along!

d.diff = mutate(d, Word = paste(Word, WordCategory)) %>%
  select(Word, AgeSubject, RT) %>%
  spread(AgeSubject, RT) %>%
  mutate(RT.diff = old - young)

d.diff

## # A tibble: 2,284 x 4
##    Word      old young RT.diff
##    <chr>   <dbl> <dbl>   <dbl>
##  1 ace N    776.  624.   152. 
##  2 act V    716.  617.    98.4
##  3 add V    742.  576.   166. 
##  4 age N    748.  592.   156. 
##  5 aid V    825.  542.   283. 
##  6 aide N   895.  694.   202. 
##  7 ail V    876.  749.   127. 
##  8 aim V    734.  592.   142. 
##  9 air N    688.  512.   176. 
## 10 aisle N  860.  663.   197. 
## # … with 2,274 more rows

Cool! Now we have a data frame d.diff, which contains a Word (concatenated with its part of speech, so “launch” appears as “launch N” or “launch V”), its RT for older participants, its RT for younger participants, and the difference.

Here is a histogram of RT.diff. It’s not quite normal, but we will pretend like it is.

hist(d.diff$RT.diff, breaks=30)

rt.diff.mean = mean(d.diff$RT.diff, na.rm=T); rt.diff.mean

## [1] 157.1567

rt.diff.sd = sd(d.diff$RT.diff, na.rm=T); rt.diff.sd

## [1] 72.71865

Now here is your task. We are going to consider the following scenario: let’s say that the difference between old and young RTs for each word is generated by a normal distribution with mean 0. This is equivalent to saying there is no difference between young and old RTs.

Let’s simulate that scenario 1000 times. Each time we do the simulation, we record the simulation mean in the vector a.

# Replace the NULL's when you do the task, those are just placeholders
a = NULL
for (i in 1:1000) {
  r.norm = rnorm(2284, 0, rt.diff.sd)
  # simulate 2284 data points (which is the number of rows in d.diff) drawn from a 
           # normal distribution with mean 0 and sd == rt.diff.sd
  r.norm.mean  = mean(r.norm) # take the mean of r.norm
  a = append(a, r.norm.mean)
}

10a

Show quantitatively how often the observed mean (which we computed as rt.diff.mean) is larger than our 1000 simulated sample means.

mean(a < rt.diff.mean)

## [1] 1

10b

Now let’s work on a plot that I’ll help make for you, but, using your knowledge of the code, fill in the axis labels with informative labels and write a caption (2-3 sentences) describing the plot. In particular, look at the red point and understand what it means. Look at the code for the plot to make sure you understand what it’s doing.

Caption: The red dot represents the mean difference in RT between young and old subjects. The histogram in black represents the distribution of 1000 randomly generated sample means with a mean of 0 and within 1 standard deviation of the observed difference in RT between young and old subjects.

means = tibble(simulated_mean=a)
ggplot(means, aes(x=simulated_mean)) +
  geom_histogram(bins=100) +
  xlab("Mean Difference in RT") + # fill in the x-axis label here
  ylab("Frequency") + # fill in the y-axis label here
  geom_point(aes(x=rt.diff.mean, y=0), size=1, colour="red")

What does this plot tell us? Could the difference between the young RTs and the old RTs plausibly be due to chance?

The red dot is the observed (actual) mean difference in RT between old and young subjects (for all the words). If this difference is simply due to chance, it should be very close to (or within the range of) the histogram representing the randomly generated sample means. However, the red dot is far from that range. This suggests that the difference between young RTs and old RTs is not due to chance; there’s actually a difference between young people’s RT and old people’s RT.

11 Plot WrittenCount against RT

Using ggplot, plot RT as a function of WrittenCount. This means WrittenCount on the x-asis, RT on the y-axis.

ggplot(d, aes(x=WrittenCount, y=RT)) +
  geom_point()

12 Plot log(WrittenCount) against log(RT)

Using ggplot, plot log(RT) as a function of log(WrittenCount). This means log(WrittenCount) on the x-asis, log(RT) on the y-axis.

ggplot(d, aes(x=log(WrittenCount), y=log(RT))) +
  geom_point()

Compare the plots in Q11 and Q12. If you were to plot a “line of best fit,” which one would be a better fit to a straight line?

If I were to plot a line of best fit, the plot in Q12 would be a better fit to a straight line. The relationship between log(WrittenCount) and log(RT) seems more linear than the relationship between WrittenCount and RT.

13 Testing the effect of WrittenCount and RT

Propose a method, using simulation like in Q10, for assessing whether there is a significant effect of WrittenCount on RT.

To assess whether there’s a significant effect of WrittenCount on RT, we could first group the data based on value of Written Count and find the mean RT for each value of Written Count. Then, we could generate simulated means of mean RTs with the mean of mean RT and sd of mean RT. Finally, we make a plot of the simulated means and compare it to the overall mean RT in the data. If the overall mean RT in the data is far away from the simulated means, then the effect is significant.

EXTRA CREDIT: Implement this method and draw a conclusion!

mean.RT = mean(d$RT); mean.RT

## [1] 708.1572

# group the data by WrittenCount
# and find the mean RT for each value of WrittenCount
df.wc = group_by(d, WrittenCount) %>%
  summarise(RT.mean = mean(RT)); df.wc

## # A tibble: 882 x 2
##    WrittenCount RT.mean
##           <dbl>   <dbl>
##  1         1       808.
##  2         2.00    783.
##  3         3.      824.
##  4         4.00    791.
##  5         5.00    817.
##  6         6.00    865.
##  7         7.00    851.
##  8         8.      805.
##  9         9.      819.
## 10        10.      811.
## # … with 872 more rows

# find the overall mean and sd of mean RTs
mean.RT.mean = mean(df.wc$RT.mean); mean.RT.mean

## [1] 669.7288

mean.RT.sd = sd(df.wc$RT.mean); mean.RT.sd

## [1] 48.70726

wc.nrow = nrow(df.wc); wc.nrow

## [1] 882

# randomly generate 1000 simulated means
# with the mean being mean of mean RTs
# and the sd being sd of mean RTs
b = NULL
for (i in 1:1000) {
  r.norm = rnorm(wc.nrow, mean.RT.mean, mean.RT.sd)
  r.norm.mean  = mean(r.norm) # take the mean of r.norm
  b = append(b, r.norm.mean)
}
hist(b)

simulated.means = tibble(simulated_mean=b)

ggplot(simulated.means, aes(x=simulated_mean)) +
  geom_histogram() +
  geom_vline(xintercept = mean.RT)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Ling 104 Problem Set 2: Means and Models

Lexie Wang

3/29/2021