In the first part of this project, you will read in some data from a lexical decision time task for English. You can read about it here, on pages 30-31, for the data set called “english.” (This documentation is the standard documentation for R packages, so it is good to get familiar with reading it.)
That data is saved in a CSV file called “english.csv”. Below, I read in the data for you, as a data frame called d.
Here is how a lexical decision task works. In a lexical decision task, a subject is presented with a word (like “dog”) or a plausible non-word (like “florp”) and asked to judge as quickly as possible whether or not what they saw is a word.
How fast people can make a decision reflects something about the psychological response of the subject to the word in question. In this part of the problem set, you will find out what sorts of things are predictive of lexical decision time.
Below, we read in the data set and make a column called “RT,” which is the average time in milliseconds that it took participants to respond to a given word.
library(tidyverse)  # provides read_csv(), filter(), group_by(), ggplot(), %>%
d = read_csv("english.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## AgeSubject = col_character(),
## WordCategory = col_character(),
## CV = col_character(),
## Obstruent = col_character(),
## Frication = col_character(),
## Voice = col_character()
## )
## See spec(...) for full column specifications.
d$RT = exp(d$RTlexdec)
head(d)
## # A tibble: 6 x 37
## RTlexdec RTnaming Familiarity Word AgeSubject WordCategory WrittenFrequency
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 6.54 6.15 2.37 doe young N 3.91
## 2 6.30 6.14 5.6 stre… young N 6.51
## 3 6.42 6.13 3.87 pork young N 5.02
## 4 6.45 6.20 3.93 plug young N 4.89
## 5 6.53 6.17 3.27 prop young N 4.77
## 6 6.37 6.12 3.73 dawn young N 6.38
## # … with 30 more variables: WrittenSpokenFrequencyRatio <dbl>,
## # FamilySize <dbl>, DerivationalEntropy <dbl>, InflectionalEntropy <dbl>,
## # NumberSimplexSynsets <dbl>, NumberComplexSynsets <dbl>,
## # LengthInLetters <dbl>, Ncount <dbl>, MeanBigramFrequency <dbl>,
## # FrequencyInitialDiphone <dbl>, ConspelV <dbl>, ConspelN <dbl>,
## # ConphonV <dbl>, ConphonN <dbl>, ConfriendsV <dbl>, ConfriendsN <dbl>,
## # ConffV <dbl>, ConffN <dbl>, ConfbV <dbl>, ConfbN <dbl>,
## # NounFrequency <dbl>, VerbFrequency <dbl>, CV <chr>, Obstruent <chr>,
## # Frication <chr>, Voice <chr>, FrequencyInitialDiphoneWord <dbl>,
## # FrequencyInitialDiphoneSyllable <dbl>, CorrectLexdec <dbl>, RT <dbl>
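The line d$RT = exp(d$RTlexdec) suggests that RTlexdec stores reaction times on a natural-log scale, so exponentiating returns them to milliseconds. That reading is an assumption based on the code above. The sketch below applies the same transformation to two rounded RTlexdec values copied from the head() output ("doe" and "dawn"); because those displayed values are rounded, the results only approximate the RT column.

```r
# Rounded RTlexdec values taken from the head(d) output above.
log_rt <- c(doe = 6.54, dawn = 6.37)

# Undo the (assumed natural) log to get back to milliseconds.
rt_ms <- exp(log_rt)
print(round(rt_ms, 1))

# Taking the log again recovers the values we started from.
print(all.equal(log(rt_ms), log_rt))
```

A difference of about 0.17 on the log scale corresponds here to a difference of roughly 100 ms, which is why the RT column is easier to interpret than RTlexdec.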
Using the filter command, filter to just the rows for which the “Word” is dog. There should be two rows: one for “young” subjects and one for “old” subjects.
Remember: the RT column tells you the average lexical decision time in milliseconds.
d_dog <- filter(d, Word == 'dog')
d_dog
## # A tibble: 2 x 37
## RTlexdec RTnaming Familiarity Word AgeSubject WordCategory WrittenFrequency
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 6.27 6.10 5.67 dog young N 7.16
## 2 6.57 6.42 5.67 dog old N 7.16
## # … with 30 more variables: WrittenSpokenFrequencyRatio <dbl>,
## # FamilySize <dbl>, DerivationalEntropy <dbl>, InflectionalEntropy <dbl>,
## # NumberSimplexSynsets <dbl>, NumberComplexSynsets <dbl>,
## # LengthInLetters <dbl>, Ncount <dbl>, MeanBigramFrequency <dbl>,
## # FrequencyInitialDiphone <dbl>, ConspelV <dbl>, ConspelN <dbl>,
## # ConphonV <dbl>, ConphonN <dbl>, ConfriendsV <dbl>, ConfriendsN <dbl>,
## # ConffV <dbl>, ConffN <dbl>, ConfbV <dbl>, ConfbN <dbl>,
## # NounFrequency <dbl>, VerbFrequency <dbl>, CV <chr>, Obstruent <chr>,
## # Frication <chr>, Voice <chr>, FrequencyInitialDiphoneWord <dbl>,
## # FrequencyInitialDiphoneSyllable <dbl>, CorrectLexdec <dbl>, RT <dbl>
# The average RT of the young subjects
filter(d_dog, AgeSubject == 'young') %>%
select(RT)
## # A tibble: 1 x 1
## RT
## <dbl>
## 1 527.
# The average RT of the old subjects
filter(d_dog, AgeSubject == 'old') %>%
select(RT)
## # A tibble: 1 x 1
## RT
## <dbl>
## 1 715.
The average RT of the old subjects is 715.13. The average RT of the young subjects is 526.82.
Make a histogram using hist(), with breaks=30, that looks at the distribution of RTs across the entire data set.
hist(d$RT, breaks = 30)
How many peaks do you see? Two.
By filtering the data, now make a histogram for just the young people and for just the old people.
# Histogram for just the young people
hist(filter(d, AgeSubject == 'young')$RT, breaks=30)
# Histogram for just the old people
hist(filter(d, AgeSubject == 'old')$RT, breaks=30)
Who is faster overall: the young people or the old people? Young people are faster overall.
Do you see why there were two peaks in Q2 above? There are two peaks because one peak is for young people and the other peak is for old people. Based on the two histograms for young and old people respectively in question 3, the reaction time of young people is shorter overall. Thus, for the entire dataset, there will be two peaks, as the peaks for young and old people don’t overlap. The first peak is for young people, and the second peak is for old people.
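The two-peak explanation can be illustrated with simulated data. In this minimal sketch, the group means and standard deviations are made-up values chosen only to resemble two age groups loosely; they are not estimates from english.csv.

```r
set.seed(1)  # make the simulated numbers reproducible

# Made-up reaction times (ms) for two groups; all parameters are assumptions.
sim <- data.frame(
  AgeSubject = rep(c("young", "old"), each = 500),
  RT = c(rnorm(500, mean = 630, sd = 40), rnorm(500, mean = 800, sd = 50))
)

# Group means give a numeric answer to "who is faster overall?".
group_means <- tapply(sim$RT, sim$AgeSubject, mean)
print(group_means)

# Pooling the two groups produces a histogram with one peak per group.
hist(sim$RT, breaks = 30, main = "Pooled simulated RTs", xlab = "RT (ms)")
```

On the real data, the same tapply() call on d$RT and d$AgeSubject would back up the visual comparison of the two histograms with numbers.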
Besides the age of the subjects, we now want to understand something about what makes a particular word fast or slow for people to respond to it.
The column Familiarity contains average ratings for how familiar the words are. A word with a familiarity rating of 7 is rated as very familiar to everyone. A word with a familiarity rating of 1 is totally unfamiliar to everyone.
Using either plot() or ggplot(), make a plot that represents each data point in the data frame with Familiarity on the x-axis and RT on the y-axis.
What do you see? In general, average lexical decision time is lower for words with higher familiarity ratings and higher for words with lower familiarity ratings. In addition, there seem to be two layers of points with a small gap between them, especially for words with high familiarity ratings. This is probably an effect of age: the bottom layer represents young people, while the upper layer represents old people. For words with lower familiarity ratings, this difference becomes less obvious.
ggplot(d, aes(x = Familiarity, y = RT)) +
geom_point()
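One way to check whether the two layers really correspond to the two age groups is to color the points by AgeSubject. The sketch below does this with base plot() on toy data; the slope, the age offset, and the noise level are invented for illustration and are not fitted to the real data set.

```r
set.seed(2)
n <- 200
toy <- data.frame(
  Familiarity = runif(n, min = 1, max = 7),
  AgeSubject = rep(c("young", "old"), each = n / 2)
)
# Assumed relationship: RT falls with familiarity, and old subjects are slower.
toy$RT <- 800 - 30 * toy$Familiarity +
  ifelse(toy$AgeSubject == "old", 150, 0) + rnorm(n, sd = 20)

# One color per age group makes the two layers easy to tell apart.
plot(toy$Familiarity, toy$RT,
     col = ifelse(toy$AgeSubject == "old", "firebrick", "steelblue"),
     xlab = "Familiarity", ylab = "RT (ms)")
legend("topright", legend = c("old", "young"),
       col = c("firebrick", "steelblue"), pch = 1)

# Correlation within each group; negative values mean familiar words are faster.
cors <- sapply(split(toy, toy$AgeSubject),
               function(g) cor(g$Familiarity, g$RT))
print(cors)
```

With ggplot(), the equivalent step would be to add color = AgeSubject inside aes().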
Now let’s look at RT as a function of CorrectLexdec. (This means CorrectLexdec goes on the x-axis and RT on the y-axis.) CorrectLexdec is how many people (out of 30) correctly identified that the word was a word. For instance, CorrectLexdec for “doe” for young people is 27. This means 27/30 people knew that “doe” was an English word. 3 got it wrong and said it was not a word.
Ok, here’s our plot.
plot(d$CorrectLexdec, d$RT)
Hmm, it is hard to see what’s going on in this plot. I think it will be easier if we first get the average RT for each level of CorrectLexdec and then plot it.
Using group_by() and summarize(), make a new data frame called d.lexdec in which you find the mean RT for each level of CorrectLexdec. That is, it will have a row for CorrectLexdec == 30, in which it says the average RT for all words for which CorrectLexdec == 30.
d.lexdec <-
group_by(d, CorrectLexdec) %>%
summarise(mean.RT = mean(RT))
d.lexdec
## # A tibble: 30 x 2
## CorrectLexdec mean.RT
## <dbl> <dbl>
## 1 1 654.
## 2 2 743.
## 3 3 720.
## 4 4 822.
## 5 5 727.
## 6 6 809.
## 7 7 761.
## 8 8 784
## 9 9 718.
## 10 10 829.
## # … with 20 more rows
Now that you made this data frame d.lexdec, make a plot in which CorrectLexdec is on the x-axis and the y-axis is the mean RT. This plot should have one data point per x-value, not many as in the plot above.
plot(d.lexdec$CorrectLexdec, d.lexdec$mean.RT)
Describe the pattern you see: For the right half of the plot (approximately CorrectLexdec > 15), the mean reaction time decreases as CorrectLexdec increases; once CorrectLexdec reaches 27, the mean reaction time remains relatively unchanged. For the left half of the plot (approximately CorrectLexdec < 15), the dots are more scattered.
Extra credit: What is causing this pattern!? When CorrectLexdec > 15, most people are able to correctly identify whether the word is a word. It’s possible that when more people are able to make a correct decision, the decision is an easier one to make (i.e. it’s easier to identify whether the word is a real word). When more than 27 out of 30 people are able to make a correct decision, it could be very easy to identify whether the word is a real word, so the average reaction times are very low and stay at about the same level. However, when CorrectLexdec < 15, most people are unable to make the correct decision. The mean reaction time varies a lot, as some people may spend more time thinking.
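Another possibility worth checking is how many words contribute to each mean: a mean based on a handful of words is much noisier than one based on hundreds. The sketch below extends the summarise() call with a count and a standard error. It runs on toy data in which the group sizes are invented, so it shows how to do the check rather than what the real data contain.

```r
library(dplyr)

# Toy stand-in for d: few words at low CorrectLexdec, many at high values.
# Group sizes and the RT distribution are assumptions for illustration.
set.seed(3)
toy <- data.frame(
  CorrectLexdec = c(rep(5, 3), rep(10, 4), rep(28, 200), rep(30, 400)),
  RT = rnorm(607, mean = 700, sd = 80)
)

toy.lexdec <- toy %>%
  group_by(CorrectLexdec) %>%
  summarise(
    mean.RT = mean(RT),
    n = n(),                # number of words behind each mean
    se = sd(RT) / sqrt(n)   # standard error of the mean
  )
print(toy.lexdec)
```

If the real d.lexdec shows small n for low values of CorrectLexdec, the scatter on the left of the plot may partly reflect sampling noise rather than a psychological effect.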
Give a brief overall summary of your analysis of this data set, based on what we have seen so far.
In general, young people react faster than old people. Overall, mean reaction time decreases as familiarity rating increases; however, the difference between young and old people’s reaction times appears larger at high familiarity ratings and is less obvious when familiarity is low. When CorrectLexdec > 15, mean reaction time decreases as CorrectLexdec increases and levels off at around 27; when CorrectLexdec < 15, the mean reaction time varies a lot.
In this second part of the problem set, you will analyze the dative alternation in English.
The dative alternation refers to the fact that English has two options for conveying benefactive sentences: “Alex gave Terry the cake.” or “Alex gave the cake to Terry.” We call the first of these the NP realization (because there are two noun phrases after the verb). We call the second of these the PP realization, because there is a prepositional phrase.
It turns out that there is rich and fascinating statistical structure in which of these choices we make when we are speaking!
The data set dative.csv contains some data on which choice we make. First, some terminology:
The Recipient (abbreviated Rec in column names in our data set) in these sentences is the person who gets something. In the above, it’s Terry since Terry gets the cake. The “theme” is the thing that gets given: the cake.
The dependent variable is RealizationOfRecipient, and it is NP for the NP case and PP for the PP case.
To learn about the other variables, read pages 21-22 here.
Read in the data set dative.csv.
dative <- read_csv('dative.csv')
## Parsed with column specification:
## cols(
## Speaker = col_character(),
## Modality = col_character(),
## Verb = col_character(),
## SemanticClass = col_character(),
## LengthOfRecipient = col_double(),
## AnimacyOfRec = col_character(),
## DefinOfRec = col_character(),
## PronomOfRec = col_character(),
## LengthOfTheme = col_double(),
## AnimacyOfTheme = col_character(),
## DefinOfTheme = col_character(),
## PronomOfTheme = col_character(),
## RealizationOfRecipient = col_character(),
## AccessOfRec = col_character(),
## AccessOfTheme = col_character()
## )
Is this data tidy? Why or why not?
This data is tidy. This is because each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
Let’s think about LengthOfTheme and LengthOfRecipient (the number of words in the theme and recipient, respectively).
In one line of code, show the proportion of the time that LengthOfTheme is longer than LengthOfRecipient.
mean(dative$LengthOfTheme > dative$LengthOfRecipient)
## [1] 0.6693227
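The one-liner works because a comparison returns a logical vector, and mean() treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A tiny worked example with made-up lengths:

```r
# Made-up lengths for five sentences (illustrative values only).
theme_len <- c(3, 1, 5, 2, 4)
rec_len   <- c(1, 1, 2, 4, 1)

# Element-wise comparison: is the theme strictly longer than the recipient?
longer <- theme_len > rec_len
print(longer)

print(sum(longer))   # number of TRUE values
print(mean(longer))  # proportion of TRUE values
```

Note that ties (equal lengths) count as FALSE here, so they lower the proportion.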
Using group_by() and summarise(), make two new data frames: one showing the proportion of the time the RealizationOfRecipient is “NP” as a function of the LengthOfTheme, another as a function of LengthOfRecipient. Make a plot for each, where the proportion of the time RealizationOfRecipient is “NP” is the value on the y-axis.
dative_df1 <-
group_by(dative, LengthOfTheme) %>%
summarise(Proportion = mean(RealizationOfRecipient == 'NP'))
dative_df2 <-
group_by(dative, LengthOfRecipient) %>%
summarise(Proportion = mean(RealizationOfRecipient == 'NP'))
ggplot(dative_df1, aes(x = LengthOfTheme, y = Proportion)) +
geom_point()
ggplot(dative_df2, aes(x = LengthOfRecipient, y = Proportion)) +
geom_point()
Describe what you learned from the graphs in 12. Why do you think we see these patterns?
From the graphs above, we can see that when the length of the theme is very low, the realization of the recipient is less likely to be NP. As the length of the theme increases, NP realization becomes more likely. When the length of the theme reaches 17, the realization of the recipient is almost entirely NP (with one outlier). This could possibly be explained by the principle of end weight: speakers have a strong tendency to place the longer phrase at the end. When the theme is long, speakers tend to use the NP structure and place the theme at the end of the sentence.
From the other graph, we can see that as the length of the recipient increases, the proportion of NP realizations decreases (with some outliers), so PP realization becomes more likely. When the length of the recipient is higher than 20, the realization of the recipient is entirely PP. This, too, can be explained by the principle of end weight. When the recipient is long, speakers tend to use the PP structure and place the recipient at the end of the sentence, after the theme.
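If end weight is the driver, what should matter is which phrase is longer, not each length on its own. A minimal sketch of that check, using a toy data frame with invented rows in place of the dative data:

```r
# Toy stand-in for the dative data (all values invented for illustration).
toy <- data.frame(
  LengthOfTheme          = c(1, 2, 5, 6, 1, 1, 4, 2),
  LengthOfRecipient      = c(1, 1, 1, 2, 3, 4, 1, 5),
  RealizationOfRecipient = c("PP", "NP", "NP", "PP", "PP", "PP", "NP", "NP")
)

# Positive values mean the theme is the longer phrase.
toy$LengthDiff <- toy$LengthOfTheme - toy$LengthOfRecipient
toy$Longer <- ifelse(toy$LengthDiff > 0, "theme longer",
              ifelse(toy$LengthDiff < 0, "recipient longer", "equal"))

# Proportion of NP realizations within each category.
prop_np <- tapply(toy$RealizationOfRecipient == "NP", toy$Longer, mean)
print(prop_np)
```

Under the end-weight account, the real data should show a higher NP proportion in the "theme longer" category than in the "recipient longer" category.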
Now, find the proportion of the time each verb is realized as NP, by grouping on Verb. Rank them in order of most NP-realized to least NP-realized verbs.
dative_verb <-
group_by(dative, Verb) %>%
summarise(Proportion = mean(RealizationOfRecipient == 'NP'))
arrange(dative_verb, desc(Proportion))
## # A tibble: 75 x 2
## Verb Proportion
## <chr> <dbl>
## 1 accord 1
## 2 allow 1
## 3 assess 1
## 4 assure 1
## 5 bet 1
## 6 cost 1
## 7 fine 1
## 8 flip 1
## 9 float 1
## 10 guarantee 1
## # … with 65 more rows
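Many verbs at the top of this ranking have a proportion of exactly 1, which can happen simply because a verb occurs only a few times. A hedged variant of the ranking keeps the count next to the proportion and drops rarely observed verbs. The cutoff of 5 is a hypothetical choice, and the toy data below are invented for illustration.

```r
library(dplyr)

# Toy stand-in for the dative data (verbs and counts invented for illustration).
toy <- data.frame(
  Verb = c(rep("give", 10), rep("cost", 2), rep("send", 6)),
  RealizationOfRecipient = c(rep("NP", 8), rep("PP", 2),   # give
                             rep("NP", 2),                 # cost
                             rep("NP", 2), rep("PP", 4))   # send
)

toy_verb <- toy %>%
  group_by(Verb) %>%
  summarise(Proportion = mean(RealizationOfRecipient == "NP"), n = n()) %>%
  filter(n >= 5) %>%          # hypothetical cutoff; drops rarely observed verbs
  arrange(desc(Proportion))
print(toy_verb)
```

Here "cost" is removed because it has only two observations, even though its NP proportion is 1.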
The next couple questions are a bit more open-ended and give you a chance to come up with your own analyses (an important skill for quantitative linguists!).
Give an analysis that you find convincing that explores the effect of Modality (written or spoken) on the RealizationOfRecipient.
dative_modality <-
group_by(dative, Modality) %>%
summarise(Proportion = mean(RealizationOfRecipient == 'NP'))
dative_modality
## # A tibble: 2 x 2
## Modality Proportion
## <chr> <dbl>
## 1 spoken 0.788
## 2 written 0.615
When the modality is spoken, the realization of the recipient is more likely to be NP than when the modality is written. Either way (spoken or written), NP realization of the recipient is more likely than PP realization.
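To go one step beyond comparing two proportions, a contingency table and a chi-squared test of independence can indicate whether a difference of this size is plausibly due to chance. The counts below are invented for illustration; on the real data the table would come from table(dative$Modality, dative$RealizationOfRecipient). The test also assumes independent observations, which may not hold here because the same speakers and verbs contribute many rows.

```r
# Invented counts for illustration only, not taken from dative.csv.
counts <- matrix(c(80, 20,
                   60, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Modality = c("spoken", "written"),
                                 Realization = c("NP", "PP")))

# Row-wise proportions: share of NP and PP within each modality.
row_props <- prop.table(counts, margin = 1)
print(row_props)

# Chi-squared test of independence between modality and realization.
test <- chisq.test(counts)
print(test$p.value)
```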
Pick 3 variables (columns) that we did not discuss. For each one (or, if you want to get fancy, for combinations of them!) do some basic analyses of the sort we did above. And make some graphs. Based on the graphs and analyses, (a) describe the variables and what they mean (giving examples). And (b) discuss your analysis and conclusions.
Remember, we are building towards a final project where you do your own analyses so this is good practice.
dative_Animacy <-
group_by(dative, AnimacyOfRec) %>%
summarise(Proportion = mean(RealizationOfRecipient == 'NP'))
dative_Animacy
## # A tibble: 2 x 2
## AnimacyOfRec Proportion
## <chr> <dbl>
## 1 animate 0.760
## 2 inanimate 0.481
Animacy_plot <-
  ggplot(data = dative_Animacy, aes(x = AnimacyOfRec, y = Proportion)) +
  geom_bar(stat = "identity") +
  theme_minimal()
Animacy_plot
Animacy expresses how sentient or alive the referent of a noun is. The animacy of the recipient can be animate or inanimate. One example of an animate referent is a cat, and one example of an inanimate referent is a stone. Based on the data and graph above, we can see that when the recipient is animate, NP realization is strongly preferred over PP realization (76% NP). When the recipient is inanimate, PP realization is slightly preferred over NP realization (48.1% NP). In other words, an animate recipient is more likely than an inanimate one to be realized as NP.
dative_Pronominality <-
group_by(dative, PronomOfRec) %>%
summarise(Proportion = mean(RealizationOfRecipient == 'NP'))
dative_Pronominality
## # A tibble: 2 x 2
## PronomOfRec Proportion
## <chr> <dbl>
## 1 nonpronominal 0.488
## 2 pronominal 0.892
Based on the data above, there is an effect of the pronominality of the recipient on the realization of the recipient. A recipient can be pronominal or nonpronominal: pronominality distinguishes phrases headed by pronouns from those headed by nonpronouns (like nouns). An example of a pronominal recipient is ‘her.’ Based on the table above, when the recipient is pronominal, 89.2% of the realizations are NP. When the recipient is nonpronominal, only 48.8% of the realizations are NP. This suggests that a pronominal recipient is more likely to be realized as NP than as PP.
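Animacy and pronominality may overlap (pronouns often refer to people), so it can help to look at the two together. A minimal sketch of a two-way breakdown with tapply(), using an invented toy data frame in place of the dative data:

```r
# Toy stand-in for the dative data (all rows invented for illustration).
toy <- data.frame(
  AnimacyOfRec = rep(c("animate", "inanimate"), each = 4),
  PronomOfRec  = rep(c("pronominal", "pronominal",
                       "nonpronominal", "nonpronominal"), times = 2),
  RealizationOfRecipient = c("NP", "NP", "NP", "PP", "NP", "PP", "PP", "PP")
)

# Proportion of NP realizations for every animacy-by-pronominality cell.
prop_np <- tapply(toy$RealizationOfRecipient == "NP",
                  list(Animacy = toy$AnimacyOfRec,
                       Pronominality = toy$PronomOfRec),
                  mean)
print(prop_np)
```

On the real data, comparing rows within a column (and columns within a row) shows whether each variable still matters once the other is held fixed.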
dative_SemanticClass <-
group_by(dative, SemanticClass) %>%
summarise(Proportion = mean(RealizationOfRecipient == 'NP'))
dative_SemanticClass
## # A tibble: 5 x 2
## SemanticClass Proportion
## <chr> <dbl>
## 1 a 0.821
## 2 c 0.877
## 3 f 0.797
## 4 p 0.987
## 5 t 0.537
scp <-
  ggplot(data = dative_SemanticClass, aes(x = SemanticClass, y = Proportion, fill = SemanticClass)) +
  geom_bar(stat = "identity") +
  theme_minimal()
scp
Semantic class can be a (abstract: ’give it some thought’), c (communication: ’tell, give me your name’), f (future transfer of possession: ’owe, promise’), p (prevention of possession: ’cost, deny’), and t (transfer of possession: ’give an armband, send’) (Baayen & Shafaei-Bajestan, 2019). Based on the graph above, when semantic class is p (prevention of possession), the realization of recipient is almost entirely NP. In contrast, when semantic class is t (transfer of possession), the realization of recipient is about evenly distributed between NP and PP, with NP being the slightly more likely realization. In general, NP realization of recipient is more likely regardless of semantic class.
About how long did this problem set take you? What was your experience doing it, and what would you like more help in understanding? Also feel free to share any feedback on the course.
I’d like more help with exercise questions of this kind. This problem set took me a lot of time, much longer than I expected, but I can’t recall how many hours it took.