Do L1 and L2 speakers differ in their semantic spaces for the same English words? Here, we address this question by asking whether properties of a cue word predict differences in the associates that L1 and L2 speakers produce. In particular, we look at whether the sentiment and concreteness of the cue predict divergence, controlling for the spoken frequency of the word.
To measure divergence, we compare, for each cue, how often each associate was produced by L1 versus L2 speakers (i.e., counts for each cue-associate pair in each group).
d.clean = read.csv("../data/dclean.csv")
We compute a t-score as a measure of divergence for a cue between L1 and L2 speakers.
The approach follows Manning and Schuetze (1999, p. 166), who use t-scores to discover co-occurrences that best distinguish between words within the same corpus. This is slightly different from our goal, because here we want to compare across corpora, and thus our N’s differ. Here, we modify their formula to account for this (analogous to Welch’s two-sample t-test).
Manning and Schuetze: \[ t = \frac{\frac{C(v_1w)}{N}-\frac{C(v_2w)}{N}} {\sqrt{\frac{C(v_1w) + C(v_2w)}{N^2}}} \]
Welch’s version: \[ t = \frac{\frac{C(v_1w)}{n_1}-\frac{C(v_2w)}{n_2}} {\sqrt{\frac{C(v_1w)}{n_1} + \frac{C(v_2w)}{n_2}}} \]
So suppose we want to figure out how divergent responses are for the cue “zucchini” between native and non-native speakers. For each associate that participants produced for this cue, we compare its relative frequency between native and non-native speakers. For example, take the frequency of producing “vegetable” as an associate of “zucchini”: native speakers produced it 37 times, and non-native speakers produced it 5 times. We also need the base rates of producing “vegetable” as an associate to any cue: native speakers produced “vegetable” as an associate 161 times, and non-native speakers 51 times. So, we have:
Cue = “zucchini” (z)
Associate = “vegetable” (v)
\[ C(zv)_{\text{native}}=37\\ C(zv)_{\text{non-native}}=5\\ C(v)_{\text{native}}= 161\\ C(v)_{\text{non-native}} = 51 \]
\[ t = \frac{\frac{37}{161}-\frac{5}{51}} {\sqrt{\frac{37}{161} + \frac{5}{51}}} = .23 \]
We then do this for every associate that was produced for “zucchini”, and take the mean t-score across associates. This is our estimate of how “divergent” responses are between native and non-native speakers for a given cue. Note that POSITIVE values indicate that associates were produced relatively more by native speakers on average, and NEGATIVE values indicate that associates were produced relatively more on average by non-native speakers.
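To make this concrete, here is a minimal sketch in R that reproduces the zucchini/vegetable score by hand (pair_t is a throwaway helper for illustration, not part of the pipeline below):
# t-score for a single cue-associate pair, per the Welch-style formula above
pair_t = function(c_bigram_l1, c_bigram_l2, c_assoc_l1, c_assoc_l2) {
  rf1 = c_bigram_l1 / c_assoc_l1 # relative frequency in L1: C(zv)/C(v)
  rf2 = c_bigram_l2 / c_assoc_l2 # relative frequency in L2
  (rf1 - rf2) / sqrt(rf1 + rf2)
}
pair_t(37, 5, 161, 51) # 0.2301..., i.e. the .23 computed above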
Calculate t-scores
# get dataset with shared bigrams only (cue-associate pairs that occur in both groups)
native.bigrams = unique(filter(d.clean, native.lang == "L1"))
non.native.bigrams = unique(filter(d.clean, native.lang == "L2"))
shared.bigrams = intersect(native.bigrams$bigram,
non.native.bigrams$bigram)
d.common = d.clean %>%
ungroup() %>%
filter(bigram %in% shared.bigrams)
# get C(vw) in each language (bigram counts)
bigram.counts = d.common %>%
group_by(native.lang, bigram, associate, cue) %>%
summarize(n_bigram = n())
# get C(v) in each language (associate counts)
associate.counts = d.common %>%
group_by(native.lang, associate) %>%
summarize(n_associate = n())
# get C(vw)/C(v) in each language
bigram.counts.rf = bigram.counts %>%
left_join(associate.counts, by = c("native.lang", "associate")) %>%
mutate(rf = n_bigram/n_associate)
t.scores <- bigram.counts.rf %>%
ungroup() %>%
as_tibble() %>%
select(native.lang, rf, cue, associate) %>%
spread(native.lang, rf) %>%
filter(!is.na(L1), !is.na(L2)) %>%
mutate(t = (L1 - L2)/sqrt(L1 + L2))
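As a sanity check (assuming the zucchini/vegetable pair survives the shared-bigram filter), the pipeline should roughly reproduce the worked example; the value may differ slightly because base rates here are computed over shared bigrams only:
t.scores %>%
filter(cue == "zucchini", associate == "vegetable") # expect t near the .23 computed by hand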
ggplot(t.scores, aes(x = t)) +
geom_histogram() +
theme_bw() +
ggtitle("distribution of t-scores")
ggplot(t.scores, aes(x = abs(t))) +
geom_histogram() +
theme_bw() +
ggtitle("distribution of absolute t-scores")
As a control, we use spoken word frequencies from SUBTLEXus (http://subtlexus.lexique.org/). Note that we lose some of our sentiment norms below because we don’t have frequency estimates for all the items; let’s think about a better source here.
SUMMARY: Less divergence for high-frequency items (there’s only an effect for abs_t). But clearly not the whole effect.
Join t.scores and cue characteristics
cues.chars = d.clean %>%
group_by(cue) %>%
slice(1) %>%
select(Lg10WF, quant.sent, Conc.M, V.Mean.Sum, A.Mean.Sum, D.Mean.Sum)
t.scores.full = t.scores %>%
group_by(cue) %>%
summarize(t = mean(t),
t_abs = mean(abs(t))) %>%
left_join(cues.chars)
correlate(t.scores.full %>% select(2:4)) %>%
shave() %>%
fashion() %>%
kable()
| rowname | t | t_abs | Lg10WF |
|---|---|---|---|
| t | |||
| t_abs | -.23 | ||
| Lg10WF | -.02 | -.05 |
cor.test(t.scores.full$t, t.scores.full$Lg10WF) %>%
tidy() %>%
select(method, estimate, statistic, p.value) %>%
kable()
| method | estimate | statistic | p.value |
|---|---|---|---|
| Pearson’s product-moment correlation | -0.0164701 | -1.581519 | 0.1137938 |
cor.test(t.scores.full$t_abs, t.scores.full$Lg10WF) %>%
tidy() %>%
select(method, estimate, statistic, p.value) %>%
kable()
| method | estimate | statistic | p.value |
|---|---|---|---|
| Pearson’s product-moment correlation | -0.0466462 | -4.483405 | 7.4e-06 |
ggplot(t.scores.full, aes(x = Lg10WF, y = t)) +
geom_smooth(method = "lm") +
theme_bw()
ggplot(t.scores.full, aes(x = Lg10WF, y = t_abs)) +
geom_smooth(method = "lm") +
theme_bw()
Now let’s see if we can predict the t-score based on characteristics of the cue, controlling for frequency. Three sources of sentiment norms: (1) Warriner et al. (2013; N = 13,915), (2) ANEW (N = 1,030), and (3) Dodds et al. (2011; N = 10,192).
Valence, arousal, and dominance norms for 13,915 lemmas (Warriner et al., 2013). This is the largest set of emotion norms we have. Scales: valence runs from unhappy to happy, and arousal from calm to excited.
SUMMARY: For t, there is more divergence for words that are high arousal (no effect of valence). For abs t, there’s more divergence for words that have low valence (marginal).
cues.ts = t.scores.full %>%
select(cue, t, t_abs, Lg10WF,V.Mean.Sum,A.Mean.Sum, D.Mean.Sum) %>%
distinct() %>%
filter(!is.na(Lg10WF) & !is.na(V.Mean.Sum))
Leaving us with 7380 cues with both frequency and sentiment norms.
Look at correlation between norms
correlate(cues.ts %>% select(-1, -3)) %>% # drop cue and t_abs, keep t + norms
shave() %>%
fashion() %>%
kable()
| rowname | t | Lg10WF | V.Mean.Sum | A.Mean.Sum | D.Mean.Sum |
|---|---|---|---|---|---|
| t | |||||
| Lg10WF | -.00 | ||||
| V.Mean.Sum | .01 | .15 | |||
| A.Mean.Sum | -.03 | .04 | -.17 | ||
| D.Mean.Sum | .01 | .13 | .72 | -.17 |
kable(tidy(lm(t~ V.Mean.Sum + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0154603 | 0.0023015 | -6.717520 | 0.0000000 |
| V.Mean.Sum | 0.0004325 | 0.0003472 | 1.245518 | 0.2129808 |
| Lg10WF | -0.0001368 | 0.0006678 | -0.204854 | 0.8376918 |
ggplot(cues.ts, aes(x = V.Mean.Sum, y = t)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t ~ A.Mean.Sum+ Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0088229 | 0.0026086 | -3.3822997 | 0.0007226 |
| A.Mean.Sum | -0.0011466 | 0.0004888 | -2.3459372 | 0.0190056 |
| Lg10WF | 0.0000506 | 0.0006606 | 0.0765332 | 0.9389970 |
ggplot(cues.ts, aes(x = A.Mean.Sum, y = t)) +
geom_smooth(method = "lm") +
theme_bw()
Look at correlation between norms
correlate(cues.ts %>% select(-1:-2)) %>% # drop cue and t, keep t_abs + norms
shave() %>%
fashion() %>%
kable()
| rowname | t_abs | Lg10WF | V.Mean.Sum | A.Mean.Sum | D.Mean.Sum |
|---|---|---|---|---|---|
| t_abs | |||||
| Lg10WF | -.05 | ||||
| V.Mean.Sum | -.03 | .15 | |||
| A.Mean.Sum | -.00 | .04 | -.17 | ||
| D.Mean.Sum | -.01 | .13 | .72 | -.17 |
kable(tidy(lm(t_abs~ V.Mean.Sum + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0389883 | 0.0014832 | 26.286673 | 0.0000000 |
| V.Mean.Sum | -0.0004252 | 0.0002238 | -1.900076 | 0.0574622 |
| Lg10WF | -0.0015951 | 0.0004304 | -3.706353 | 0.0002118 |
ggplot(cues.ts, aes(x = V.Mean.Sum, y = t_abs)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t_abs~ A.Mean.Sum + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0369054 | 0.0016820 | 21.9420223 | 0.0000000 |
| A.Mean.Sum | 0.0000388 | 0.0003152 | 0.1231376 | 0.9020015 |
| Lg10WF | -0.0017195 | 0.0004260 | -4.0367243 | 0.0000548 |
ggplot(cues.ts, aes(x = A.Mean.Sum, y = t_abs)) +
geom_smooth(method = "lm") +
theme_bw()
Valence, arousal, and dominance norms for 1,030 lemmas total (ANEW).
SUMMARY: No effects across valence, arousal, and dominance.
anew = read.csv("data/anew.csv") %>%
select(Description, Valence.Mean, Arousal.Mean, Dominance.Mean)
cues.ts = t.scores.full %>%
left_join(anew, by=c("cue"="Description")) %>%
select(cue, t, t_abs, Lg10WF,Valence.Mean,Arousal.Mean, Dominance.Mean) %>%
distinct() %>%
filter(!is.na(Lg10WF) & !is.na(Arousal.Mean))
There are 897 cues with both frequency and sentiment norms.
Look at correlation between norms
correlate(cues.ts %>% select(-1, -3)) %>% # drop cue and t_abs, keep t + norms
shave() %>%
fashion() %>%
kable()
| rowname | t | Lg10WF | Valence.Mean | Arousal.Mean | Dominance.Mean |
|---|---|---|---|---|---|
| t | |||||
| Lg10WF | .08 | ||||
| Valence.Mean | .03 | .20 | |||
| Arousal.Mean | -.04 | .10 | -.05 | ||
| Dominance.Mean | .02 | .15 | .84 | .07 |
kable(tidy(lm(t~ Valence.Mean + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0279249 | 0.0055489 | -5.0325246 | 0.0000006 |
| Valence.Mean | 0.0003473 | 0.0006191 | 0.5610354 | 0.5749140 |
| Lg10WF | 0.0037419 | 0.0017382 | 2.1527998 | 0.0316014 |
ggplot(cues.ts, aes(x = Valence.Mean, y = t)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t ~ Arousal.Mean+ Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0187853 | 0.0073935 | -2.540792 | 0.0112282 |
| Arousal.Mean | -0.0016813 | 0.0011457 | -1.467482 | 0.1425968 |
| Lg10WF | 0.0041850 | 0.0017096 | 2.447906 | 0.0145602 |
ggplot(cues.ts, aes(x = Arousal.Mean, y = t)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t ~ Dominance.Mean+ Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0281081 | 0.0073504 | -3.8240358 | 0.0001404 |
| Dominance.Mean | 0.0003243 | 0.0012009 | 0.2700417 | 0.7871905 |
| Lg10WF | 0.0038670 | 0.0017229 | 2.2445057 | 0.0250438 |
ggplot(cues.ts, aes(x = Dominance.Mean, y = t)) +
geom_smooth(method = "lm") +
theme_bw()
Look at correlation between norms
correlate(cues.ts %>% select(-1:-2)) %>% # drop cue and t, keep t_abs + norms
shave() %>%
fashion() %>%
kable()
| rowname | t_abs | Lg10WF | Valence.Mean | Arousal.Mean | Dominance.Mean |
|---|---|---|---|---|---|
| t_abs | |||||
| Lg10WF | -.08 | ||||
| Valence.Mean | -.03 | .20 | |||
| Arousal.Mean | .01 | .10 | -.05 | ||
| Dominance.Mean | .00 | .15 | .84 | .07 |
kable(tidy(lm(t_abs~ Valence.Mean + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0398593 | 0.0036508 | 10.9179275 | 0.0000000 |
| Valence.Mean | -0.0001413 | 0.0004073 | -0.3468943 | 0.7287524 |
| Lg10WF | -0.0027116 | 0.0011436 | -2.3710972 | 0.0179465 |
ggplot(cues.ts, aes(x = Valence.Mean, y = t_abs)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t_abs ~ Arousal.Mean+ Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0374192 | 0.0048690 | 7.6852599 | 0.0000000 |
| Arousal.Mean | 0.0004116 | 0.0007545 | 0.5455187 | 0.5855329 |
| Lg10WF | -0.0028517 | 0.0011259 | -2.5328693 | 0.0114833 |
ggplot(cues.ts, aes(x = Arousal.Mean, y = t_abs)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t_abs ~ Dominance.Mean+ Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0374979 | 0.0048350 | 7.7555465 | 0.0000000 |
| Dominance.Mean | 0.0004189 | 0.0007900 | 0.5303054 | 0.5960319 |
| Lg10WF | -0.0028815 | 0.0011333 | -2.5425823 | 0.0111713 |
ggplot(cues.ts, aes(x = Dominance.Mean, y = t_abs)) +
geom_smooth(method = "lm") +
theme_bw()
SUMMARY: For t, there’s a marginal effect of happiness_average and happiness_rank: more divergence for less happy words. For abs-t, there’s a significant effect of happiness_rank: more divergence for less happy words.
dodds = read.table("data/dodds_happiness.txt", sep = "\t", header = T, na.strings = "--") %>%
select(word, happiness_rank, happiness_average, happiness_standard_deviation)
cues.ts = t.scores.full %>%
left_join(dodds, by=c("cue"="word")) %>%
select(cue, t, t_abs, Lg10WF,happiness_rank, happiness_average,
happiness_standard_deviation) %>%
distinct() %>%
filter(!is.na(Lg10WF) & !is.na(happiness_average))
# NOTE: figure out where the missing ones are going?
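One way to chase down the NOTE above (a sketch, assuming t.scores.full and dodds as defined here; missing.cues is a throwaway name):
missing.cues = t.scores.full %>%
anti_join(dodds, by = c("cue" = "word")) # cues with no match in the Dodds norms
nrow(missing.cues) # how many items the join loses
head(missing.cues$cue) # spot-check for case or tokenization mismatches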
There are 4769 cues with both frequency and sentiment norms.
Look at correlation between norms
correlate(cues.ts %>% select(-1, -3)) %>% # drop cue and t_abs, keep t + norms
shave() %>%
fashion() %>%
kable()
| rowname | t | Lg10WF | happiness_rank | happiness_average | happiness_standard_deviation |
|---|---|---|---|---|---|
| t | |||||
| Lg10WF | .08 | ||||
| happiness_rank | -.03 | -.02 | |||
| happiness_average | .03 | .03 | -.95 | ||
| happiness_standard_deviation | -.02 | -.17 | .12 | -.12 |
kable(tidy(lm(t ~ happiness_rank+ Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0266206 | 0.0024783 | -10.741443 | 0.0000000 |
| happiness_rank | -0.0000003 | 0.0000002 | -1.715569 | 0.0863060 |
| Lg10WF | 0.0041240 | 0.0007649 | 5.391494 | 0.0000001 |
ggplot(cues.ts, aes(x = happiness_rank, y = t)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t~ happiness_average + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0323810 | 0.0032681 | -9.908226 | 0.0000000 |
| happiness_average | 0.0008116 | 0.0004340 | 1.870292 | 0.0615045 |
| Lg10WF | 0.0041097 | 0.0007650 | 5.371850 | 0.0000001 |
ggplot(cues.ts, aes(x = happiness_average, y = t)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t ~ happiness_standard_deviation+ Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0262903 | 0.0038760 | -6.7827808 | 0.0000000 |
| happiness_standard_deviation | -0.0011065 | 0.0019087 | -0.5797419 | 0.5621161 |
| Lg10WF | 0.0040774 | 0.0007765 | 5.2512299 | 0.0000002 |
ggplot(cues.ts, aes(x = happiness_standard_deviation, y = t)) +
geom_smooth(method = "lm") +
theme_bw()
Look at correlation between norms
correlate(cues.ts %>% select(-1, -2)) %>% # drop cue and t, keep t_abs + norms
shave() %>%
fashion() %>%
kable()
| rowname | t_abs | Lg10WF | happiness_rank | happiness_average | happiness_standard_deviation |
|---|---|---|---|---|---|
| t_abs | |||||
| Lg10WF | -.03 | ||||
| happiness_rank | .03 | -.02 | |||
| happiness_average | -.02 | .03 | -.95 | ||
| happiness_standard_deviation | -.02 | -.17 | .12 | -.12 |
kable(tidy(lm(t_abs~ happiness_rank + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0345614 | 0.0016216 | 21.312744 | 0.0000000 |
| happiness_rank | 0.0000002 | 0.0000001 | 1.997434 | 0.0458348 |
| Lg10WF | -0.0009465 | 0.0005005 | -1.891080 | 0.0586743 |
ggplot(cues.ts, aes(x = happiness_rank, y = t_abs)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t_abs~ happiness_average + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0381736 | 0.0021388 | 17.847989 | 0.0000000 |
| happiness_average | -0.0004716 | 0.0002840 | -1.660424 | 0.0968950 |
| Lg10WF | -0.0009437 | 0.0005007 | -1.884861 | 0.0595094 |
ggplot(cues.ts, aes(x = happiness_average, y = t_abs)) +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t_abs ~ happiness_standard_deviation+ Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0391986 | 0.0025358 | 15.458222 | 0.0000000 |
| happiness_standard_deviation | -0.0021671 | 0.0012487 | -1.735465 | 0.0827232 |
| Lg10WF | -0.0011214 | 0.0005080 | -2.207619 | 0.0273183 |
ggplot(cues.ts, aes(x = happiness_standard_deviation, y = t_abs)) +
geom_smooth(method = "lm") +
theme_bw()
SUMMARY: For both t and abs-t, there’s the predicted effect of concreteness: words that are less concrete show more divergence.
cues.ts = t.scores.full %>%
select(cue, t, t_abs, Lg10WF, Conc.M) %>%
distinct() %>%
filter(!is.na(Lg10WF) & !is.na(Conc.M))
There are 8526 cues with both frequency and concreteness norms.
kable(tidy(lm(t~Conc.M + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0170579 | 0.0019681 | -8.6673617 | 0.0000000 |
| Conc.M | 0.0011523 | 0.0003977 | 2.8974898 | 0.0037712 |
| Lg10WF | -0.0002412 | 0.0005261 | -0.4583734 | 0.6466959 |
ggplot(cues.ts, aes(x = Conc.M, y = t)) +
#geom_point() +
geom_smooth(method = "lm") +
theme_bw()
kable(tidy(lm(t_abs~ Conc.M + Lg10WF, data = cues.ts)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0416727 | 0.0012614 | 33.036485 | 0.00e+00 |
| Conc.M | -0.0013854 | 0.0002549 | -5.435274 | 1.00e-07 |
| Lg10WF | -0.0014881 | 0.0003372 | -4.413100 | 1.03e-05 |
ggplot(cues.ts, aes(x = Conc.M, y = t_abs)) +
#geom_point() +
geom_smooth(method = "lm") +
theme_bw()