Do L1 and L2 speakers differ in their semantic spaces for the same English words? Here, we address this question by asking whether properties of a cue predict differences in the associates that L1 and L2 speakers produce. In particular, we ask whether the sentiment and concreteness of the cue predict divergence, controlling for the spoken frequency of the word.

To measure divergence, we compare, for each cue, how often each associate was produced by L1 versus L2 speakers, relative to each group’s base rate of producing that associate (details below).

SUMMARY:

  • High-level summary: results depend on whether you use the signed t or its absolute value (abs-t).
  • For t, more divergence for words that are high arousal, low happiness, or low concreteness.
  • For abs-t, less divergence for high-frequency words, and more divergence for words that are low valence (marginal), low happiness, or low concreteness.

Read in data.

library(tidyverse)  # dplyr, tidyr, ggplot2
library(corrr)      # correlate(), shave(), fashion()
library(broom)      # tidy()
library(knitr)      # kable()

d.clean = read.csv("../data/dclean.csv")

Get t-score

This is a measure of divergence for a cue between L1 and L2.

From Manning and Schuetze (1999; pg. 166): Discover co-occurrences that best distinguish between words within the same corpus. This is slightly different from our goal, because here we want to compare across corpora, and thus our N’s differ. Below, I modify their formula to account for this (analogous to Welch’s two-sample t-test).

Manning and Schuetze: \[ t = \frac{\frac{C(v_1w)}{N}-\frac{C(v_2w)}{N}} {\sqrt{\frac{C(v_1w) + C(v_2w)}{N^2}}} \]

Welch’s version: \[ t = \frac{\frac{C(v_1w)}{n_1}-\frac{C(v_2w)}{n_2}} {\sqrt{\frac{C(v_1w)}{n_1} + \frac{C(v_2w)}{n_2}}} \]

So suppose we want to figure out how divergent responses are for the cue “zucchini” between native and non-native speakers. For each associate that participants produced for this cue, we compare its relative frequency between native and non-native speakers. For example, consider “vegetable” as an associate of “zucchini”: native speakers produced it in response to “zucchini” 37 times, and non-native speakers 5 times. We also need the base rates of producing “vegetable” as an associate at all: native speakers produced it 161 times, and non-native speakers 51 times. So, we have:

Cue = “zucchini” (z)

Associate = “vegetable” (v)

\[ C(zv)_{native}=37\\ C(zv)_{non-native}=5\\ C(v)_{native}= 161\\ C(v)_{non-native} = 51 \]

\[ t = \frac{\frac{37}{161}-\frac{5}{51}} {\sqrt{\frac{37}{161} + \frac{5}{51}}} = .23 \]

We then do this for every associate that was produced for “zucchini”, and take the mean t-score across associates. This is our estimate of how “divergent” responses are between native and non-native speakers for a given cue. Note that POSITIVE values indicate that associates were produced relatively more by native speakers on average, and NEGATIVE values indicate that associates were produced relatively more by non-native speakers on average.
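
As a sanity check, the worked example above is easy to reproduce in code. Here is a minimal sketch (t_score is a hypothetical helper, not part of the pipeline below):

# t-score for a single cue-associate pair (Welch-style formula above)
t_score <- function(c1, n1, c2, n2) {
  rf1 = c1 / n1  # relative frequency among native speakers
  rf2 = c2 / n2  # relative frequency among non-native speakers
  (rf1 - rf2) / sqrt(rf1 + rf2)
}

t_score(37, 161, 5, 51)  # ~0.23, matching the zucchini/vegetable example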

Calculate t-scores

# keep only bigrams (cue-associate pairs) produced by both groups,
# since t is only defined for shared pairs
native.bigrams = unique(filter(d.clean, native.lang == "L1"))
non.native.bigrams = unique(filter(d.clean, native.lang == "L2"))
shared.bigrams = intersect(native.bigrams$bigram,
                           non.native.bigrams$bigram)

d.common = d.clean %>%
  ungroup() %>%
  filter(bigram %in% shared.bigrams)

# get C(vw) in each language (bigram counts)
bigram.counts = d.common %>%
  group_by(native.lang, bigram, associate, cue) %>%
  summarize(n_bigram = n())  

# get C(v) in each language (associate counts)
associate.counts = d.common %>%
  group_by(native.lang, associate) %>%
  summarize(n_associate = n())

# get C(vw)/C(v) (relative frequency, rf) in each language
bigram.counts.rf = bigram.counts %>%
  left_join(associate.counts, by = c("native.lang", "associate")) %>%
  mutate(rf = n_bigram/n_associate)
  
t.scores <- bigram.counts.rf %>%
  ungroup() %>%
  as_tibble() %>%
  select(native.lang, rf, cue, associate) %>%
  spread(native.lang, rf) %>%
  filter(!is.na(L1), !is.na(L2)) %>%
  mutate(t = (L1 - L2)/sqrt(L1 + L2))
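
Note that spread() is superseded in current tidyr; assuming tidyr >= 1.0 is available, an equivalent reshape would be:

# same computation, with pivot_wider() in place of spread()
t.scores <- bigram.counts.rf %>%
  ungroup() %>%
  select(native.lang, rf, cue, associate) %>%
  pivot_wider(names_from = native.lang, values_from = rf) %>%
  filter(!is.na(L1), !is.na(L2)) %>%
  mutate(t = (L1 - L2) / sqrt(L1 + L2))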

t-scores

ggplot(t.scores, aes(x = t)) +
  geom_histogram() +
  theme_bw() +
  ggtitle("distribution of t-scores")

absolute t-scores

ggplot(t.scores, aes(x = abs(t))) +
  geom_histogram() +
  theme_bw() +
  ggtitle("distribution of absolute t-scores")


Predicting t with frequency

As a control, we use word frequency (Lg10WF) from SUBTLEX-US (http://subtlexus.lexique.org/). Note that we lose some items with sentiment norms below because we don’t have frequency for all of them; it’s worth thinking about a more complete frequency source here.
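
If we do swap in a different frequency source, the merge would look roughly like the sketch below. This is hypothetical: d.clean already carries Lg10WF, and the file path here is an assumption (the SUBTLEX-US file distributes Word and Lg10WF columns).

# hypothetical re-join against a raw SUBTLEX-US file
subtlex = read.csv("../data/SUBTLEXus.csv") %>%
  select(Word, Lg10WF)

d.clean = d.clean %>%
  select(-Lg10WF) %>%  # drop the existing frequency column
  left_join(subtlex, by = c("cue" = "Word"))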

SUMMARY: Less divergence for high-frequency items (the effect holds only for abs-t). But frequency is clearly not the whole story.

Join t-scores and cue characteristics

cues.chars = d.clean %>%
  group_by(cue) %>%
  slice(1) %>%
  select(Lg10WF, quant.sent, Conc.M, V.Mean.Sum, A.Mean.Sum, D.Mean.Sum)

t.scores.full = t.scores %>%
  group_by(cue) %>%
  summarize(t = mean(t),
            t_abs = mean(abs(t))) %>%
  left_join(cues.chars, by = "cue")

correlate(t.scores.full %>%  select(2:4)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname      t   t_abs   Lg10WF
--------  ----  ------  -------
t
t_abs     -.23
Lg10WF    -.02    -.05

cor.test(t.scores.full$t, t.scores.full$Lg10WF) %>% 
  tidy() %>%
  select(method, estimate, statistic, p.value) %>%
  kable()

method                                   estimate   statistic     p.value
------------------------------------  -----------  ----------  ----------
Pearson’s product-moment correlation   -0.0164701   -1.581519   0.1137938

cor.test(t.scores.full$t_abs, t.scores.full$Lg10WF) %>% 
  tidy() %>%
  select(method, estimate, statistic, p.value) %>%
  kable()

method                                   estimate   statistic    p.value
------------------------------------  -----------  ----------  ---------
Pearson’s product-moment correlation   -0.0466462   -4.483405    7.4e-06

t

ggplot(t.scores.full, aes(x = Lg10WF, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

ggplot(t.scores.full, aes(x = Lg10WF, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Predicting t with sentiment

Now let’s see whether we can predict the t-score from characteristics of the cue, controlling for frequency. We have three sources of sentiment norms: (1) Warriner et al. (2013; N = 13,915), (2) ANEW (N = 1,030), and (3) Dodds et al. (2011; N = 10,192).
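
Every model below has the same shape, lm(outcome ~ norm + Lg10WF). A small helper could make that pattern explicit; this is just a sketch (fit_norm is hypothetical; the chunks below spell each model out instead):

# hypothetical helper: regress a divergence measure on one cue-level norm,
# controlling for log word frequency
fit_norm <- function(data, outcome, norm) {
  f = reformulate(c(norm, "Lg10WF"), response = outcome)
  kable(tidy(lm(f, data = data)))
}

# e.g., fit_norm(cues.ts, "t", "V.Mean.Sum")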

Sentiment from Warriner et al. (2013)

Valence, arousal, and dominance norms for 13,915 lemmas (Warriner et al., 2013). This is the largest set of emotion norms we have. Scales: valence runs from unhappy to happy, and arousal from calm to excited.

SUMMARY: For t, there is more divergence for words that are high arousal (no effect of valence). For abs-t, there’s more divergence for words that have low valence (marginal).

cues.ts =  t.scores.full %>%
  select(cue, t, t_abs, Lg10WF,V.Mean.Sum,A.Mean.Sum, D.Mean.Sum) %>%
  distinct() %>%
  filter(!is.na(Lg10WF) & !is.na(V.Mean.Sum))
  • N = 10,049 cues total
  • N = 352 cues with Warriner sentiments but no frequency
  • N = 1840 cues with frequency but no Warriner sentiments
  • N = 477 cues with neither sentiments nor frequency

Leaving us with 7380 cues with both frequency and sentiment norms.
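
These counts can be verified directly; a quick sketch (t.scores.full has one row per cue, so sums of logical conditions give cue counts):

# breakdown of cues by availability of frequency and Warriner norms
t.scores.full %>%
  summarize(total     = n(),
            sent_only = sum(!is.na(V.Mean.Sum) & is.na(Lg10WF)),
            freq_only = sum(is.na(V.Mean.Sum) & !is.na(Lg10WF)),
            neither   = sum(is.na(V.Mean.Sum) & is.na(Lg10WF)),
            both      = sum(!is.na(V.Mean.Sum) & !is.na(Lg10WF)))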

t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t_abs)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname         t   Lg10WF   V.Mean.Sum   A.Mean.Sum   D.Mean.Sum
-----------  ----  -------  -----------  -----------  -----------
t
Lg10WF       -.00
V.Mean.Sum    .01      .15
A.Mean.Sum   -.03      .04         -.17
D.Mean.Sum    .01      .13          .72         -.17

Valence
kable(tidy(lm(t~ V.Mean.Sum  + Lg10WF, data = cues.ts)))

term            estimate   std.error   statistic     p.value
------------  ----------  ----------  ----------  ----------
(Intercept)   -0.0154603   0.0023015   -6.717520   0.0000000
V.Mean.Sum     0.0004325   0.0003472    1.245518   0.2129808
Lg10WF        -0.0001368   0.0006678   -0.204854   0.8376918

ggplot(cues.ts, aes(x = V.Mean.Sum, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Arousal
kable(tidy(lm(t ~ A.Mean.Sum+  Lg10WF, data = cues.ts)))

term            estimate   std.error    statistic     p.value
------------  ----------  ----------  -----------  ----------
(Intercept)   -0.0088229   0.0026086   -3.3822997   0.0007226
A.Mean.Sum    -0.0011466   0.0004888   -2.3459372   0.0190056
Lg10WF         0.0000506   0.0006606    0.0765332   0.9389970

ggplot(cues.ts, aes(x = A.Mean.Sum, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname       t_abs   Lg10WF   V.Mean.Sum   A.Mean.Sum   D.Mean.Sum
-----------  ------  -------  -----------  -----------  -----------
t_abs
Lg10WF         -.05
V.Mean.Sum     -.03      .15
A.Mean.Sum     -.00      .04         -.17
D.Mean.Sum     -.01      .13          .72         -.17

Valence
kable(tidy(lm(t_abs~ V.Mean.Sum  + Lg10WF, data = cues.ts)))

term            estimate   std.error    statistic     p.value
------------  ----------  ----------  -----------  ----------
(Intercept)    0.0389883   0.0014832    26.286673   0.0000000
V.Mean.Sum    -0.0004252   0.0002238    -1.900076   0.0574622
Lg10WF        -0.0015951   0.0004304    -3.706353   0.0002118

ggplot(cues.ts, aes(x = V.Mean.Sum, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Arousal
kable(tidy(lm(t_abs~ A.Mean.Sum  + Lg10WF, data = cues.ts)))

term            estimate   std.error     statistic     p.value
------------  ----------  ----------  ------------  ----------
(Intercept)    0.0369054   0.0016820    21.9420223   0.0000000
A.Mean.Sum     0.0000388   0.0003152     0.1231376   0.9020015
Lg10WF        -0.0017195   0.0004260    -4.0367243   0.0000548

ggplot(cues.ts, aes(x = A.Mean.Sum, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Sentiment from ANEW (1999)

Valence, arousal, and dominance norms for 1,030 lemmas.

SUMMARY: No effects across valence, arousal, and dominance.

anew = read.csv("data/anew.csv") %>%
  select(Description, Valence.Mean, Arousal.Mean, Dominance.Mean)

cues.ts =  t.scores.full %>%
  left_join(anew, by=c("cue"="Description")) %>%
  select(cue, t, t_abs, Lg10WF,Valence.Mean,Arousal.Mean, Dominance.Mean) %>%
  distinct() %>%
  filter(!is.na(Lg10WF) & !is.na(Arousal.Mean))

There are 897 cues with both frequency and sentiment norms.

t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t_abs)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname             t   Lg10WF   Valence.Mean   Arousal.Mean   Dominance.Mean
---------------  ----  -------  -------------  -------------  ---------------
t
Lg10WF            .08
Valence.Mean      .03      .20
Arousal.Mean     -.04      .10           -.05
Dominance.Mean    .02      .15            .84            .07

Valence
kable(tidy(lm(t~ Valence.Mean  + Lg10WF, data = cues.ts)))

term              estimate   std.error    statistic     p.value
--------------  ----------  ----------  -----------  ----------
(Intercept)     -0.0279249   0.0055489   -5.0325246   0.0000006
Valence.Mean     0.0003473   0.0006191    0.5610354   0.5749140
Lg10WF           0.0037419   0.0017382    2.1527998   0.0316014

ggplot(cues.ts, aes(x = Valence.Mean, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Arousal
kable(tidy(lm(t ~ Arousal.Mean+  Lg10WF, data = cues.ts)))

term              estimate   std.error   statistic     p.value
--------------  ----------  ----------  ----------  ----------
(Intercept)     -0.0187853   0.0073935   -2.540792   0.0112282
Arousal.Mean    -0.0016813   0.0011457   -1.467482   0.1425968
Lg10WF           0.0041850   0.0017096    2.447906   0.0145602

ggplot(cues.ts, aes(x = Arousal.Mean, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Dominance
kable(tidy(lm(t ~ Dominance.Mean+  Lg10WF, data = cues.ts)))

term                estimate   std.error    statistic     p.value
----------------  ----------  ----------  -----------  ----------
(Intercept)       -0.0281081   0.0073504   -3.8240358   0.0001404
Dominance.Mean     0.0003243   0.0012009    0.2700417   0.7871905
Lg10WF             0.0038670   0.0017229    2.2445057   0.0250438

ggplot(cues.ts, aes(x = Dominance.Mean, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname           t_abs   Lg10WF   Valence.Mean   Arousal.Mean   Dominance.Mean
---------------  ------  -------  -------------  -------------  ---------------
t_abs
Lg10WF             -.08
Valence.Mean       -.03      .20
Arousal.Mean        .01      .10           -.05
Dominance.Mean      .00      .15            .84            .07

Valence
kable(tidy(lm(t_abs~ Valence.Mean  + Lg10WF, data = cues.ts)))

term              estimate   std.error     statistic     p.value
--------------  ----------  ----------  ------------  ----------
(Intercept)      0.0398593   0.0036508    10.9179275   0.0000000
Valence.Mean    -0.0001413   0.0004073    -0.3468943   0.7287524
Lg10WF          -0.0027116   0.0011436    -2.3710972   0.0179465

ggplot(cues.ts, aes(x = Valence.Mean, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Arousal
kable(tidy(lm(t_abs ~ Arousal.Mean+  Lg10WF, data = cues.ts)))

term              estimate   std.error    statistic     p.value
--------------  ----------  ----------  -----------  ----------
(Intercept)      0.0374192   0.0048690    7.6852599   0.0000000
Arousal.Mean     0.0004116   0.0007545    0.5455187   0.5855329
Lg10WF          -0.0028517   0.0011259   -2.5328693   0.0114833

ggplot(cues.ts, aes(x = Arousal.Mean, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Dominance
kable(tidy(lm(t_abs ~ Dominance.Mean+  Lg10WF, data = cues.ts)))

term                estimate   std.error    statistic     p.value
----------------  ----------  ----------  -----------  ----------
(Intercept)        0.0374979   0.0048350    7.7555465   0.0000000
Dominance.Mean     0.0004189   0.0007900    0.5303054   0.5960319
Lg10WF            -0.0028815   0.0011333   -2.5425823   0.0111713

ggplot(cues.ts, aes(x = Dominance.Mean, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Sentiment from Dodds et al. (2011)

SUMMARY: For t, there are marginal effects of happiness_average and happiness_rank: more divergence for less happy words. For abs-t, there’s a significant effect of happiness_rank: more divergence for less happy words.

dodds = read.table("data/dodds_happiness.txt", sep = "\t", header = T, na.strings = "--") %>%
  select(word, happiness_rank, happiness_average, happiness_standard_deviation)

cues.ts =  t.scores.full %>%
  left_join(dodds, by=c("cue"="word")) %>%
  select(cue, t, t_abs, Lg10WF,happiness_rank, happiness_average, 
         happiness_standard_deviation) %>%
  distinct() %>%
  filter(!is.na(Lg10WF) & !is.na(happiness_average))
# NOTE: figure out where the missing ones are going?
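
One way to chase down the note above: list cues that match no word in the Dodds norms (a sketch using anti_join):

# cues in our data with no entry in the Dodds norms
missing.dodds = t.scores.full %>%
  distinct(cue) %>%
  anti_join(dodds, by = c("cue" = "word"))
nrow(missing.dodds)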

There are 4769 cues with both frequency and sentiment norms.

t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t_abs)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname                          t   Lg10WF   happiness_rank   happiness_average   happiness_standard_deviation
-----------------------------  ----  -------  ---------------  ------------------  -----------------------------
t
Lg10WF                          .08
happiness_rank                 -.03     -.02
happiness_average               .03      .03             -.95
happiness_standard_deviation   -.02     -.17              .12                -.12

happiness_rank
kable(tidy(lm(t ~ happiness_rank+  Lg10WF, data = cues.ts)))

term                estimate   std.error    statistic     p.value
----------------  ----------  ----------  -----------  ----------
(Intercept)       -0.0266206   0.0024783   -10.741443   0.0000000
happiness_rank    -0.0000003   0.0000002    -1.715569   0.0863060
Lg10WF             0.0041240   0.0007649     5.391494   0.0000001

ggplot(cues.ts, aes(x = happiness_rank, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

happiness_average
kable(tidy(lm(t~ happiness_average  + Lg10WF, data = cues.ts)))

term                   estimate   std.error   statistic     p.value
-------------------  ----------  ----------  ----------  ----------
(Intercept)          -0.0323810   0.0032681   -9.908226   0.0000000
happiness_average     0.0008116   0.0004340    1.870292   0.0615045
Lg10WF                0.0041097   0.0007650    5.371850   0.0000001

ggplot(cues.ts, aes(x = happiness_average, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

happiness_standard_deviation
kable(tidy(lm(t ~ happiness_standard_deviation+  Lg10WF, data = cues.ts)))

term                              estimate   std.error    statistic     p.value
------------------------------  ----------  ----------  -----------  ----------
(Intercept)                     -0.0262903   0.0038760   -6.7827808   0.0000000
happiness_standard_deviation    -0.0011065   0.0019087   -0.5797419   0.5621161
Lg10WF                           0.0040774   0.0007765    5.2512299   0.0000002

ggplot(cues.ts, aes(x = happiness_standard_deviation, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname                        t_abs   Lg10WF   happiness_rank   happiness_average   happiness_standard_deviation
-----------------------------  -----  -------  ---------------  ------------------  -----------------------------
t_abs
Lg10WF                          -.03
happiness_rank                   .03     -.02
happiness_average               -.02      .03             -.95
happiness_standard_deviation    -.02     -.17              .12                -.12

happiness_rank
kable(tidy(lm(t_abs~ happiness_rank  + Lg10WF, data = cues.ts)))

term                estimate   std.error   statistic     p.value
----------------  ----------  ----------  ----------  ----------
(Intercept)        0.0345614   0.0016216   21.312744   0.0000000
happiness_rank     0.0000002   0.0000001    1.997434   0.0458348
Lg10WF            -0.0009465   0.0005005   -1.891080   0.0586743

ggplot(cues.ts, aes(x = happiness_rank, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

happiness_average
kable(tidy(lm(t_abs~ happiness_average  + Lg10WF, data = cues.ts)))

term                   estimate   std.error   statistic     p.value
-------------------  ----------  ----------  ----------  ----------
(Intercept)           0.0381736   0.0021388   17.847989   0.0000000
happiness_average    -0.0004716   0.0002840   -1.660424   0.0968950
Lg10WF               -0.0009437   0.0005007   -1.884861   0.0595094

ggplot(cues.ts, aes(x = happiness_average, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

happiness_standard_deviation
kable(tidy(lm(t_abs ~ happiness_standard_deviation+  Lg10WF, data = cues.ts)))

term                              estimate   std.error   statistic     p.value
------------------------------  ----------  ----------  ----------  ----------
(Intercept)                      0.0391986   0.0025358   15.458222   0.0000000
happiness_standard_deviation    -0.0021671   0.0012487   -1.735465   0.0827232
Lg10WF                          -0.0011214   0.0005080   -2.207619   0.0273183

ggplot(cues.ts, aes(x = happiness_standard_deviation, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Predicting t with concreteness

SUMMARY: For both t and abs-t, there’s the predicted effect of concreteness: words that are less concrete show more divergence.

cues.ts =  t.scores.full %>%
  select(cue, t, t_abs, Lg10WF, Conc.M) %>%
  distinct() %>%
  filter(!is.na(Lg10WF) & !is.na(Conc.M))

There are 8526 cues with both frequency and concreteness norms.
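
With the hypothetical fit_norm() helper sketched earlier, the two models in this section would reduce to:

# equivalent to the explicit lm() calls below
fit_norm(cues.ts, "t", "Conc.M")
fit_norm(cues.ts, "t_abs", "Conc.M")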

t

kable(tidy(lm(t~Conc.M  + Lg10WF, data = cues.ts)))

term            estimate   std.error    statistic     p.value
------------  ----------  ----------  -----------  ----------
(Intercept)   -0.0170579   0.0019681   -8.6673617   0.0000000
Conc.M         0.0011523   0.0003977    2.8974898   0.0037712
Lg10WF        -0.0002412   0.0005261   -0.4583734   0.6466959

ggplot(cues.ts, aes(x = Conc.M, y = t)) +
  #geom_point() +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

kable(tidy(lm(t_abs~ Conc.M  + Lg10WF, data = cues.ts)))

term            estimate   std.error   statistic    p.value
------------  ----------  ----------  ----------  ---------
(Intercept)    0.0416727   0.0012614   33.036485   0.00e+00
Conc.M        -0.0013854   0.0002549   -5.435274   1.00e-07
Lg10WF        -0.0014881   0.0003372   -4.413100   1.03e-05

ggplot(cues.ts, aes(x = Conc.M, y = t_abs)) +
  #geom_point() +
  geom_smooth(method  = "lm") +
  theme_bw()