Do L1 and L2 speakers differ in their semantic spaces for the same English words? Here, we address this question by asking whether properties of a cue predict differences in the associates that L1 and L2 speakers produce. In particular, we ask whether the sentiment and concreteness of the cue predict divergence, controlling for the spoken frequency of the word.

To measure divergence, we compare, for each cue, how often each associate was produced by L1 versus L2 speakers, relative to each group’s base rate of producing that associate (details below).

SUMMARY:

  • High-level summary: results depend on whether you use the signed t or its absolute value (abs-t).
  • For t, more divergence for words that are high arousal, low happiness, or low concreteness.
  • For abs-t, less divergence for high-frequency words, and more divergence for words that are low valence (marginal), low happiness, or low concreteness.

Read in data.

library(tidyverse)  # dplyr, tidyr, ggplot2
library(corrr)      # correlate(), shave(), fashion()
library(broom)      # tidy()
library(knitr)      # kable()

d.clean = read.csv("../data/dclean.csv")

Get t-score

This is a measure of divergence for a cue between L1 and L2.

From Manning and Schuetze (1999; pg. 166): Discover co-occurrences that best distinguish between words within the same corpus. This is slightly different from our goal, because here we want to compare across corpora, and thus our N’s differ. Below, I modify their formula to account for this (analogous to Welch’s two-sample t-test).

Manning and Schuetze: \[ t = \frac{\frac{C(v_1w)}{N}-\frac{C(v_2w)}{N}} {\sqrt{\frac{C(v_1w) + C(v_2w)}{N^2}}} \]

Welch’s version: \[ t = \frac{\frac{C(v_1w)}{n_1}-\frac{C(v_2w)}{n_2}} {\sqrt{\frac{C(v_1w)}{n_1} + \frac{C(v_2w)}{n_2}}} \]

So suppose we want to figure out how divergent responses are for the cue “zucchini” between native and non-native speakers. For each associate that participants produced for this cue, we compare its relative frequency between native and non-native speakers. For example, consider “vegetable” as an associate of “zucchini”: native speakers produced it in response to “zucchini” 37 times, and non-native speakers 5 times. We also need the base rates of producing “vegetable” as an associate at all: native speakers produced it 161 times, and non-native speakers 51 times. So, we have:

Cue = “zucchini” (z)

Associate = “vegetable” (v)

\[ C(zv)_{native}=37\\ C(zv)_{non-native}=5\\ C(v)_{native}= 161\\ C(v)_{non-native} = 51 \]

\[ t = \frac{\frac{37}{161}-\frac{5}{51}} {\sqrt{\frac{37}{161} + \frac{5}{51}}} = .23 \]

We then do this for every associate that was produced for “zucchini”, and take the mean t-score across associates. This is our estimate of how “divergent” responses are between native and non-native speakers for a given cue. Note that POSITIVE values indicate that associates were produced relatively more by native speakers on average, and NEGATIVE values indicate that associates were produced relatively more by non-native speakers on average.
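
As a sanity check, the worked example above is easy to reproduce in code. Here is a minimal sketch (t_score is a hypothetical helper, not part of the pipeline below):

# t-score for a single cue-associate pair (Welch-style formula above)
t_score <- function(c1, n1, c2, n2) {
  rf1 = c1 / n1  # relative frequency among native speakers
  rf2 = c2 / n2  # relative frequency among non-native speakers
  (rf1 - rf2) / sqrt(rf1 + rf2)
}

t_score(37, 161, 5, 51)  # ~0.23, matching the zucchini/vegetable example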

Calculate t-scores

# keep only bigrams (cue-associate pairs) produced by both groups,
# since t is only defined for shared pairs
native.bigrams = unique(filter(d.clean, native.lang == "L1"))
non.native.bigrams = unique(filter(d.clean, native.lang == "L2"))
shared.bigrams = intersect(native.bigrams$bigram,
                           non.native.bigrams$bigram)

d.common = d.clean %>%
  ungroup() %>%
  filter(bigram %in% shared.bigrams)

# get C(vw) in each language (bigram counts)
bigram.counts = d.common %>%
  group_by(native.lang, bigram, associate, cue) %>%
  summarize(n_bigram = n())  

# get C(v) in each language (associate counts)
associate.counts = d.common %>%
  group_by(native.lang, associate) %>%
  summarize(n_associate = n())

# get C(vw)/C(v) (relative frequency, rf) in each language
bigram.counts.rf = bigram.counts %>%
  left_join(associate.counts, by = c("native.lang", "associate")) %>%
  mutate(rf = n_bigram/n_associate)
  
t.scores <- bigram.counts.rf %>%
  ungroup() %>%
  as_tibble() %>%
  select(native.lang, rf, cue, associate) %>%
  spread(native.lang, rf) %>%
  filter(!is.na(L1), !is.na(L2)) %>%
  mutate(t = (L1 - L2)/sqrt(L1 + L2))
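
Note that spread() is superseded in current tidyr; assuming tidyr >= 1.0 is available, an equivalent reshape would be:

# same computation, with pivot_wider() in place of spread()
t.scores <- bigram.counts.rf %>%
  ungroup() %>%
  select(native.lang, rf, cue, associate) %>%
  pivot_wider(names_from = native.lang, values_from = rf) %>%
  filter(!is.na(L1), !is.na(L2)) %>%
  mutate(t = (L1 - L2) / sqrt(L1 + L2))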

t-scores

ggplot(t.scores, aes(x = t)) +
  geom_histogram() +
  theme_bw() +
  ggtitle("distribution of t-scores")

absolute t-scores

ggplot(t.scores, aes(x = abs(t))) +
  geom_histogram() +
  theme_bw() +
  ggtitle("distribution of absolute t-scores")


Predicting t with frequency

As a control, we use word frequency (Lg10WF) from SUBTLEX-US (http://subtlexus.lexique.org/). Note that we lose some items with sentiment norms below because we don’t have frequency for all of them; it’s worth thinking about a more complete frequency source here.
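
If we do swap in a different frequency source, the merge would look roughly like the sketch below. This is hypothetical: d.clean already carries Lg10WF, and the file path here is an assumption (the SUBTLEX-US file distributes Word and Lg10WF columns).

# hypothetical re-join against a raw SUBTLEX-US file
subtlex = read.csv("../data/SUBTLEXus.csv") %>%
  select(Word, Lg10WF)

d.clean = d.clean %>%
  select(-Lg10WF) %>%  # drop the existing frequency column
  left_join(subtlex, by = c("cue" = "Word"))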

SUMMARY: Less divergence for high-frequency items (the effect holds only for abs-t). But frequency is clearly not the whole story.

Join t-scores and cue characteristics

cues.chars = d.clean %>%
  group_by(cue) %>%
  slice(1) %>%
  select(Lg10WF, quant.sent, Conc.M, V.Mean.Sum, A.Mean.Sum, D.Mean.Sum)

t.scores.full = t.scores %>%
  group_by(cue) %>%
  summarize(t = mean(t),
            t_abs = mean(abs(t))) %>%
  left_join(cues.chars, by = "cue")

correlate(t.scores.full %>%  select(2:4)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname      t   t_abs   Lg10WF
--------  ----  ------  -------
t
t_abs     -.23
Lg10WF    -.02    -.05

cor.test(t.scores.full$t, t.scores.full$Lg10WF) %>% 
  tidy() %>%
  select(method, estimate, statistic, p.value) %>%
  kable()

method                                   estimate   statistic     p.value
------------------------------------  -----------  ----------  ----------
Pearson’s product-moment correlation   -0.0164701   -1.581519   0.1137938

cor.test(t.scores.full$t_abs, t.scores.full$Lg10WF) %>% 
  tidy() %>%
  select(method, estimate, statistic, p.value) %>%
  kable()

method                                   estimate   statistic    p.value
------------------------------------  -----------  ----------  ---------
Pearson’s product-moment correlation   -0.0466462   -4.483405    7.4e-06

t

ggplot(t.scores.full, aes(x = Lg10WF, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

ggplot(t.scores.full, aes(x = Lg10WF, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Predicting t with sentiment

Now let’s see whether we can predict the t-score from characteristics of the cue, controlling for frequency. We have three sources of sentiment norms: (1) Warriner et al. (2013; N = 13,915), (2) ANEW (N = 1,030), and (3) Dodds et al. (2011; N = 10,192).
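
Every model below has the same shape, lm(outcome ~ norm + Lg10WF). A small helper could make that pattern explicit; this is just a sketch (fit_norm is hypothetical; the chunks below spell each model out instead):

# hypothetical helper: regress a divergence measure on one cue-level norm,
# controlling for log word frequency
fit_norm <- function(data, outcome, norm) {
  f = reformulate(c(norm, "Lg10WF"), response = outcome)
  kable(tidy(lm(f, data = data)))
}

# e.g., fit_norm(cues.ts, "t", "V.Mean.Sum")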

Sentiment from Warriner et al. (2013)

Valence, arousal, and dominance norms for 13,915 lemmas (Warriner et al., 2013). This is the largest set of emotion norms we have. Scales: valence runs from unhappy to happy, and arousal from calm to excited.

SUMMARY: For t, there is more divergence for words that are high arousal (no effect of valence). For abs-t, there’s more divergence for words that have low valence (marginal).

cues.ts =  t.scores.full %>%
  select(cue, t, t_abs, Lg10WF,V.Mean.Sum,A.Mean.Sum, D.Mean.Sum) %>%
  distinct() %>%
  filter(!is.na(Lg10WF) & !is.na(V.Mean.Sum))
  • N = 10,049 cues total
  • N = 352 cues with Warriner sentiments but no frequency
  • N = 1840 cues with frequency but no Warriner sentiments
  • N = 477 cues with neither sentiments nor frequency

Leaving us with 7380 cues with both frequency and sentiment norms.
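
These counts can be verified directly; a quick sketch (t.scores.full has one row per cue, so sums of logical conditions give cue counts):

# breakdown of cues by availability of frequency and Warriner norms
t.scores.full %>%
  summarize(total     = n(),
            sent_only = sum(!is.na(V.Mean.Sum) & is.na(Lg10WF)),
            freq_only = sum(is.na(V.Mean.Sum) & !is.na(Lg10WF)),
            neither   = sum(is.na(V.Mean.Sum) & is.na(Lg10WF)),
            both      = sum(!is.na(V.Mean.Sum) & !is.na(Lg10WF)))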

t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t_abs)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname         t   Lg10WF   V.Mean.Sum   A.Mean.Sum   D.Mean.Sum
-----------  ----  -------  -----------  -----------  -----------
t
Lg10WF       -.00
V.Mean.Sum    .01      .15
A.Mean.Sum   -.03      .04         -.17
D.Mean.Sum    .01      .13          .72         -.17

Valence
kable(tidy(lm(t~ V.Mean.Sum  + Lg10WF, data = cues.ts)))

term            estimate   std.error   statistic     p.value
------------  ----------  ----------  ----------  ----------
(Intercept)   -0.0154603   0.0023015   -6.717520   0.0000000
V.Mean.Sum     0.0004325   0.0003472    1.245518   0.2129808
Lg10WF        -0.0001368   0.0006678   -0.204854   0.8376918

ggplot(cues.ts, aes(x = V.Mean.Sum, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Arousal
kable(tidy(lm(t ~ A.Mean.Sum+  Lg10WF, data = cues.ts)))

term            estimate   std.error    statistic     p.value
------------  ----------  ----------  -----------  ----------
(Intercept)   -0.0088229   0.0026086   -3.3822997   0.0007226
A.Mean.Sum    -0.0011466   0.0004888   -2.3459372   0.0190056
Lg10WF         0.0000506   0.0006606    0.0765332   0.9389970

ggplot(cues.ts, aes(x = A.Mean.Sum, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname       t_abs   Lg10WF   V.Mean.Sum   A.Mean.Sum   D.Mean.Sum
-----------  ------  -------  -----------  -----------  -----------
t_abs
Lg10WF         -.05
V.Mean.Sum     -.03      .15
A.Mean.Sum     -.00      .04         -.17
D.Mean.Sum     -.01      .13          .72         -.17

Valence
kable(tidy(lm(t_abs~ V.Mean.Sum  + Lg10WF, data = cues.ts)))

term            estimate   std.error    statistic     p.value
------------  ----------  ----------  -----------  ----------
(Intercept)    0.0389883   0.0014832    26.286673   0.0000000
V.Mean.Sum    -0.0004252   0.0002238    -1.900076   0.0574622
Lg10WF        -0.0015951   0.0004304    -3.706353   0.0002118

ggplot(cues.ts, aes(x = V.Mean.Sum, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Arousal
kable(tidy(lm(t_abs~ A.Mean.Sum  + Lg10WF, data = cues.ts)))

term            estimate   std.error     statistic     p.value
------------  ----------  ----------  ------------  ----------
(Intercept)    0.0369054   0.0016820    21.9420223   0.0000000
A.Mean.Sum     0.0000388   0.0003152     0.1231376   0.9020015
Lg10WF        -0.0017195   0.0004260    -4.0367243   0.0000548

ggplot(cues.ts, aes(x = A.Mean.Sum, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Sentiment from ANEW (1999)

Valence, arousal, and dominance norms for 1,030 lemmas.

SUMMARY: No effects across valence, arousal, and dominance.

anew = read.csv("data/anew.csv") %>%
  select(Description, Valence.Mean, Arousal.Mean, Dominance.Mean)

cues.ts =  t.scores.full %>%
  left_join(anew, by=c("cue"="Description")) %>%
  select(cue, t, t_abs, Lg10WF,Valence.Mean,Arousal.Mean, Dominance.Mean) %>%
  distinct() %>%
  filter(!is.na(Lg10WF) & !is.na(Arousal.Mean))

There are 897 cues with both frequency and sentiment norms.

t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t_abs)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname             t   Lg10WF   Valence.Mean   Arousal.Mean   Dominance.Mean
---------------  ----  -------  -------------  -------------  ---------------
t
Lg10WF            .08
Valence.Mean      .03      .20
Arousal.Mean     -.04      .10           -.05
Dominance.Mean    .02      .15            .84            .07

Valence
kable(tidy(lm(t~ Valence.Mean  + Lg10WF, data = cues.ts)))

term              estimate   std.error    statistic     p.value
--------------  ----------  ----------  -----------  ----------
(Intercept)     -0.0279249   0.0055489   -5.0325246   0.0000006
Valence.Mean     0.0003473   0.0006191    0.5610354   0.5749140
Lg10WF           0.0037419   0.0017382    2.1527998   0.0316014

ggplot(cues.ts, aes(x = Valence.Mean, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Arousal
kable(tidy(lm(t ~ Arousal.Mean+  Lg10WF, data = cues.ts)))

term              estimate   std.error   statistic     p.value
--------------  ----------  ----------  ----------  ----------
(Intercept)     -0.0187853   0.0073935   -2.540792   0.0112282
Arousal.Mean    -0.0016813   0.0011457   -1.467482   0.1425968
Lg10WF           0.0041850   0.0017096    2.447906   0.0145602

ggplot(cues.ts, aes(x = Arousal.Mean, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Dominance
kable(tidy(lm(t ~ Dominance.Mean+  Lg10WF, data = cues.ts)))

term                estimate   std.error    statistic     p.value
----------------  ----------  ----------  -----------  ----------
(Intercept)       -0.0281081   0.0073504   -3.8240358   0.0001404
Dominance.Mean     0.0003243   0.0012009    0.2700417   0.7871905
Lg10WF             0.0038670   0.0017229    2.2445057   0.0250438

ggplot(cues.ts, aes(x = Dominance.Mean, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname           t_abs   Lg10WF   Valence.Mean   Arousal.Mean   Dominance.Mean
---------------  ------  -------  -------------  -------------  ---------------
t_abs
Lg10WF             -.08
Valence.Mean       -.03      .20
Arousal.Mean        .01      .10           -.05
Dominance.Mean      .00      .15            .84            .07

Valence
kable(tidy(lm(t_abs~ Valence.Mean  + Lg10WF, data = cues.ts)))

term              estimate   std.error     statistic     p.value
--------------  ----------  ----------  ------------  ----------
(Intercept)      0.0398593   0.0036508    10.9179275   0.0000000
Valence.Mean    -0.0001413   0.0004073    -0.3468943   0.7287524
Lg10WF          -0.0027116   0.0011436    -2.3710972   0.0179465

ggplot(cues.ts, aes(x = Valence.Mean, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Arousal
kable(tidy(lm(t_abs ~ Arousal.Mean+  Lg10WF, data = cues.ts)))

term              estimate   std.error    statistic     p.value
--------------  ----------  ----------  -----------  ----------
(Intercept)      0.0374192   0.0048690    7.6852599   0.0000000
Arousal.Mean     0.0004116   0.0007545    0.5455187   0.5855329
Lg10WF          -0.0028517   0.0011259   -2.5328693   0.0114833

ggplot(cues.ts, aes(x = Arousal.Mean, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Dominance
kable(tidy(lm(t_abs ~ Dominance.Mean+  Lg10WF, data = cues.ts)))

term                estimate   std.error    statistic     p.value
----------------  ----------  ----------  -----------  ----------
(Intercept)        0.0374979   0.0048350    7.7555465   0.0000000
Dominance.Mean     0.0004189   0.0007900    0.5303054   0.5960319
Lg10WF            -0.0028815   0.0011333   -2.5425823   0.0111713

ggplot(cues.ts, aes(x = Dominance.Mean, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Sentiment from Dodds et al. (2011)

SUMMARY: For t, there are marginal effects of happiness_average and happiness_rank: more divergence for less happy words. For abs-t, there’s a significant effect of happiness_rank: more divergence for less happy words.

dodds = read.table("data/dodds_happiness.txt", sep = "\t", header = T, na.strings = "--") %>%
  select(word, happiness_rank, happiness_average, happiness_standard_deviation)

cues.ts =  t.scores.full %>%
  left_join(dodds, by=c("cue"="word")) %>%
  select(cue, t, t_abs, Lg10WF,happiness_rank, happiness_average, 
         happiness_standard_deviation) %>%
  distinct() %>%
  filter(!is.na(Lg10WF) & !is.na(happiness_average))
# NOTE: figure out where the missing ones are going?
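
One way to chase down the note above: list cues that match no word in the Dodds norms (a sketch using anti_join):

# cues in our data with no entry in the Dodds norms
missing.dodds = t.scores.full %>%
  distinct(cue) %>%
  anti_join(dodds, by = c("cue" = "word"))
nrow(missing.dodds)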

There are 4769 cues with both frequency and sentiment norms.

t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t_abs)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname                          t   Lg10WF   happiness_rank   happiness_average   happiness_standard_deviation
-----------------------------  ----  -------  ---------------  ------------------  -----------------------------
t
Lg10WF                          .08
happiness_rank                 -.03     -.02
happiness_average               .03      .03             -.95
happiness_standard_deviation   -.02     -.17              .12                -.12

happiness_rank
kable(tidy(lm(t ~ happiness_rank+  Lg10WF, data = cues.ts)))

term                estimate   std.error    statistic     p.value
----------------  ----------  ----------  -----------  ----------
(Intercept)       -0.0266206   0.0024783   -10.741443   0.0000000
happiness_rank    -0.0000003   0.0000002    -1.715569   0.0863060
Lg10WF             0.0041240   0.0007649     5.391494   0.0000001

ggplot(cues.ts, aes(x = happiness_rank, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

happiness_average
kable(tidy(lm(t~ happiness_average  + Lg10WF, data = cues.ts)))

term                   estimate   std.error   statistic     p.value
-------------------  ----------  ----------  ----------  ----------
(Intercept)          -0.0323810   0.0032681   -9.908226   0.0000000
happiness_average     0.0008116   0.0004340    1.870292   0.0615045
Lg10WF                0.0041097   0.0007650    5.371850   0.0000001

ggplot(cues.ts, aes(x = happiness_average, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

happiness_standard_deviation
kable(tidy(lm(t ~ happiness_standard_deviation+  Lg10WF, data = cues.ts)))

term                              estimate   std.error    statistic     p.value
------------------------------  ----------  ----------  -----------  ----------
(Intercept)                     -0.0262903   0.0038760   -6.7827808   0.0000000
happiness_standard_deviation    -0.0011065   0.0019087   -0.5797419   0.5621161
Lg10WF                           0.0040774   0.0007765    5.2512299   0.0000002

ggplot(cues.ts, aes(x = happiness_standard_deviation, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

Look at correlation between norms

correlate(cues.ts %>% select(-cue, -t)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname                        t_abs   Lg10WF   happiness_rank   happiness_average   happiness_standard_deviation
-----------------------------  -----  -------  ---------------  ------------------  -----------------------------
t_abs
Lg10WF                          -.03
happiness_rank                   .03     -.02
happiness_average               -.02      .03             -.95
happiness_standard_deviation    -.02     -.17              .12                -.12

happiness_rank
kable(tidy(lm(t_abs~ happiness_rank  + Lg10WF, data = cues.ts)))

term                estimate   std.error   statistic     p.value
----------------  ----------  ----------  ----------  ----------
(Intercept)        0.0345614   0.0016216   21.312744   0.0000000
happiness_rank     0.0000002   0.0000001    1.997434   0.0458348
Lg10WF            -0.0009465   0.0005005   -1.891080   0.0586743

ggplot(cues.ts, aes(x = happiness_rank, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

happiness_average
kable(tidy(lm(t_abs~ happiness_average  + Lg10WF, data = cues.ts)))

term                   estimate   std.error   statistic     p.value
-------------------  ----------  ----------  ----------  ----------
(Intercept)           0.0381736   0.0021388   17.847989   0.0000000
happiness_average    -0.0004716   0.0002840   -1.660424   0.0968950
Lg10WF               -0.0009437   0.0005007   -1.884861   0.0595094

ggplot(cues.ts, aes(x = happiness_average, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

happiness_standard_deviation
kable(tidy(lm(t_abs ~ happiness_standard_deviation+  Lg10WF, data = cues.ts)))

term                              estimate   std.error   statistic     p.value
------------------------------  ----------  ----------  ----------  ----------
(Intercept)                      0.0391986   0.0025358   15.458222   0.0000000
happiness_standard_deviation    -0.0021671   0.0012487   -1.735465   0.0827232
Lg10WF                          -0.0011214   0.0005080   -2.207619   0.0273183

ggplot(cues.ts, aes(x = happiness_standard_deviation, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Predicting t with concreteness

SUMMARY: For both t and abs-t, there’s the predicted effect of concreteness: words that are less concrete show more divergence.

cues.ts =  t.scores.full %>%
  select(cue, t, t_abs, Lg10WF, Conc.M) %>%
  distinct() %>%
  filter(!is.na(Lg10WF) & !is.na(Conc.M))

There are 8526 cues with both frequency and concreteness norms.
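
With the hypothetical fit_norm() helper sketched earlier, the two models in this section would reduce to:

# equivalent to the explicit lm() calls below
fit_norm(cues.ts, "t", "Conc.M")
fit_norm(cues.ts, "t_abs", "Conc.M")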

t

kable(tidy(lm(t~Conc.M  + Lg10WF, data = cues.ts)))

term            estimate   std.error    statistic     p.value
------------  ----------  ----------  -----------  ----------
(Intercept)   -0.0170579   0.0019681   -8.6673617   0.0000000
Conc.M         0.0011523   0.0003977    2.8974898   0.0037712
Lg10WF        -0.0002412   0.0005261   -0.4583734   0.6466959

ggplot(cues.ts, aes(x = Conc.M, y = t)) +
  #geom_point() +
  geom_smooth(method  = "lm") +
  theme_bw()

absolute-t

kable(tidy(lm(t_abs~ Conc.M  + Lg10WF, data = cues.ts)))

term            estimate   std.error   statistic    p.value
------------  ----------  ----------  ----------  ---------
(Intercept)    0.0416727   0.0012614   33.036485   0.00e+00
Conc.M        -0.0013854   0.0002549   -5.435274   1.00e-07
Lg10WF        -0.0014881   0.0003372   -4.413100   1.03e-05

ggplot(cues.ts, aes(x = Conc.M, y = t_abs)) +
  #geom_point() +
  geom_smooth(method  = "lm") +
  theme_bw()