Processing math: 100%
  • Read in data
  • Participant counts
  • Get t-score
  • Predicting divergence with sentiment [t]
    • Frequency
    • Sentiment from tidytext package
    • Sentiment from Warriner et al. (2013)
  • Predicting divergence with sentiment [abs t]
    • Frequency
    • Sentiment from Warriner et al. (2013)
  • Predicting divergence with concreteness [t]
  • Predicting divergence with concreteness [t_abs]


Do the statistics of cue-associate bigrams differ across native and non-native speakers?

Read in data

Munge associates

d.raw = read.csv("../../data/associations_ppdetails_en_05_01_2015.csv") %>%
  rename(raw_lang_id = nativeLanguage)

d.long = d.raw %>%
  gather("association", "word", 7:9) %>%
  mutate(word = gsub("\\bx\\b", "NA", word)) %>% # remove missing words
  spread("association", "word") %>%
  rename(a1 = asso1Clean,
         a2 = asso2Clean,
         a3 = asso3Clean) %>%
  mutate(a1 = trimws(tolower(a1)),
         a2 = trimws(tolower(a2)),
         a3 = trimws(tolower(a3))) %>%
  gather("associate.type", "associate", 7:9) %>% ## collapse across associates
  filter(cue != "NA" & associate != "NA") %>%
  mutate(bigram = paste(cue, associate))

Infer native language based on ISO codes. Note that raw_lang_id is either language ISO or country ISO. It is a country ISO if the language is English, otherwise language ISO. Some raw_lang_ids do not match either. Remove these.

lang.codes = read.csv("../../data/language_codes.csv", fileEncoding="latin1") %>%
  select(LanguageName, ISO639.2BCode)%>%
  rename(langName_ISO = LanguageName)

country.codes = read.csv("../../data/ISO_3_countrycodes.csv") %>%
  select(name, ISO3166.1.Alpha.3) %>%
  rename(countryName_ISO = name)

d.clean = d.long %>%
  filter(raw_lang_id != "") %>%
  left_join(lang.codes, by = c("raw_lang_id" = "ISO639.2BCode")) %>%
  left_join(country.codes, by = c("raw_lang_id" = "ISO3166.1.Alpha.3")) %>%
  filter(!is.na(langName_ISO) | !is.na(countryName_ISO)) %>%
  mutate(native.lang.name = ifelse(!is.na(langName_ISO), as.character(langName_ISO), "English"),
         native.lang = ifelse(native.lang.name == "English", "L1", "L2"),
         english.country = ifelse(!is.na(countryName_ISO), as.character(countryName_ISO), NA),
         langName_ISO = ifelse((native.lang == "English"), "eng", raw_lang_id))%>%
  select(langName_ISO, native.lang, native.lang.name, english.country, userID, age, gender, 
         education, cue, associate.type, associate, bigram) %>%
  mutate(native.lang.name = as.factor(native.lang.name),
         native.lang = as.factor(native.lang),
         english.country = as.factor(english.country),
         langName_ISO = as.factor(langName_ISO))

Participant counts

d.clean %>%
    group_by(userID, native.lang) %>%
    slice(1) %>%
    group_by(native.lang) %>%
    summarise(n = n()) %>%
    arrange(-n) %>%
    kable()
native.lang n
L1 62196
L2 9369

Get t-score

From Manning and Schuetze (1999; pg. 166): Discover co-occurences that best distinguish between words within the same corpus. This is slightly different than our goal, because here we want to compare across corpora, and thus our N’s are different. Here, I modify their formula to account for this (analagous to Welch’s two-sample t-test).

Manning and Schuetze: t=C(v1w)NC(v2w)NC(v1w)+C(v2w)N2

Welch’s version: t=C(v1w)n1C(v2w)n2C(v1w)n1+C(v2w)n2

So suppose we want to figure out how divergent responses are for the cue “zucchini” between native and non-native speakers. For each associate that participants produced for this cue, we’d compare the relative frequency between native and non-native. For example, let’s look at the frequency of producing “vegetable” as an associate for zucchini. Native speakers produced “vegetable” in response to zucchini 37 times and non-native 5. We also need the base rates of producing vegetable as an associate. Native speakers produced vegetable as an associate 161 times and non-native speakers produced vegetable 51 times. So, we have:

C(zv)native=37C(zv)nonnative=5C(v)native=161C(v)nonnative=51

t=3716155137161+551=.23

We then do this for every associate that was produced for zuchinni, and take the mean t-score across associates. This is our estimage of how “divergent” responses are between native and non-native speakers for a given cue.

Calculate t-scores

# get C(vw) in each language (bigram counts)
bigram.counts = d.clean %>%
  count(native.lang, cue, associate, bigram) 

# get C(v) in each langauge (associate counts)
total.bigrams.native.lang = bigram.counts %>%
  group_by(native.lang, associate) %>%
  summarize(n_associate = n())

# get C(vw)/C(v) in each langauge
bigram.counts.rf = bigram.counts %>%
  rename(n_bigram = n) %>%
  left_join(total.bigrams.native.lang) %>%
  mutate(rf = n_bigram/n_associate)
  
t.scores <- bigram.counts.rf %>%
  ungroup() %>%
  as_tibble() %>%
  select(native.lang, rf, cue, associate) %>%
  spread(native.lang, rf) %>%
  filter(!is.na(L1), !is.na(L2)) %>%
  mutate(t = (L1 - L2)/sqrt(L1 + L2))

Predicting divergence with sentiment [t]

Now let’s see if we can predict the t-score based on characteristics of the cue.

t-scores

ggplot(t.scores, aes(x = t)) +
  geom_histogram() +
  theme_bw() +
  ggtitle("distribution of t-scores")

Join t.scores and cues characteristics

cues.chars = read.csv("../second_orderQs/data/cues_chars.csv") %>%
  select(-1)

t.scores.full = t.scores %>%
  group_by(cue) %>%
  summarize(t = mean(t),
            t_abs = mean(abs(t))) %>%
  left_join(cues.chars) 

Frequency

Does frequency of cue predict mean t for a cue?

freq.ts = t.scores.full %>%
  select(cue, t, Lg10WF) %>%
  distinct()

ggplot(freq.ts, aes(x = Lg10WF, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Less divergence for high frequency items. But clearly not the whole effect.

Sentiment from tidytext package

tidytext::get_sentiments("afinn") (Finn Arup Nielse emotion ratings)

Does sentiment of cue predict mean t for a cue? Note that there’s only 1063 words here.

quantsent.ts =  t.scores.full %>%
      select(quant.sent, t, Lg10WF) %>%
  distinct()

ggplot(quantsent.ts, aes(x = quant.sent, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

quantsent.ts %>%
  ungroup() %>%
  do(tidy(cor.test(.$t, .$quant.sent))) %>%
  select(estimate, statistic, p.value) %>%
  kable()
estimate statistic p.value
0.0264075 0.8649214 0.3872754

No effect here.

Sentiment from Warriner et al. (2013)

Valence, arousal and dominance norms for 13,915 lemmas total (https://link.springer.com/article/10.3758%2Fs13428-012-0314-x). This provides a much larger sample to look at the relationship between valence and diveregence. Scales: unhappy/calm to happy/excited.

warriner.sent = read.csv("data/BRM-emot-submit.csv") %>%
  select(Word, V.Mean.Sum, A.Mean.Sum, D.Mean.Sum)

t.scores.full = left_join(t.scores.full, warriner.sent, by=c("cue"= "Word"))

cues.ts =  t.scores.full %>%
  select(cue, t, t_abs, Lg10WF,V.Mean.Sum,A.Mean.Sum, D.Mean.Sum, Conc.M) %>%
  distinct() %>%
  filter(!is.na(Lg10WF) & !is.na(V.Mean.Sum))

There are 7380 cues with both frequency and emotion words.

Look at correlation between norms

correlate(cues.ts %>%  select(-1,-3,-8)) %>%
  shave() %>%
  fashion() %>%
  kable()
rowname t Lg10WF V.Mean.Sum A.Mean.Sum D.Mean.Sum
t
Lg10WF -.00
V.Mean.Sum .04 .15
A.Mean.Sum -.06 .04 -.17
D.Mean.Sum .01 .13 .72 -.17

Valence

There’s an effect of valence: Low valence items have more divergence.

kable(tidy(lm(t~ V.Mean.Sum  + Lg10WF, data = cues.ts)))
term estimate std.error statistic p.value
(Intercept) -0.0071329 0.0037326 -1.910975 0.0560465
V.Mean.Sum 0.0020584 0.0005632 3.655105 0.0002589
Lg10WF -0.0006985 0.0010830 -0.644980 0.5189602
ggplot(cues.ts, aes(x = V.Mean.Sum, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Arousal

There’s also an effect of arousal: High arousal words diverge more, and this doesn’t interact with frequency.

kable(tidy(lm(t ~ A.Mean.Sum+  Lg10WF, data = cues.ts)))
term estimate std.error statistic p.value
(Intercept) 0.0191154 0.0042277 4.5214319 0.0000062
A.Mean.Sum -0.0041485 0.0007922 -5.2369500 0.0000002
Lg10WF 0.0001214 0.0010707 0.1133471 0.9097584
ggplot(cues.ts, aes(x = A.Mean.Sum, y = t)) +
  geom_smooth(method  = "lm") +
  theme_bw()

In sum, for t, there is more diverence for words that are low valence and high arousal. This holds controling for frequency.

Predicting divergence with sentiment [abs t]

Now let’s look at absolute t value.

t-scores

ggplot(t.scores.full, aes(x = t_abs)) +
  geom_histogram() +
  theme_bw() +
  ggtitle("distribution of t-scores")

Frequency

Does frequency of cue predict mean t for a cue?

freq.ts =  t.scores.full %>%
  select(cue, t_abs, Lg10WF) %>%
  distinct()

ggplot(freq.ts, aes(x = Lg10WF, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

Less divergence for high frequency items. But clearly not the whole effect.

Sentiment from Warriner et al. (2013)

Look at correlation between norms

correlate(cues.ts %>%  select(-1:-2,-8)) %>%
  shave() %>%
  fashion() %>%
  kable()
rowname t_abs Lg10WF V.Mean.Sum A.Mean.Sum D.Mean.Sum
t_abs
Lg10WF -.07
V.Mean.Sum -.05 .15
A.Mean.Sum -.01 .04 -.17
D.Mean.Sum -.03 .13 .72 -.17

Valence

kable(tidy(lm(t_abs~ V.Mean.Sum  + Lg10WF, data = cues.ts)))
term estimate std.error statistic p.value
(Intercept) 0.0642224 0.0025277 25.407167 0.0000000
V.Mean.Sum -0.0013382 0.0003814 -3.508879 0.0004527
Lg10WF -0.0042801 0.0007334 -5.835574 0.0000000
ggplot(cues.ts, aes(x = V.Mean.Sum, y = t_abs)) +
  geom_smooth(method  = "lm") +
  theme_bw()

In sum, for abs t, there only an effect of valence (there’s no effect of arousal or dominance.. Need to think about what the directional effect means for arousal).

Predicting divergence with concreteness [t]

kable(tidy(lm(t~ Conc.M  + Lg10WF, data = cues.ts)))
term estimate std.error statistic p.value
(Intercept) -0.0299568 0.0035328 -8.4797021 0.0000000
Conc.M 0.0094234 0.0006794 13.8698371 0.0000000
Lg10WF -0.0002598 0.0010595 -0.2452373 0.8062795
ggplot(cues.ts, aes(x = Conc.M, y = t)) +
  #geom_point() +
  geom_smooth(method  = "lm") +
  theme_bw()

There’s the predicted effect on concreteness: Words that are more concretee have less divergence.

Predicting divergence with concreteness [t_abs]

kable(tidy(lm(t_abs~ Conc.M  + Lg10WF, data = cues.ts)))
term estimate std.error statistic p.value
(Intercept) 0.0575290 0.0024230 23.7429134 0.0000000
Conc.M 0.0000689 0.0004660 0.1479639 0.8823753
Lg10WF -0.0045254 0.0007267 -6.2276389 0.0000000
ggplot(cues.ts, aes(x = Conc.M, y = t_abs)) +
  #geom_point() +
  geom_smooth(method  = "lm") +
  theme_bw()