Question: Do second language learners have less hierarchical (i.e. “smoother”) spaces, relative to native speakers?

Data: Small World of Words dataset. In the task, participants are given a cue, and are asked to generate 3 associates. Each participant completes 15-19 trials.

Method: Construct an unweighted network of cue-associate pairs (i.e. an edge between a cue and an associate if the cue produced that associate) for each language group (native and non-native English speakers). For each network, measure modularity using the Newman (2006) algorithm. We predict that non-native speakers will have less modular spaces than native speakers, despite having smaller vocabularies.
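
To make the modularity measure concrete, here is a minimal base-R sketch of a single two-way split using the leading eigenvector of the modularity matrix, which is the core step of Newman's (2006) method. The toy edge list and all variable names are hypothetical and are not taken from the Small World of Words data. This is not necessarily the implementation used in this analysis; a full recursive implementation (for example `cluster_leading_eigen()` in the igraph package) would be the practical choice for the real networks.

```r
# Toy cue-associate pairs (hypothetical, for illustration only)
edges <- data.frame(
  cue       = c("cat", "cat", "dog", "sun", "sun", "moon", "dog"),
  associate = c("dog", "pet", "pet", "moon", "star", "star", "sun"),
  stringsAsFactors = FALSE
)

# Unweighted, undirected adjacency matrix: 1 if a cue produced an associate
words <- sort(unique(c(edges$cue, edges$associate)))
A <- matrix(0, length(words), length(words), dimnames = list(words, words))
A[cbind(edges$cue, edges$associate)] <- 1
A[cbind(edges$associate, edges$cue)] <- 1
diag(A) <- 0

k <- rowSums(A)                  # node degrees
m <- sum(A) / 2                  # number of edges
B <- A - outer(k, k) / (2 * m)   # modularity matrix

# Split nodes by the sign of the leading eigenvector of B
ev <- eigen(B, symmetric = TRUE)
s  <- ifelse(ev$vectors[, 1] >= 0, 1, -1)
names(s) <- words

# Modularity of the two-group division: Q = s' B s / (4m)
Q <- as.numeric(t(s) %*% B %*% s) / (4 * m)
Q   # 5/14 (about 0.357) for this toy graph
```

In this toy graph the split separates {cat, dog, pet} from {sun, moon, star}; which group gets +1 and which gets -1 is arbitrary. A less hierarchical ("smoother") network would yield a lower Q.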

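library(dplyr)   # %>%, select, mutate, filter, joins
library(tidyr)   # gather, spread
library(ggplot2) # plotting

d.raw = read.csv("../data/associations_ppdetails_en_05_01_2015.csv")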

lang.codes = read.csv("../data/language_codes.csv") %>%
  select(ISO639.2BCode, LanguageName)

d.long = d.raw %>%
  gather("association", "word", 7:9) %>%
  mutate(word = gsub("\\bx\\b", "NA", word)) %>% # recode the placeholder "x" (missing response) as the string "NA"
  spread("association", "word") %>%
  rename(a1 = asso1Clean,
         a2 = asso2Clean,
         a3 = asso3Clean) 

#### d.clean Notes ####
# LANGUAGE CODES: I am excluding language codes that I either don't understand (e.g. eng) or that are missing.
# I am assuming that all upper-case codes are country codes indicating where English is spoken, but this should be verified.
# CUES: The intent is to exclude all cues that are two words (e.g. "head & shoulders") or only one letter ("b"); note that the corresponding filters below are currently commented out.

d.clean = d.long %>%
  left_join(lang.codes, by = c("nativeLanguage" = "ISO639.2BCode")) %>%
  filter(nativeLanguage != "eng" & nativeLanguage != "" & nativeLanguage != "99" &
          nativeLanguage != "fla" & nativeLanguage != "can"  & nativeLanguage != "nan"  & nativeLanguage != "pun" & nativeLanguage != "nl") %>%
  # filter(nchar(as.character(cue)) > 1) %>%
  # filter(sapply(gregexpr("[[:alpha:]]+", cue), function(x) sum(x > 0)) == 1) %>%
  mutate(LanguageName = ifelse(grepl("^[[:upper:]]+$", nativeLanguage), "English", as.character(LanguageName)),
         LanguageName = as.factor(LanguageName),
         country = ifelse(grepl("^[[:upper:]]+$", nativeLanguage), nativeLanguage, NA),
         country = as.factor(country),
         native.lang = ifelse(LanguageName == "English", "english", "other"),
         native.lang = as.factor(native.lang)) %>%
  select(-nativeLanguage)
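
The coding rule used above can be checked in isolation. The sketch below applies the same regular expression to a handful of made-up `nativeLanguage` values; the example codes are hypothetical and may not match the values in the dataset.

```r
# Hypothetical nativeLanguage values: all-upper-case entries are treated as
# country codes (assumed to mark native English speakers); anything else is
# treated as a language code for a non-English native language
codes <- c("US", "AUS", "nld", "deu", "UK")

is_country  <- grepl("^[[:upper:]]+$", codes)
native.lang <- ifelse(is_country, "english", "other")

data.frame(nativeLanguage = codes, is_country, native.lang)
```

Because the rule depends only on letter case, any mixed-case or mistyped entry falls into the "other" group, which is one reason the assumption in the notes above should be verified.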

Constants

# when sampling participants, how many samples to average over?
NSAMP <- 10
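
One way a constant like NSAMP might be used is sketched below: repeatedly draw a subsample of the larger group equal in size to the smaller group, compute a statistic on each draw, and average over draws. That the native group is downsampled to match the non-native group is an assumption, as are the toy data and the placeholder statistic `network_stat` (a count of distinct words standing in for a network measure).

```r
set.seed(1)  # for reproducibility of this illustration only

NSAMP <- 10

# Hypothetical participant table: 20 "english" and 5 "other" participants,
# each contributing one response word
participants <- data.frame(
  userID      = 1:25,
  native.lang = rep(c("english", "other"), times = c(20, 5)),
  word        = rep(c("dog", "cat", "sun", "moon", "star"), 5),
  stringsAsFactors = FALSE
)

# Placeholder statistic (hypothetical): number of distinct words produced by
# the sampled participants
network_stat <- function(ids) {
  length(unique(participants$word[participants$userID %in% ids]))
}

native_ids <- participants$userID[participants$native.lang == "english"]
n_other    <- sum(participants$native.lang == "other")

# Draw NSAMP equal-sized subsamples of native speakers (without replacement)
# and average the statistic across them
stats     <- replicate(NSAMP, network_stat(sample(native_ids, n_other)))
mean_stat <- mean(stats)
```

Averaging over several draws reduces the dependence of the estimate on any single random subsample.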

Demographics

# keep one row per participant
demo.summary = d.clean %>%
  group_by(userID) %>%
  slice(1)

We have 71781 participants in total. Based on the “nativeLanguage” variable, we code each participant as either a native or a non-native speaker of English. This gives 62412 native English speakers and 9369 non-native English speakers.

Non-native speakers are more highly educated and younger than native speakers.

demo.summary %>%
  group_by(LanguageName, native.lang) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(-n) %>%
  slice(1:25) %>%
  ggplot(aes(x = reorder(LanguageName,-n), y = n, fill = native.lang)) +
    geom_bar(stat = "identity") +
    xlab("language") +
    theme_bw(base_size = 15) +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, hjust = 1))