#overlap_coefficients_params2.csv
d <- read_csv("overlap_coefficients_output2.csv",
              col_names = c("word_type", "ci_low", "ci_high",
                            "mean", "n_random", "n_closestwords",
                            "n_vectors", "n_threads", "n_iter",
                            "windowsize", "min_count"))
d_raw_es <- d %>%
  filter(word_type != "random") %>%
  # join each word type to the random-baseline row with the same model parameters
  left_join(d %>%
              filter(word_type == "random") %>%
              rename(ci_low_random = ci_low,
                     ci_high_random = ci_high,
                     mean_random = mean) %>%
              select(-word_type)) %>%
  # number of word pairs in each word type
  mutate(n = case_when(word_type == "antonym" ~ 47,
                       word_type == "gender_noun" ~ 6,
                       word_type == "gender_pronoun" ~ 5,
                       word_type == "gender_relational" ~ 9,
                       word_type == "synonym" ~ 78)) %>%
  # recover the SD from the 95% CI half-width (half-width = 1.96 * sd / sqrt(n))
  mutate(sd = ((ci_high - ci_low) / 2) * sqrt(n) / 1.96,
         sd_random = ((ci_high_random - ci_low_random) / 2) * sqrt(n_random) / 1.96)
d_es <- d_raw_es %>%
  rowwise() %>%
  # Cohen's d (and its p-value) for each word type vs. the random baseline
  do(compute.es::mes(.$mean, .$mean_random,
                     .$sd, .$sd_random,
                     .$n, .$n_random,
                     verbose = F) %>%
       select(d, pval.d)) %>%
  bind_cols(d_raw_es) %>%
  mutate(sig = ifelse(pval.d < .05, "*", ""))
Each point corresponds to an effect size from a model trained with a particular set of parameters. I trained word2vec on the Montag corpus, keeping only words that appeared sufficiently frequently (here, at least 15 times), with windows of varying widths (5, 10, 15, 20 words). I then examined the k closest words to each of two target words (e.g. “he” and “she”). These k words can be thought of as the concepts most semantically associated with the target word. Next, I calculated a metric of how much these two clusters of words overlapped. This overlap coefficient was determined by taking the pairwise distances between all words within each cluster and across the two clusters. The coefficient is then given by the mean across-cluster distance divided by the mean within-cluster distance. This value will be large if there is a lot of overlap, and small if there is minimal overlap (note that “distance” here is cosine similarity, so larger values correspond to greater similarity).
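To make the computation concrete, here is a minimal sketch of the overlap coefficient. The helper functions and the embedding matrix `vecs` (words as row names) are hypothetical placeholders, not the code actually used to produce the results.

# Minimal sketch of the overlap coefficient; `vecs` is assumed to be a
# word-by-dimension embedding matrix with words as row names.
cosine_sim <- function(m1, m2) {
  # pairwise cosine similarity between the rows of m1 and the rows of m2
  (m1 %*% t(m2)) / (sqrt(rowSums(m1^2)) %o% sqrt(rowSums(m2^2)))
}

overlap_coefficient <- function(vecs, word1, word2, k) {
  sims <- cosine_sim(vecs, vecs)
  # k closest words to each target word (excluding the target itself)
  neighbors <- function(w) {
    names(sort(sims[w, setdiff(colnames(sims), w)], decreasing = TRUE))[1:k]
  }
  c1 <- vecs[neighbors(word1), , drop = FALSE]
  c2 <- vecs[neighbors(word2), , drop = FALSE]
  # mean pairwise similarity within each cluster, and across the two clusters
  within_sims <- function(m) { s <- cosine_sim(m, m); s[lower.tri(s)] }
  within <- mean(c(within_sims(c1), within_sims(c2)))
  across <- mean(cosine_sim(c1, c2))
  across / within  # large when the two neighborhoods overlap
}

# e.g. overlap_coefficient(vecs, "he", "she", k = n_closestwords)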
I calculated this overlap coefficient for five types of word pairs: gender pronouns (e.g. “he” vs. “she”), gender nouns (“boy” vs. “girl”), gender relational terms (e.g. “mom” vs. “dad”), antonyms, and synonyms. These terms were an ad hoc list I developed (see below). The prediction is that synonym and antonym word pairs should have high overlap, whereas the gender words should have less overlap.
Next, I compared the overlap coefficient for these word pairs to the overlap coefficient for two randomly-selected words by calculating an effect size between the two. This is what is plotted below along the y-axis. Positive values indicate overlap of the associates of word pairs, and negative values indicate distinct clustering of the associates of word pairs. The red dashed line indicates an effect size of zero, corresponding to the case where the amount of overlap for the target word pairs is identical to that for two randomly selected words. The x-axis corresponds to the size of the window in the word2vec model. The facets correspond to k, the number of associated words examined.
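For reference, the effect size returned by compute.es::mes above is a standard Cohen's d computed from the means, SDs, and ns derived earlier; a minimal sketch of that calculation (the function name here is hypothetical):

# Cohen's d between the word-pair overlap and the random-word baseline
# (compute.es::mes above does this calculation, and also returns a p-value)
cohens_d <- function(m1, m2, sd1, sd2, n1, n2) {
  s_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  (m1 - m2) / s_pooled
}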
d_es %>%
  filter(min_count == 15) %>%
  ggplot(aes(x = windowsize, y = d, group = word_type)) +
  geom_line() +
  geom_point(aes(shape = sig, color = word_type), size = 2) +
  facet_grid(min_count ~ n_closestwords) +
  geom_hline(yintercept = 0, color = "red", linetype = 2) +
  theme_classic()
read_csv("gender_words.csv") %>%
DT::datatable()
read_csv("synonyms_antonyms.csv") %>%
select(1:3) %>%
DT::datatable()
# The w2v models seem sensitive to parameters. Here are the findings across a range of parameter settings.
Takeaways (a training sketch with these settings follows the list):
* number of iterations matters a little bit - do at least 20 [60]
* vector size doesn't matter - do 100 [100]
* min_count matters a little bit - maybe use 15 [5, 10, 16]
* window size matters - use > 4 [2, 4, 8, 10, 12]
* number of threads matters - 8 gives quite different results from 4
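As a rough sketch, training with the settings suggested above might look like this using the word2vec R package. The corpus object montag_text is a placeholder, and the models reported here were trained separately, so treat this as illustrative only.

# Illustrative only: training with the suggested settings via the word2vec R
# package. `montag_text` (a character vector of sentences) is a placeholder.
library(word2vec)
model <- word2vec(x = montag_text,
                  dim = 100,       # vector size: 100
                  window = 8,      # window size > 4
                  iter = 20,       # at least 20 iterations
                  min_count = 15,  # drop words appearing fewer than 15 times
                  threads = 8)     # thread count changed results (8 vs. 4)
embeddings <- as.matrix(model)     # word-by-dimension matrix, words as row names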