Visualizing gender associations in semantic space

All words

Here are all the words we normed, projected onto a two-dimensional semantic space. The semantic space comes from a word2vec model trained on English Wikipedia. Word color indicates the quartile of girl-relatedness (higher quartiles = more strongly associated with girls).

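library(tidyverse)  # provides read_csv(), %>%, the dplyr verbs, and ggplot()

# Load the 2-D word coordinates; treat the gender quartile as a categorical variable
coordinates <- read_csv("data/tsne_book_words.csv") %>%
  mutate(gender_tile = as.factor(gender_tile))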

coordinates %>%
  ggplot(aes(x = tsne_X, y = tsne_Y, color = gender_tile)) +
  geom_text(aes(label = word), size = 2) +
  theme_void() +
  guides(color=guide_legend(title = "Quartile of \ngirl relatedness"))
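
The report does not show how `gender_tile` was computed, so the following is only a minimal sketch of rank-based quartile binning, analogous to what `dplyr::ntile(x, 4)` would give. The `ratings` vector is made-up illustrative data, not values from the norming study.

```r
# Hypothetical gender ratings for eight words (illustrative values only)
ratings <- c(1.2, 4.8, 3.1, 2.5, 3.9, 1.9, 4.2, 2.8)

# Rank the ratings, then cut the ranks into four equal-sized bins.
# Bin 1 holds the lowest ratings, bin 4 the highest.
ranks <- rank(ratings, ties.method = "first")
gender_tile <- ceiling(ranks * 4 / length(ratings))

print(data.frame(ratings, gender_tile))
```

With eight words, each quartile contains exactly two of them; when the number of words is not divisible by four, the bin sizes differ by at most one.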

Clustered words

The representation above is difficult to interpret, so I clustered the words into 100 clusters using k-means and labeled each cluster. For each cluster, I then took the centroid (the mean X and Y coordinates) of all the words in that cluster. The table below shows the cluster mappings.

# Reorder columns by position for display in an interactive table
coordinates %>%
  select(1,3,2,4) %>%
  DT::datatable()
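
The clustering step itself is not shown in this report. As a rough sketch of the idea, the code below runs k-means on toy 2-D coordinates and then takes the per-cluster centroid. It assumes the clustering was done on the 2-D coordinates (it may instead have been done on the full word vectors), uses 5 clusters rather than 100, and the `toy` data frame is randomly generated rather than real.

```r
set.seed(1)

# Toy stand-in for the word coordinates (random values, not real data)
toy <- data.frame(
  word   = paste0("w", 1:200),
  tsne_X = rnorm(200),
  tsne_Y = rnorm(200)
)

# k-means on the two coordinate columns (5 clusters here; the report uses 100)
km <- kmeans(toy[, c("tsne_X", "tsne_Y")], centers = 5, nstart = 10)
toy$cluster_id <- km$cluster

# Centroid of each cluster = mean X and mean Y of its member words
centroids <- aggregate(cbind(tsne_X, tsne_Y) ~ cluster_id, data = toy, FUN = mean)
print(centroids)
```

The cluster labels in the report were assigned separately; k-means only returns numeric cluster ids.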

Mean gender of word

This plot shows the mean gender rating of the words in each cluster (red = more associated with girls). The size of each circle corresponds to the number of words in that cluster.

# One row per cluster: centroid position, word count, and mean gender rating
centroid_df <- coordinates %>%
  group_by(cluster_id, cluster_label) %>%
  summarize(tsne_X = mean(tsne_X),
            tsne_Y = mean(tsne_Y),
            n = n(),
            mean_gender = mean(mean_gender_rating))

centroid_df %>%
  ggplot() +
  ggforce::geom_circle(aes(x0 = tsne_X, y0 = tsne_Y, r = n/10, fill = mean_gender)) +
  geom_text(aes(x = tsne_X, y = tsne_Y, label = cluster_label), size = 2.7) +
  theme_void() + 
  scale_fill_gradient2("mean female\ngender rating", 
                       midpoint = median(centroid_df$mean_gender), 
                       low = "blue", high = "red")  +
  theme(legend.position="bottom")

Relative frequency by book type

This plot shows the relative frequency of each cluster's words in girl vs. boy books (books are classified by a median split on their gender score). Red indicates that the words in a cluster occur more often in girl books. The size of each circle corresponds to the number of words in that cluster. Note that this analysis is somewhat circular, since the book gender ratings were themselves based on the word ratings.

BOOK_GENDER <- "../11_by_book_analyses/data/embedding_gender_by_book_token.csv"
book_gender <- read_csv(BOOK_GENDER)

# Median split on book gender score: book_tile 2 = upper half (girl books)
book_gender_tidy <- book_gender %>%
  mutate(book_tile = ntile(token_mean, 2)) %>%
  select(doc_id, book_tile, token_mean)

BOOK_TOKENS <- "../11_by_book_analyses/data/tidy_lcnl_kidbook_corpus.csv"
book_tokens <- read_csv(BOOK_TOKENS)

# Both joins match on whatever column names the two tables share
book_tokens_tidy <- book_tokens %>%
  left_join(book_gender_tidy) %>%
  right_join(coordinates) %>%
  filter(!is.na(book_tile))


girl_props <- book_tokens_tidy %>%
  group_by(cluster_id, book_tile) %>%
  summarize(n_words = n()) %>%
  group_by(cluster_id) %>%
  mutate(prop = n_words/sum(n_words)) 

girl_props_tidy <- girl_props %>%
  filter(book_tile == 2) %>%
  left_join(centroid_df)

girl_props_tidy %>%
  ggplot() +
  ggforce::geom_circle(aes(x0 = tsne_X, y0 = tsne_Y, r = n/10, fill = prop)) +
  geom_text(aes(x = tsne_X, y = tsne_Y, label = cluster_label), size = 2.7) +
  theme_void() + 
  scale_fill_gradient2("Relative frequency of \nwords in girl books", 
                       midpoint = median(girl_props_tidy$prop), 
                       low = "blue", high = "red")  +
  theme(legend.position="bottom")
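
To make the relative-frequency computation concrete, here is a small base-R sketch on made-up data. The `books` and `tokens` data frames are hypothetical, and joining on `doc_id` is an assumption; the real pipeline uses the corpus files loaded above.

```r
# Hypothetical book-level gender scores (illustrative values only)
books <- data.frame(
  doc_id     = c("b1", "b2", "b3", "b4"),
  token_mean = c(2.1, 2.9, 3.4, 3.8)
)

# Median split: 1 = lower half (boy books), 2 = upper half (girl books)
books$book_tile <- ifelse(books$token_mean > median(books$token_mean), 2, 1)

# Hypothetical tokens, each tagged with the cluster of its word
tokens <- data.frame(
  doc_id     = c("b1", "b1", "b2", "b3", "b3", "b4", "b4", "b4"),
  cluster_id = c(1, 2, 1, 1, 2, 2, 2, 1)
)
tokens <- merge(tokens, books, by = "doc_id")

# Token counts per cluster and book type, then the within-cluster share in girl books
counts <- table(tokens$cluster_id, tokens$book_tile)
prop_girl <- prop.table(counts, margin = 1)[, "2"]
print(prop_girl)
```

In this toy example, half of cluster 1's tokens and three quarters of cluster 2's tokens come from girl books, so cluster 2 would be drawn redder.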

Molly Lewis

2018-11-25
