Here are all the words we normed, projected onto a two-dimensional semantic space (t-SNE coordinates). The semantic space comes from a word2vec model trained on English Wikipedia. Word color indicates the quartile of girl-relatedness (higher quartiles = more strongly associated with girls).
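library(tidyverse)  # read_csv, dplyr verbs, ggplot2

# Load the t-SNE coordinates; treat the quartile as a categorical variable
coordinates <- read_csv("data/tsne_book_words.csv") %>%
  mutate(gender_tile = as.factor(gender_tile))

# One text label per word, colored by quartile
coordinates %>%
  ggplot(aes(x = tsne_X, y = tsne_Y, color = gender_tile)) +
  geom_text(aes(label = word), size = 2) +
  theme_void() +
  guides(color = guide_legend(title = "Quartile of \ngirl relatedness"))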
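The quartile column arrives precomputed in the CSV, so how it was built is not shown here. As a minimal sketch, assuming it was derived from the mean gender rating with `dplyr::ntile()` (the same function used for the book split later on), the binning might look like this. The words and ratings are made up for illustration.

```r
library(dplyr)

# Toy ratings; hypothetical values, not taken from the normed data
toy_ratings <- tibble(
  word = c("truck", "ball", "dog", "tree", "book", "cake", "doll", "dress"),
  mean_gender_rating = c(1.5, 2.0, 2.8, 3.0, 3.2, 3.6, 4.4, 4.7)
)

# ntile() ranks the ratings and splits them into 4 equal-sized bins,
# so bin 4 holds the words with the highest (most girl-related) ratings
toy_quartiles <- toy_ratings %>%
  mutate(gender_tile = as.factor(ntile(mean_gender_rating, 4)))

toy_quartiles
```

With eight words, each quartile gets two words, and the highest-rated word lands in quartile 4.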
That representation is difficult to interpret, so I grouped the words into 100 clusters using k-means and labeled each cluster. Then, for each cluster, I took the centroid (the mean X and Y) of all the words in that cluster. Below are the word-to-cluster mappings.
# Show four columns, picked and reordered by position
coordinates %>%
  select(1, 3, 2, 4) %>%
  DT::datatable()
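The clustering step itself is not shown in this section. A minimal sketch of it, assuming k-means was run on the two t-SNE coordinates (it may instead have been run on the full word2vec vectors), could use `stats::kmeans()`. The toy data, the choice of 3 clusters, and the `toy_*` names are illustrative only.

```r
library(dplyr)

set.seed(1)  # k-means starts from random centers, so fix the seed

# Toy 2-D coordinates standing in for tsne_X and tsne_Y:
# three well-separated groups of 20 points each
toy_coords <- tibble(
  word   = paste0("w", 1:60),
  tsne_X = c(rnorm(20, -10), rnorm(20, 0), rnorm(20, 10)),
  tsne_Y = c(rnorm(20, -10), rnorm(20, 0), rnorm(20, 10))
)

# kmeans() returns one integer cluster assignment per row;
# nstart = 10 tries several random starts and keeps the best fit
fit <- kmeans(toy_coords %>% select(tsne_X, tsne_Y), centers = 3, nstart = 10)

toy_clustered <- toy_coords %>%
  mutate(cluster_id = fit$cluster)

# Centroid of each cluster = mean X and mean Y of its member words
toy_centroids <- toy_clustered %>%
  group_by(cluster_id) %>%
  summarize(tsne_X = mean(tsne_X), tsne_Y = mean(tsne_Y), n = n())

toy_centroids
```

For the real data, `centers` would be 100, and cluster labels would still need to be assigned by hand.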
This plot shows the mean gender rating of the words in each cluster (red = more associated with girls). The radius of each circle is proportional to the number of words in that cluster.
# One row per cluster: centroid, word count, and mean gender rating
centroid_df <- coordinates %>%
  group_by(cluster_id, cluster_label) %>%
  summarize(tsne_X = mean(tsne_X),
            tsne_Y = mean(tsne_Y),
            n = n(),
            mean_gender = mean(mean_gender_rating)) %>%
  ungroup()  # drop the leftover cluster_id grouping

# Circle radius scales with cluster size; fill diverges around the median cluster rating
centroid_df %>%
  ggplot() +
  ggforce::geom_circle(aes(x0 = tsne_X, y0 = tsne_Y, r = n/10, fill = mean_gender)) +
  geom_text(aes(x = tsne_X, y = tsne_Y, label = cluster_label), size = 2.7) +
  theme_void() +
  scale_fill_gradient2("mean female\ngender rating",
                       midpoint = median(centroid_df$mean_gender),
                       low = "blue", high = "red") +
  theme(legend.position = "bottom")
This plot shows the relative frequency of each cluster's words in girl vs. boy books (based on a median split of the book gender scores). Red indicates that a cluster's words occur more often in girl books. The radius of each circle is proportional to the number of words in that cluster. Note that this analysis is somewhat circular, since the book gender ratings were themselves based on the word ratings.
# Median split of books by mean token gender score: 1 = lower half, 2 = upper half
BOOK_GENDER <- "../11_by_book_analyses/data/embedding_gender_by_book_token.csv"
book_gender <- read_csv(BOOK_GENDER)
book_gender_tidy <- book_gender %>%
  mutate(book_tile = ntile(token_mean, 2)) %>%
  select(doc_id, book_tile, token_mean)

# Attach each token's book half and cluster; both joins use the shared column names
BOOK_TOKENS <- "../11_by_book_analyses/data/tidy_lcnl_kidbook_corpus.csv"
book_tokens <- read_csv(BOOK_TOKENS)
book_tokens_tidy <- book_tokens %>%
  left_join(book_gender_tidy) %>%
  right_join(coordinates) %>%
  filter(!is.na(book_tile))

# Within each cluster, the share of tokens that come from each half
girl_props <- book_tokens_tidy %>%
  group_by(cluster_id, book_tile) %>%
  summarize(n_words = n()) %>%
  group_by(cluster_id) %>%
  mutate(prop = n_words/sum(n_words))

# Keep the upper half (treated here as girl books) and add the centroid info
girl_props_tidy <- girl_props %>%
  filter(book_tile == 2) %>%
  left_join(centroid_df)
girl_props_tidy %>%
  ggplot() +
  ggforce::geom_circle(aes(x0 = tsne_X, y0 = tsne_Y, r = n/10, fill = prop)) +
  geom_text(aes(x = tsne_X, y = tsne_Y, label = cluster_label), size = 2.7) +
  theme_void() +
  scale_fill_gradient2("Relative frequency of \nwords in girl books",
                       midpoint = median(girl_props_tidy$prop),
                       low = "blue", high = "red") +
  theme(legend.position = "bottom")
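To make the relative-frequency measure concrete, here is a small worked example of the median split and the per-cluster proportion. It is a sketch on made-up data: the book ids, scores, and token table are hypothetical, and the join key is assumed to be `doc_id`.

```r
library(dplyr)

# Four hypothetical books with a mean token gender score each
toy_books <- tibble(
  doc_id     = c("b1", "b2", "b3", "b4"),
  token_mean = c(2.1, 2.6, 3.4, 3.9)
) %>%
  mutate(book_tile = ntile(token_mean, 2))  # 1 = lower half, 2 = upper half

# One row per token, already labeled with its cluster
toy_tokens <- tibble(
  doc_id     = c("b1", "b1", "b2", "b3", "b3", "b4", "b4", "b4"),
  cluster_id = c(1, 2, 1, 1, 2, 2, 2, 1)
)

# Count tokens per cluster and book half, then convert to within-cluster shares
toy_props <- toy_tokens %>%
  left_join(toy_books, by = "doc_id") %>%
  count(cluster_id, book_tile, name = "n_words") %>%
  group_by(cluster_id) %>%
  mutate(prop = n_words / sum(n_words)) %>%
  ungroup()

toy_props %>% filter(book_tile == 2)
```

Here cluster 1 has two of its four tokens in the upper half (prop = 0.5), and cluster 2 has three of four (prop = 0.75). Because the proportions within a cluster sum to 1, plotting only the upper-half share loses no information.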