Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
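As a rough sketch of what "mixture" means here, the two levels can be combined into one expression. The names beta (per-topic-per-word probability) and gamma (per-document-per-topic probability) are the ones tidytext uses for the fitted matrices. The index symbols $d$ (document), $k$ (topic), $w$ (term) and $K$ (number of topics) are introduced only for this illustration:

$$
P(w \mid d) \approx \sum_{k=1}^{K} \gamma_{d,k}\,\beta_{k,w},
\qquad \sum_{k=1}^{K} \gamma_{d,k} = 1,
\qquad \sum_{w} \beta_{k,w} = 1
$$

Read this way, each row of gamma says how much of a document belongs to each topic, and each row of beta says how likely each term is within a topic.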
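The model used here has 20 topics. The code below plots, for each topic, the 20 terms with the highest per-topic-per-word probability (beta):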
```r
library(tidyverse)
library(tidytext)

# Load the fitted LDA model and extract per-topic-per-word probabilities (beta)
lda_tweets <- read_rds(here::here("data", "lda_tweets.rds"))
topic_tweets <- tidy(lda_tweets)

# Keep the 20 highest-beta terms in each topic
tw_top_terms <- topic_tweets %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# One facet per topic, with terms ordered by beta within each facet
tw_top_terms %>%
  mutate(
    topic = paste0("Topic ", topic),
    term = reorder_within(term, beta, topic)
  ) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free", ncol = 3) +
  coord_flip() +
  scale_x_reordered()
```
Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called gamma. The following chart shows how often each topic is a user's dominant topic (the one with the highest gamma), broken down by gender. For example, topic 4 is the dominant topic for 171 users classified as “f” by the gender extractor.
```r
# Per-document-per-topic probabilities (gamma)
topic_tweets <- tidy(lda_tweets, matrix = "gamma")

# Gender extractor output, reduced to user name and assigned gender
gender_output <- read_tsv(
  "/Volumes/TOSHIBA EXT/neda_twenty/gender_extractor/gender_output.txt",
  col_names = c("id", "name", "processed_name", "gender")
) %>%
  select(name, gender)

# For each user (document), keep the topic with the highest gamma,
# then count users per gender and dominant topic
f <- topic_tweets %>%
  left_join(gender_output, by = c("document" = "name")) %>%
  group_by(document) %>%
  top_n(1, wt = gamma) %>%
  ungroup() %>%
  count(gender, topic, sort = TRUE)

# Dodged bars: one bar per gender within each topic
f %>%
  mutate(topic = fct_reorder(as_factor(topic), n)) %>%
  arrange(desc(n)) %>%
  ggplot(aes(x = topic, y = n, fill = gender)) +
  geom_col(show.legend = TRUE, position = "dodge2") +
  coord_flip()
```
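One detail of the dominant-topic step is worth a small illustration. `top_n()` keeps ties, so a user whose two highest gamma values are exactly equal would be counted under both topics. Exact ties are unlikely with real gamma values, but the sketch below shows the difference on a made-up table (the user names and values are hypothetical, not taken from the tweet model), using `slice_max()` with `with_ties = FALSE` as one possible alternative:

```r
library(dplyr)

# Toy gamma table with made-up values; user_b has an exact tie
toy_gamma <- tibble::tibble(
  document = c("user_a", "user_a", "user_b", "user_b"),
  topic    = c(1L, 2L, 1L, 2L),
  gamma    = c(0.9, 0.1, 0.5, 0.5)
)

# top_n() keeps ties, so user_b contributes two rows
with_ties <- toy_gamma %>%
  group_by(document) %>%
  top_n(1, wt = gamma) %>%
  ungroup()

# slice_max(with_ties = FALSE) keeps exactly one row per document
one_each <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties) # 3
nrow(one_each)  # 2
```

With `with_ties = FALSE`, the tie is broken by row order, so which topic wins for a tied user is arbitrary. Whether that is preferable to double counting depends on how the counts are used.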
The following table shows the same counts as the previous chart, together with the most common words in each topic:
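```r
# Collapse each topic's top terms into one space-separated string
tw_top_terms_b <- tw_top_terms %>%
  group_by(topic) %>%
  summarise(text = paste(term, collapse = " ")) %>%
  rename("Most common words in topics" = text)

# One row per topic and one count column per gender; grouping by topic
# makes gt render each topic as a row group
f %>%
  pivot_wider(id_cols = topic, names_from = gender, values_from = n) %>%
  left_join(tw_top_terms_b, by = "topic") %>%
  select(topic, `Most common words in topics`, u, f, m) %>%
  mutate(topic = paste0("Topic ", topic)) %>%
  group_by(topic) %>%
  gt::gt()
```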
| Topic | Most common words in topics | u | f | m |
|---|---|---:|---:|---:|
| Topic 4 | health people mental eating day week time fala coutinho learn support care awareness daily por healthy momento disorders national body | 178 | 171 | 54 |
| Topic 6 | people love time day yall fuck life shit fucking feel tweet ur stop post girl literally gonna friends women girls | 78 | 174 | 28 |
| Topic 7 | day #win week join march april win time students hey spring free learn check pinned people #carista @youtube @yahoonews love | 92 | 36 | 22 |
| Topic 17 | people love day time daily news life happy week health students book feel hope mental school support google follow amazing | 53 | 84 | 25 |
| Topic 11 | people love mental god time day life feel health illness happy @mentalhealthmil person week stop book @thesarahfader world jesus talk | 49 | 70 | 19 |
| Topic 3 | day time love people happy week life women game world story team live support @jamesmaslow birthday amazing hope @cnn night | 32 | 52 | 24 |
| Topic 19 | people trump white women time black day love president disabled @aoc #teamrpstrength life stop woman person word house disability children | 28 | 44 | 19 |
| Topic 12 | love people day time life happy feel morning fuck shit lol fucking beautiful hope yall @davelackie stop night god @thespybrief | 20 | 42 | 15 |
| Topic 5 | @financialbuzz news de cse breaking watch buzz otcqb free featuring otc time @smoclerk1 check street video love day @youtube la | 33 | 15 | 38 |
| Topic 16 | love @davepperlmutter @gwenstefani day time @blakeshelton happy people night birthday amazing life tonight gwen hope video week @officialwith1d2 song blake | 28 | 34 | 17 |
| Topic 20 | digital detox day love time #icymi week people read puyo post @floss84 feel review life happy @gdcribbs check hope book | 20 | 26 | 4 |
| Topic 8 | love people day time god life @gospelflava world happy @dyonnelewis women video @youtube live music @zacharylevi @xomeganashley follow week watch | 26 | 25 | 12 |
| Topic 13 | #training #fitness day time people love life happy @mets game morning @snytv week health women follow season @siacademy night world | 15 | 24 | 14 |
| Topic 10 | people @tonyakay love @gpbgeorge day time die ich @tonyakayfan10 @brookelewisla @princessfarhana happy und life week world god das @spann watch | 21 | 14 | 8 |
| Topic 9 | de la el en los es se por para del con las lo una da está te anos como al | 14 | 20 | 9 |
| Topic 14 | di il la nie na che @fitetv się jak live le ma jest @youtube mi że watch con ale daily | 19 | 17 | 8 |
| Topic 2 | @swampmusicinfo music swamp love players @laurarjacobs scared #soundcloud laura jacobs top @crushwb #saludtues #music #retro #americana picks ft song natural | 15 | 13 | 7 |
| Topic 18 | @tazbat99 di news @alfamart @financialbuzz time facebook yang connect dan breaking sahabat love cse otc beli people watch top @wangza | 14 | 7 | 14 |
| Topic 15 | de la le pour les en des je une du pas cest sur ce qui vous est dans il au | 13 | 9 | 9 |
| Topic 1 | de la #siguemeytesigo en el online #siguemeytesigoalinstante watch https://t.co/g5c08ompz1 viewyng check day film video los stream con es news @ciff | 10 | 8 | 8 |