Topic modeling

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
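
The two mixtures described above can be illustrated with a small numeric sketch. All values below are made up for illustration and do not come from the fitted model. The names `beta` and `gamma` mirror the per-topic-per-word and per-document-per-topic probabilities used in the rest of this document, and `word_probs` is a hypothetical helper variable.

```r
# Toy illustration of the LDA mixture structure (all numbers are made up).
# beta: one row per topic, one column per word; each row sums to 1.
beta <- rbind(
  topic1 = c(health = 0.6, support = 0.3, game = 0.1),
  topic2 = c(health = 0.1, support = 0.2, game = 0.7)
)

# gamma: the topic proportions of a single document; sums to 1.
gamma <- c(topic1 = 0.75, topic2 = 0.25)

# Implied word distribution for the document:
# P(word | document) = sum over topics of gamma[topic] * beta[topic, word]
word_probs <- drop(gamma %*% beta)
print(word_probs)
```

Because the document draws on both topics, its word distribution sits between the two topic rows, which is the sense in which documents can overlap in content.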

Word-topic probabilities

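The LDA model contains 20 topics. The chart below shows the 20 terms with the highest per-topic-per-word probability (beta) in each topic: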

library(tidyverse)
library(tidytext)

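# Load the saved LDA model
lda_tweets <- read_rds(here::here("data", "lda_tweets.rds"))

# Per-topic-per-word probabilities (beta), one row per topic-term pair
topic_tweets <- tidy(lda_tweets)

# Keep the 20 highest-beta terms in each topic (ties are kept)
tw_top_terms <- topic_tweets %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>%
  ungroup() %>%
  arrange(topic, desc(beta))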

# Plot the top terms per topic, ordered by beta within each facet
tw_top_terms %>%
  mutate(
    topic = paste0("Topic ", topic),
    term = reorder_within(term, beta, topic)
  ) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free", ncol = 3) +
  coord_flip() +
  scale_x_reordered()

Document-topic probabilities

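Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. These per-document-per-topic probabilities are called gamma. Here each document is matched to a user name in the gender extractor output, and each user is assigned the topic with the highest gamma as their main topic. The following chart shows how many users have each topic as their main topic, split by the gender label from the gender extractor. For example, topic 4 is the main topic for 171 users classified as “f”.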

# Per-document-per-topic probabilities (gamma)
topic_tweets <- tidy(lda_tweets, matrix = "gamma")

# Gender extractor output (absolute path on an external volume)
gender_output <- read_tsv("/Volumes/TOSHIBA EXT/neda_twenty/gender_extractor/gender_output.txt",
  col_names = c("id", "name", "processed_name", "gender")
) %>%
  select(name, gender)

# Assign each document its highest-gamma topic (ties are kept),
# then count documents per gender and topic
f <- topic_tweets %>%
  left_join(gender_output, by = c("document" = "name")) %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  count(gender, topic, sort = TRUE)

# Dodged bars of user counts per main topic, split by gender label
f %>%
  mutate(topic = fct_reorder(as_factor(topic), n)) %>%
  ggplot(aes(x = topic, y = n, fill = gender)) +
  geom_col(show.legend = TRUE, position = "dodge2") +
  coord_flip()
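
The main-topic assignment behind the chart can be sketched on toy data. This is a minimal illustration, not the analysis itself: the data frames `toy_gamma` and `toy_gender`, the user names, and all values are made up. Unlike a selection that keeps ties, the sketch uses `with_ties = FALSE` so that each document contributes exactly one row to the counts; a document whose top gamma values are tied would otherwise be counted once per tied topic.

```r
library(dplyr)

# Made-up gamma values in long format: one row per document-topic pair.
toy_gamma <- data.frame(
  document = rep(c("user_a", "user_b", "user_c"), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.9, 0.1, 0.2, 0.8, 0.7, 0.3)
)

# Hypothetical gender labels, one per document.
toy_gender <- data.frame(
  name   = c("user_a", "user_b", "user_c"),
  gender = c("f", "m", "f")
)

# Keep the single highest-gamma topic for each document.
dominant <- toy_gamma %>%
  left_join(toy_gender, by = c("document" = "name")) %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

# Count documents per gender label and main topic.
topic_counts <- count(dominant, gender, topic, sort = TRUE)
print(topic_counts)
```

In this toy data, two documents labelled "f" have topic 1 as their main topic and one document labelled "m" has topic 2.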

The following table shows the same counts as the previous chart, together with the most common words in each topic:

# Collapse each topic's top terms into a single string
tw_top_terms_b <- tw_top_terms %>%
  group_by(topic) %>%
  summarise(text = paste(term, collapse = " ")) %>%
  rename("Most common words in topics" = text)

# One row per topic with user counts per gender label, rendered with gt
f %>%
  pivot_wider(id_cols = topic, names_from = gender, values_from = n) %>%
  left_join(tw_top_terms_b, by = "topic") %>%
  select(topic, `Most common words in topics`, u, f, m) %>%
  mutate(topic = paste0("Topic ", topic)) %>%
  group_by(topic) %>%
  gt::gt()

| Topic | Most common words in topics | u | f | m |
|---|---|---|---|---|
| Topic 4 | health people mental eating day week time fala coutinho learn support care awareness daily por healthy momento disorders national body | 178 | 171 | 54 |
| Topic 6 | people love time day yall fuck life shit fucking feel tweet ur stop post girl literally gonna friends women girls | 78 | 174 | 28 |
| Topic 7 | day #win week join march april win time students hey spring free learn check pinned people #carista @youtube @yahoonews love | 92 | 36 | 22 |
| Topic 17 | people love day time daily news life happy week health students book feel hope mental school support google follow amazing | 53 | 84 | 25 |
| Topic 11 | people love mental god time day life feel health illness happy @mentalhealthmil person week stop book @thesarahfader world jesus talk | 49 | 70 | 19 |
| Topic 3 | day time love people happy week life women game world story team live support @jamesmaslow birthday amazing hope @cnn night | 32 | 52 | 24 |
| Topic 19 | people trump white women time black day love president disabled @aoc #teamrpstrength life stop woman person word house disability children | 28 | 44 | 19 |
| Topic 12 | love people day time life happy feel morning fuck shit lol fucking beautiful hope yall @davelackie stop night god @thespybrief | 20 | 42 | 15 |
| Topic 5 | @financialbuzz news de cse breaking watch buzz otcqb free featuring otc time @smoclerk1 check street video love day @youtube la | 33 | 15 | 38 |
| Topic 16 | love @davepperlmutter @gwenstefani day time @blakeshelton happy people night birthday amazing life tonight gwen hope video week @officialwith1d2 song blake | 28 | 34 | 17 |
| Topic 20 | digital detox day love time #icymi week people read puyo post @floss84 feel review life happy @gdcribbs check hope book | 20 | 26 | 4 |
| Topic 8 | love people day time god life @gospelflava world happy @dyonnelewis women video @youtube live music @zacharylevi @xomeganashley follow week watch | 26 | 25 | 12 |
| Topic 13 | #training #fitness day time people love life happy @mets game morning @snytv week health women follow season @siacademy night world | 15 | 24 | 14 |
| Topic 10 | people @tonyakay love @gpbgeorge day time die ich @tonyakayfan10 @brookelewisla @princessfarhana happy und life week world god das @spann watch | 21 | 14 | 8 |
| Topic 9 | de la el en los es se por para del con las lo una da está te anos como al | 14 | 20 | 9 |
| Topic 14 | di il la nie na che @fitetv się jak live le ma jest @youtube mi że watch con ale daily | 19 | 17 | 8 |
| Topic 2 | @swampmusicinfo music swamp love players @laurarjacobs scared #soundcloud laura jacobs top @crushwb #saludtues #music #retro #americana picks ft song natural | 15 | 13 | 7 |
| Topic 18 | @tazbat99 di news @alfamart @financialbuzz time facebook yang connect dan breaking sahabat love cse otc beli people watch top @wangza | 14 | 7 | 14 |
| Topic 15 | de la le pour les en des je une du pas cest sur ce qui vous est dans il au | 13 | 9 | 9 |
| Topic 1 | de la #siguemeytesigo en el online #siguemeytesigoalinstante watch https://t.co/g5c08ompz1 viewyng check day film video los stream con es news @ciff | 10 | 8 | 8 |