Topic modeling

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
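
The two mixtures can be written compactly. This is a sketch in notation that does not appear in the original analysis: $d$ indexes documents, $w$ words, $k$ topics, and $K$ is the number of topics. The names beta (per-topic-per-word probability) and gamma (per-document-per-topic probability) are the ones used in the rest of this section. Under the LDA model, the probability that a word position in document $d$ holds word $w$ is a sum over topics:

$$
P(w \mid d) = \sum_{k=1}^{K} \gamma_{d,k}\,\beta_{k,w},
\qquad \sum_{k=1}^{K} \gamma_{d,k} = 1,
\qquad \sum_{w} \beta_{k,w} = 1 .
$$

Because a document can carry non-negligible gamma for several topics at once, two documents can share topics, which is the overlap described above.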

Word-topic probabilities

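The LDA model was fitted with 20 topics. For each topic, the code below plots the 20 terms with the highest per-topic-per-word probability (beta):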

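# tidyverse provides dplyr, ggplot2 and readr; tidytext provides tidy() and reorder_within()
library(tidyverse)
library(tidytext)

# Fitted LDA model, read from a local file
lda_tweets <- read_rds(here::here("data", "lda_tweets.rds"))

# One row per topic-term pair; beta is the per-topic-per-word probability
topic_tweets <- tidy(lda_tweets)

# Keep the 20 highest-beta terms in each topic (tied terms are kept)
tw_top_terms <- topic_tweets %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>%
  ungroup() %>%
  arrange(topic, -beta)

# One panel per topic, with terms ordered by beta within each panel
tw_top_terms %>%
  mutate(
    topic = paste0("Topic ", topic),
    term = reorder_within(term, beta, topic)
  ) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free", ncol = 3) +
  coord_flip() +
  scale_x_reordered()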

Document-topic probabilities

Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. These per-document-per-topic probabilities are called gamma. The following chart assigns each user to the topic with the highest gamma and shows the number of users per topic, broken down by the gender label from the gender extractor. In the table below, for example, topic 10 is the dominant topic for 193 users classified as “f” by the gender extractor.
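
The counting behind the chart can be stated in symbols. The notation is introduced here for illustration and is not part of the original analysis: $\hat{z}_d$ is the dominant topic of document $d$ (here, a user), $g(d)$ is the gender label joined to that document, and $n_{g,k}$ is the height of the bar for gender $g$ and topic $k$:

$$
\hat{z}_d = \arg\max_{k} \gamma_{d,k},
\qquad
n_{g,k} = \bigl|\{\, d : g(d) = g,\ \hat{z}_d = k \,\}\bigr| .
$$

If a document has tied maximal gamma values, the selection step in the code below keeps all tied rows, so that document is counted once for each tied topic.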

# One row per document-topic pair; gamma is the per-document-per-topic probability
topic_tweets <- tidy(lda_tweets, matrix = "gamma")

# Output of the gender extractor: keep only the user name and the assigned gender
gender_output <- read_tsv(
  here::here("data", "gender_extractor", "neda_liwc_gender_output.tsv"),
  col_names = c("id", "name", "processed_name", "gender")
) %>%
  select(name, gender)

# For each document, keep the topic with the highest gamma (tied rows are kept),
# then count documents per gender and topic
f <- topic_tweets %>%
  left_join(gender_output, by = c("document" = "name")) %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  count(gender, topic, sort = TRUE)

# Dodged bars: one bar per gender within each topic,
# with topics ordered by their median count across genders
f %>%
  mutate(topic = fct_reorder(as_factor(topic), n)) %>%
  ggplot(aes(x = topic, y = n, fill = gender)) +
  geom_col(show.legend = TRUE, position = "dodge2") +
  coord_flip()

The following table shows the same counts as the previous chart, together with the most common words in each topic:

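# Collapse the top terms of each topic into a single space-separated string
tw_top_terms_b <- tw_top_terms %>%
  group_by(topic) %>%
  summarise(text = paste(term, collapse = " ")) %>%
  rename("Most common words in topics" = text)

# One row per topic with one count column per gender, joined with the top terms;
# grouping by topic makes gt render each topic as a row-group label
f %>%
  pivot_wider(id_cols = topic, names_from = gender, values_from = n) %>%
  left_join(tw_top_terms_b, by = "topic") %>%
  select(topic, `Most common words in topics`, u, f, m) %>%
  mutate(topic = paste0("Topic ", topic)) %>%
  group_by(topic) %>%
  gt::gt()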

| Topic | Most common words in topics | u | f | m |
|---|---|---|---|---|
| Topic 10 | people love day time life fucking yall fuck feel shit stop women girl ur literally friends happy gonna person lol | 91 | 193 | 37 |
| Topic 6 | health mental people eating day week time support love disorders #mentalhealth care learn women awareness join disorder check body march | 185 | 136 | 43 |
| Topic 1 | love day time people happy feel life week book amazing fuck words read @thesarahfader word shit twitter march omg gonna | 31 | 65 | 13 |
| Topic 19 | puyo day post time love life people game speaker bluetooth black happy photography video @emtfr portable week free @nhlflyers follow | 48 | 60 | 25 |
| Topic 5 | school people students love daily day time @tazbat99 week support join youth happy kids share check night #saludtues life health | 51 | 29 | 19 |
| Topic 20 | people day news disability disabled love google check time join women students online week happy @thespybrief daily support world march | 45 | 37 | 16 |
| Topic 9 | @financialbuzz news watch de @alfamart cse breaking buy love free @youtube time day di @marcuslemonis @smoclerk1 check online otc reviews | 31 | 15 | 37 |
| Topic 2 | love day people time god @gospelflava morning #icymi @morematters life happy @floss84 week feel #elivetweets hope due @spann night @mollysdailykiss | 19 | 33 | 11 |
| Topic 17 | trump people president women white black time @realdonaldtrump day trumps @aoc house national children news woman emergency love life american | 26 | 33 | 17 |
| Topic 7 | people love day time @samsungmobileus life women happy disabled feel world white black trans @specialolympipa stop week support lot hope | 17 | 31 | 8 |
| Topic 4 | digital detox love people day time happy video @gpbgeorge women @tonyakay week #oscars black world life book birthday @tonyakayfan10 @youtube | 15 | 28 | 8 |
| Topic 8 | @jamesmaslow love day time people follow happy @brimoniq story week ig louis life world wait night amazing @officialwith1d2 morning @teenvogue | 26 | 27 | 10 |
| Topic 14 | love day time people live @gpbgeorge happy watch @fcbarcelona @fitetv @proudxtianmaga hope week world game birthday night life women tonight | 9 | 25 | 17 |
| Topic 3 | fala coutinho momento por bastidores em foco #fitness #training da daily os diego tati martins dos famosos famosidade política schueng | 23 | 9 | 11 |
| Topic 11 | god @davepperlmutter love day time life people happy @donnasiggers1 @braedenlemaster book jesus amen read week @bradwallactor beautiful world friends #wrongplacewrongtime | 17 | 21 | 13 |
| Topic 18 | de la el en los es con por psychology se lo tweet las post para info click audience #retweet #fiverr | 9 | 18 | 11 |
| Topic 16 | @swampmusicinfo music love swamp players @laurarjacobs scared #carista pinned customization #car @davelackie #soundcloud laura ur song jacobs bro da day | 14 | 16 | 11 |
| Topic 15 | de #nyc #newyork @nycdailypics anos da day em eu pahealthdept não love se la há happy #ligadoamusica para top city | 13 | 14 | 7 |
| Topic 13 | de la le di il pour les en je des che une sur pas vous ce cest du qui si | 11 | 9 | 6 |
| Topic 12 | nie na się jak jest ale pst że @blakeshelton march #weather #lax february #alert issued nws tak ja #la ze | 10 | 6 | 1 |

Top tweets
