Our modeling goal is to “discover” topics in the text of Bigfoot sighting reports. Instead of a supervised or predictive model where our observations have labels, this is an unsupervised approach.

This analysis adapts Julia Silge’s topic modeling of Spice Girls lyrics to the #TidyTuesday Bigfoot sightings data, swapping in equivalent terms: each sighting report plays the role of a song, and an observation id plays the role of the song name.

https://juliasilge.com/blog/spice-girls/

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidytext)

bigfoot_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-13/bigfoot.csv')
## Rows: 5021 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): observed, location_details, county, state, season, title, classif...
## dbl  (17): latitude, longitude, number, temperature_high, temperature_mid, t...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore data

bigfoot_raw %>%
  count(classification)
## # A tibble: 3 × 2
##   classification     n
##   <chr>          <int>
## 1 Class A         2481
## 2 Class B         2510
## 3 Class C           30
bigfoot <-
    bigfoot_raw %>%
    filter(classification != "Class C", !is.na(observed)) %>%
    mutate(
        classification = case_when(
        classification == "Class A" ~ "sighting",
        classification == "Class B" ~ "possible"
      )
    ) %>%
    
    # Create a new variable, obs_id, the equivalent of song_name in the Spice Girls analysis
    mutate(obs_id = row_number()) %>%
    select(obs_id, observed, season, county, state)

bigfoot
## # A tibble: 4,953 × 5
##    obs_id observed                                           season county state
##     <int> <chr>                                              <chr>  <chr>  <chr>
##  1      1 "I was canoeing on the Sipsey river in Alabama. I… Summer Winst… Alab…
##  2      2 "Ed L. was salmon fishing with a companion in Pri… Fall   Valde… Alas…
##  3      3 "While attending U.R.I in the Fall of 1974,I woul… Fall   Washi… Rhod…
##  4      4 "Hello, My name is Doug and though I am very relu… Summer York … Penn…
##  5      5 "It was May 1984. Two friends and I were up in th… Spring Yamhi… Oreg…
##  6      6 "My two children and I were returning from Altus,… Fall   Washi… Okla…
##  7      7 "I was staying the night with a friends of mine. … Summer Washi… Ohio 
##  8      8 "Well last year I was night fishing 9/2010 at Ama… Fall   Westc… New …
##  9      9 "I grew up in Northwestern Nevada along the Calif… Fall   Washo… Neva…
## 10     10 "heh i kinda feel a little dumb that im reporting… Fall   Warre… New …
## # … with 4,943 more rows
tidy_bigfoot <-
  bigfoot %>%
  unnest_tokens(word, observed) %>%
  anti_join(get_stopwords())
## Joining, by = "word"
tidy_bigfoot %>%
  count(word, sort = TRUE)
## # A tibble: 29,578 × 2
##    word       n
##    <chr>  <int>
##  1 like    7735
##  2 back    7675
##  3 heard   7369
##  4 saw     6042
##  5 just    5925
##  6 road    5564
##  7 see     5344
##  8 area    5273
##  9 one     5253
## 10 around  5139
## # … with 29,568 more rows
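
The most common words here (“like”, “back”, “heard”) are fairly generic. If they ended up dominating the topics, one option (a sketch, not part of the original analysis) would be to extend the stop word list with corpus-specific filler words before modeling:

# Hypothetical extra stop words specific to this corpus; adjust to taste
bigfoot_stops <- tibble(word = c("like", "just", "one", "get", "around"))

tidy_bigfoot %>%
  anti_join(bigfoot_stops, by = "word") %>%
  count(word, sort = TRUE)
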
tidy_bigfoot %>%
  count(obs_id, word, sort = TRUE)
## # A tibble: 630,183 × 3
##    obs_id word         n
##     <int> <chr>    <int>
##  1   2673 subject     54
##  2   4172 creature    49
##  3    470 mr          44
##  4    470 r           42
##  5    585 ron         37
##  6    767 finger      34
##  7    276 heard       30
##  8    470 boat        30
##  9    585 scott       30
## 10   4518 tom         30
## # … with 630,173 more rows

Train a topic model

To train a topic model with the stm package, we need to create a sparse matrix from our tidy dataframe of tokens.

bigfoot_sparse <-
  tidy_bigfoot %>%
  count(obs_id, word) %>%
  cast_sparse(obs_id, word, n)

dim(bigfoot_sparse)
## [1]  4952 29578

This means there are 4,952 reports (i.e. documents) and 29,578 different tokens (i.e. terms or words) in our dataset for modeling. Note that we started with 4,953 reports; one presumably drops out because it contains only stop words, leaving it with no tokens to contribute to the matrix.
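
We can verify which report dropped out by looking for an obs_id with no tokens left:

# Which obs_id has no rows remaining after tokenizing and removing stop words?
setdiff(bigfoot$obs_id, unique(tidy_bigfoot$obs_id))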

A topic model like this one models each document as a mixture of topics and each topic as a mixture of words.
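
Concretely, the probability of seeing word w in document d marginalizes over the K topics:

P(w | d) = Σ_k γ_{d,k} β_{k,w}

where γ_{d,k} is the probability that document d belongs to topic k, and β_{k,w} is the probability of word w within topic k. These are exactly the “gamma” and “beta” matrices we tidy below.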

The most important parameter when training a topic model is K, the number of topics. This is like k in k-means: it is a hyperparameter of the model, and we must choose its value ahead of time. We could try multiple values to find the best K, but since these reports come pre-labeled with two classes (“sighting” and “possible”), let’s use K = 2 and see whether the topics recover that split.
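
If we did want to compare several values of K, one approach (a sketch, not part of the original analysis) is to fit a model per candidate K and compare a diagnostic such as exclusivity; the stm package also provides searchK() for fuller diagnostics such as held-out likelihood and semantic coherence.

library(stm)
library(purrr)

set.seed(123)
# Fit one model per candidate K (slow for many candidates)
candidate_models <- map(c(2, 3, 4, 6), ~ stm(bigfoot_sparse, K = .x, verbose = FALSE))

# Higher mean exclusivity means topics use more distinctive words
map_dbl(candidate_models, ~ mean(exclusivity(.x)))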

library(stm)
## Warning: package 'stm' was built under R version 4.2.3
## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com
set.seed(123)

# Choose 2 for K because we know it has two classes
topic_model <- stm(bigfoot_sparse, K = 2, verbose = FALSE)
## Warning in stm(bigfoot_sparse, K = 2, verbose = FALSE): K=2 is equivalent to a
## unidimensional scaling model which you may prefer.

To get a quick view of the results, we can use summary().

summary(topic_model)
## A topic model with 2 topics, 4952 documents and a 29578 word dictionary.
## Topic 1 Top Words:
##       Highest Prob: back, just, saw, around, us, looked, two 
##       FREX: us, just, two, right, know, animal, bear 
##       Lift: us, right, bear, animal, know, just, two 
##       Score: just, us, around, two, right, know, get 
## Topic 2 Top Words:
##       Highest Prob: like, heard, road, see, one, time, area 
##       FREX: road, one, like, see, time, something, went 
##       Lift: road, one, time, see, something, like, went 
##       Score: like, road, see, one, time, something, went

Explore topic model results

To explore more deeply, we can tidy() the topic model results to get a dataframe that we can compute on. There are two possible outputs for this topic model, the “beta” matrix of topic-word probabilities and the “gamma” matrix of document-topic probabilities. Let’s start with the first.

word_topics <- tidy(topic_model, matrix = "beta")
word_topics
## # A tibble: 59,156 × 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 2010    0.0000000242
##  2     2 2010    0.000111    
##  3     1 60      0.000531    
##  4     2 60      0.00000326  
##  5     1 75      0.000240    
##  6     2 75      0.000187    
##  7     1 alabama 0.0000000334
##  8     2 alabama 0.0000922   
##  9     1 anyone  0.0000000295
## 10     2 anyone  0.00119     
## # … with 59,146 more rows

Since this is a tidy dataframe, we can manipulate it however we like, including making a visualization showing the highest probability words from each topic.

word_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  mutate(topic = paste("Topic", topic)) %>%
  ggplot(aes(beta, reorder_within(term, beta, topic), fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(topic), scales = "free_y") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_reordered() +
  labs(x = expression(beta), y = NULL)
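
The highest probability words overlap a lot between the two topics. The summary() output above also reported FREX words (which weigh frequency together with exclusivity) and lift, which often characterize topics better; recent versions of tidytext can tidy these matrices as well:

tidy(topic_model, matrix = "frex") %>%
  group_by(topic) %>%
  slice_head(n = 10)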

What about the other matrix, the document-topic probabilities? For this one, we also need to pass in the document_names.

observation_topics <- tidy(topic_model,
  matrix = "gamma",
  document_names = rownames(bigfoot_sparse)
)
observation_topics
## # A tibble: 9,904 × 3
##    document topic gamma
##    <chr>    <int> <dbl>
##  1 1            1 0.475
##  2 2            1 0.437
##  3 3            1 0.429
##  4 4            1 0.459
##  5 5            1 0.465
##  6 6            1 0.409
##  7 7            1 0.472
##  8 8            1 0.447
##  9 9            1 0.431
## 10 10           1 0.490
## # … with 9,894 more rows

Remember that each document (sighting report) was modeled as a mixture of topics. How did that turn out?

# Sample a few reports per topic, because there are far too many to plot them all
set.seed(123)
observation_topics_sample <- observation_topics %>%
    group_by(topic) %>%
    sample_n(10) %>%
    ungroup()
    
observation_topics_sample
## # A tibble: 20 × 3
##    document topic gamma
##    <chr>    <int> <dbl>
##  1 2463         1 0.481
##  2 2511         1 0.488
##  3 2227         1 0.437
##  4 526          1 0.429
##  5 4292         1 0.464
##  6 2986         1 0.444
##  7 1842         1 0.458
##  8 1142         1 0.375
##  9 3372         1 0.463
## 10 3447         1 0.456
## 11 4762         2 0.534
## 12 1627         2 0.515
## 13 2757         2 0.546
## 14 953          2 0.574
## 15 4445         2 0.564
## 16 1017         2 0.527
## 17 2013         2 0.584
## 18 2888         2 0.552
## 19 2567         2 0.568
## 20 1450         2 0.569
observation_topics_sample %>%
  mutate(
    obs_id = fct_reorder(document, gamma),
    topic = factor(topic)
  ) %>%
  ggplot(aes(gamma, topic, fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(obs_id), ncol = 4) +
  scale_x_continuous(expand = c(0, 0)) +
  labs(x = expression(gamma), y = "Topic")

Notice that every gamma here hovers around 0.5: each report is modeled as a fairly even mixture of the two topics, rather than belonging mostly to one topic or the other.
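
Since we chose K = 2 hoping to recover the two classes, a natural follow-up (a sketch; it assumes we had kept classification in the select() call when building bigfoot above) is to check whether each report’s dominant topic lines up with its classification:

observation_topics %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  mutate(obs_id = as.integer(document)) %>%
  left_join(bigfoot, by = "obs_id") %>%  # requires classification to be in bigfoot
  count(topic, classification)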

There is a TON more you can do with topic models. For example, we can take the trained topic model and, using some supplementary metadata on our documents, estimate regressions for the proportion of each document about a topic, with the metadata as the predictors. Here, let’s estimate regressions for our two topics with season as the predictor. This asks the question, “Do the topics in Bigfoot reports change with the season?”

effects <-
  estimateEffect(
    1:2 ~ season,
    topic_model,
    tidy_bigfoot %>% distinct(obs_id, season) %>% arrange(obs_id)
  )

summary(effects)
## 
## Call:
## estimateEffect(formula = 1:2 ~ season, stmobj = topic_model, 
##     metadata = tidy_bigfoot %>% distinct(obs_id, season) %>% 
##         arrange(obs_id))
## 
## 
## Topic 1:
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.4466015  0.0013966 319.783   <2e-16 ***
## seasonSpring  -0.0001051  0.0023138  -0.045    0.964    
## seasonSummer  -0.0007387  0.0019793  -0.373    0.709    
## seasonUnknown -0.0054839  0.0055372  -0.990    0.322    
## seasonWinter  -0.0001131  0.0024355  -0.046    0.963    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 2:
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.534e-01  1.397e-03 396.252   <2e-16 ***
## seasonSpring  6.018e-05  2.302e-03   0.026    0.979    
## seasonSummer  7.257e-04  1.970e-03   0.368    0.713    
## seasonUnknown 5.416e-03  5.475e-03   0.989    0.323    
## seasonWinter  6.973e-05  2.461e-03   0.028    0.977    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(effects)
## # A tibble: 10 × 6
##    topic term            estimate std.error statistic p.value
##    <int> <chr>              <dbl>     <dbl>     <dbl>   <dbl>
##  1     1 (Intercept)    0.447       0.00139  322.       0    
##  2     1 seasonSpring  -0.000129    0.00229   -0.0561   0.955
##  3     1 seasonSummer  -0.000728    0.00195   -0.373    0.709
##  4     1 seasonUnknown -0.00544     0.00549   -0.991    0.322
##  5     1 seasonWinter  -0.0000901   0.00245   -0.0368   0.971
##  6     2 (Intercept)    0.553       0.00139  397.       0    
##  7     2 seasonSpring   0.0000587   0.00231    0.0254   0.980
##  8     2 seasonSummer   0.000720    0.00195    0.369    0.712
##  9     2 seasonUnknown  0.00544     0.00550    0.991    0.322
## 10     2 seasonWinter   0.0000547   0.00245    0.0223   0.982

Looks like there is no statistical evidence of change in the topical content of the Bigfoot reports across the seasons!
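
One way to see this at a glance is to plot the tidied effect estimates with approximate 95% confidence intervals (a quick sketch):

tidy(effects) %>%
  filter(term != "(Intercept)") %>%
  mutate(topic = paste("Topic", topic)) %>%
  ggplot(aes(estimate, term)) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
  geom_pointrange(aes(
    xmin = estimate - 1.96 * std.error,
    xmax = estimate + 1.96 * std.error
  )) +
  facet_wrap(vars(topic)) +
  labs(x = "Estimated effect on topic proportion", y = NULL)

Every interval crosses zero, matching the regression tables above.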