Our modeling goal is to “discover” topics in the lyrics of Spice Girls songs. Instead of a supervised or predictive model where our observations have labels, this is an unsupervised approach.
https://juliasilge.com/blog/spice-girls/
equivalent terms between bigfoot and spice girls data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidytext)
bigfoot_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-13/bigfoot.csv')
## Rows: 5021 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): observed, location_details, county, state, season, title, classif...
## dbl (17): latitude, longitude, number, temperature_high, temperature_mid, t...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bigfoot_raw %>%
count(classification)
## # A tibble: 3 × 2
## classification n
## <chr> <int>
## 1 Class A 2481
## 2 Class B 2510
## 3 Class C 30
bigfoot <-
bigfoot_raw %>%
filter(classification != "Class C", !is.na(observed)) %>%
mutate(
classification = case_when(
classification == "Class A" ~ "sighting",
classification == "Class B" ~ "possible"
)
) %>%
# Create a new variable, observation id, equivalent of song_name in the Spice Girls case
mutate(obs_id = row_number()) %>%
select(obs_id, observed, season, county, state)
bigfoot
## # A tibble: 4,953 × 5
## obs_id observed season county state
## <int> <chr> <chr> <chr> <chr>
## 1 1 "I was canoeing on the Sipsey river in Alabama. I… Summer Winst… Alab…
## 2 2 "Ed L. was salmon fishing with a companion in Pri… Fall Valde… Alas…
## 3 3 "While attending U.R.I in the Fall of 1974,I woul… Fall Washi… Rhod…
## 4 4 "Hello, My name is Doug and though I am very relu… Summer York … Penn…
## 5 5 "It was May 1984. Two friends and I were up in th… Spring Yamhi… Oreg…
## 6 6 "My two children and I were returning from Altus,… Fall Washi… Okla…
## 7 7 "I was staying the night with a friends of mine. … Summer Washi… Ohio
## 8 8 "Well last year I was night fishing 9/2010 at Ama… Fall Westc… New …
## 9 9 "I grew up in Northwestern Nevada along the Calif… Fall Washo… Neva…
## 10 10 "heh i kinda feel a little dumb that im reporting… Fall Warre… New …
## # … with 4,943 more rows
tidy_bigfoot <-
bigfoot %>%
unnest_tokens(word, observed) %>%
anti_join(get_stopwords())
## Joining, by = "word"
tidy_bigfoot %>%
count(word, sort = TRUE)
## # A tibble: 29,578 × 2
## word n
## <chr> <int>
## 1 like 7735
## 2 back 7675
## 3 heard 7369
## 4 saw 6042
## 5 just 5925
## 6 road 5564
## 7 see 5344
## 8 area 5273
## 9 one 5253
## 10 around 5139
## # … with 29,568 more rows
tidy_bigfoot %>%
count(obs_id, word, sort = TRUE)
## # A tibble: 630,183 × 3
## obs_id word n
## <int> <chr> <int>
## 1 2673 subject 54
## 2 4172 creature 49
## 3 470 mr 44
## 4 470 r 42
## 5 585 ron 37
## 6 767 finger 34
## 7 276 heard 30
## 8 470 boat 30
## 9 585 scott 30
## 10 4518 tom 30
## # … with 630,173 more rows
To train a topic model with the stm package, we need to create a sparse matrix from our tidy dataframe of tokens.
bigfoot_sparse <-
tidy_bigfoot %>%
count(obs_id, word) %>%
cast_sparse(obs_id, word, n)
dim(bigfoot_sparse)
## [1] 4952 29578
This means there are 31 songs (i.e. documents) and different tokens (i.e. terms or words) in our dataset for modeling.
A topic model like this one models:
The most important parameter when training a topic modeling is K, the number of topics. This is like k in k-means in that it is a hyperparamter of the model and we must choose this value ahead of time. We could try multiple different values to find the best value for K, but this is a very small dataset so let’s just stick with K = 4.
library(stm)
## Warning: package 'stm' was built under R version 4.2.3
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
set.seed(123)
# Choose 2 for K because we know it has two classes
topic_model <- stm(bigfoot_sparse, K = 2, verbose = FALSE)
## Warning in stm(bigfoot_sparse, K = 2, verbose = FALSE): K=2 is equivalent to a
## unidimensional scaling model which you may prefer.
To get a quick view of the results, we can use summary().
summary(topic_model)
## A topic model with 2 topics, 4952 documents and a 29578 word dictionary.
## Topic 1 Top Words:
## Highest Prob: back, just, saw, around, us, looked, two
## FREX: us, just, two, right, know, animal, bear
## Lift: us, right, bear, animal, know, just, two
## Score: just, us, around, two, right, know, get
## Topic 2 Top Words:
## Highest Prob: like, heard, road, see, one, time, area
## FREX: road, one, like, see, time, something, went
## Lift: road, one, time, see, something, like, went
## Score: like, road, see, one, time, something, went
To explore more deeply, we can tidy() the topic model results to get a dataframe that we can compute on. There are two possible outputs for this topic model, the “beta” matrix of topic-word probabilities and the “gamma” matrix of document-topic probabilities. Let’s start with the first.
word_topics <- tidy(topic_model, matrix = "beta")
word_topics
## # A tibble: 59,156 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 2010 0.0000000242
## 2 2 2010 0.000111
## 3 1 60 0.000531
## 4 2 60 0.00000326
## 5 1 75 0.000240
## 6 2 75 0.000187
## 7 1 alabama 0.0000000334
## 8 2 alabama 0.0000922
## 9 1 anyone 0.0000000295
## 10 2 anyone 0.00119
## # … with 59,146 more rows
Since this is a tidy dataframe, we can manipulate it how we like, include making a visualization showing the highest probability words from each topic.
word_topics %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
mutate(topic = paste("Topic", topic)) %>%
ggplot(aes(beta, reorder_within(term, beta, topic), fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(topic), scales = "free_y") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_reordered() +
labs(x = expression(beta), y = NULL)
What about the other matrix? We also need to pass in the document_names.
observation_topics <- tidy(topic_model,
matrix = "gamma",
document_names = rownames(bigfoot_sparse)
)
observation_topics
## # A tibble: 9,904 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1 1 0.475
## 2 2 1 0.437
## 3 3 1 0.429
## 4 4 1 0.459
## 5 5 1 0.465
## 6 6 1 0.409
## 7 7 1 0.472
## 8 8 1 0.447
## 9 9 1 0.431
## 10 10 1 0.490
## # … with 9,894 more rows
Remember that each document (song) was modeled as a mixture of topics. How did that turn out?
# Let's sample a few for each topic (or classification) b/c there are too many to plot
set.seed(123)
observation_topics_sample <- observation_topics %>%
group_by(topic) %>%
sample_n(10) %>%
ungroup()
observation_topics_sample
## # A tibble: 20 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 2463 1 0.481
## 2 2511 1 0.488
## 3 2227 1 0.437
## 4 526 1 0.429
## 5 4292 1 0.464
## 6 2986 1 0.444
## 7 1842 1 0.458
## 8 1142 1 0.375
## 9 3372 1 0.463
## 10 3447 1 0.456
## 11 4762 2 0.534
## 12 1627 2 0.515
## 13 2757 2 0.546
## 14 953 2 0.574
## 15 4445 2 0.564
## 16 1017 2 0.527
## 17 2013 2 0.584
## 18 2888 2 0.552
## 19 2567 2 0.568
## 20 1450 2 0.569
observation_topics_sample %>%
mutate(
obs_id = fct_reorder(document, gamma),
topic = factor(topic)
) %>%
ggplot(aes(gamma, topic, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(obs_id), ncol = 4) +
scale_x_continuous(expand = c(0, 0)) +
labs(x = expression(gamma), y = "Topic")
The songs near the top of this plot are mostly one topic, while the songs near the bottom are more a mix.
There is a TON more you can do with topic models. For example, we can take the trained topic model and, using some supplementary metadata on our documents, estimate regressions for the proportion of each document about a topic with the metadata as the predictors. For example, let’s estimate regressions for our four topics with the album name as the predictor. This asks the question, “Do the topics in Spice Girls songs change across albums?”
effects <-
estimateEffect(
1:2 ~ season,
topic_model,
tidy_bigfoot %>% distinct(obs_id, season) %>% arrange(obs_id)
)
summary(effects)
##
## Call:
## estimateEffect(formula = 1:2 ~ season, stmobj = topic_model,
## metadata = tidy_bigfoot %>% distinct(obs_id, season) %>%
## arrange(obs_id))
##
##
## Topic 1:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4466015 0.0013966 319.783 <2e-16 ***
## seasonSpring -0.0001051 0.0023138 -0.045 0.964
## seasonSummer -0.0007387 0.0019793 -0.373 0.709
## seasonUnknown -0.0054839 0.0055372 -0.990 0.322
## seasonWinter -0.0001131 0.0024355 -0.046 0.963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 2:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.534e-01 1.397e-03 396.252 <2e-16 ***
## seasonSpring 6.018e-05 2.302e-03 0.026 0.979
## seasonSummer 7.257e-04 1.970e-03 0.368 0.713
## seasonUnknown 5.416e-03 5.475e-03 0.989 0.323
## seasonWinter 6.973e-05 2.461e-03 0.028 0.977
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(effects)
## # A tibble: 10 × 6
## topic term estimate std.error statistic p.value
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 (Intercept) 0.447 0.00139 322. 0
## 2 1 seasonSpring -0.000129 0.00229 -0.0561 0.955
## 3 1 seasonSummer -0.000728 0.00195 -0.373 0.709
## 4 1 seasonUnknown -0.00544 0.00549 -0.991 0.322
## 5 1 seasonWinter -0.0000901 0.00245 -0.0368 0.971
## 6 2 (Intercept) 0.553 0.00139 397. 0
## 7 2 seasonSpring 0.0000587 0.00231 0.0254 0.980
## 8 2 seasonSummer 0.000720 0.00195 0.369 0.712
## 9 2 seasonUnknown 0.00544 0.00550 0.991 0.322
## 10 2 seasonWinter 0.0000547 0.00245 0.0223 0.982
Looks like there is no statistical evidence of change in the lyrical content of the Spice Girls songs across these three albums!