CodeAlong10

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

lyrics <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-14/lyrics.csv")

## Rows: 1885 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): artist_name, album_name, song_name, section_name, line, section_artist
## dbl (3): track_number, song_id, line_number
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

lyrics %>% distinct(album_name)

## # A tibble: 3 × 1
##   album_name
##   <chr>     
## 1 Spice     
## 2 Spiceworld
## 3 Forever

lyrics %>% distinct(album_name, song_name)

## # A tibble: 31 × 2
##    album_name song_name                 
##    <chr>      <chr>                     
##  1 Spice      "Wannabe"                 
##  2 Spice      "Say You\x92ll Be There"  
##  3 Spice      "2 Become 1"              
##  4 Spice      "Love Thing"              
##  5 Spice      "Last Time Lover"         
##  6 Spice      "Mama"                    
##  7 Spice      "Who Do You Think You Are"
##  8 Spice      "Something Kinda Funny"   
##  9 Spice      "Naked"                   
## 10 Spice      "If U Can\x92t Dance"     
## # ℹ 21 more rows

library(tidytext)

tidy_lyrics <-
  lyrics %>%
  mutate(song_name = str_replace_all(song_name, "\x92", "'")) %>%
  unnest_tokens(word, line) %>%
  anti_join(get_stopwords())

## Joining with `by = join_by(word)`

tidy_lyrics %>%
  count(word, sort = TRUE)

## # A tibble: 979 × 2
##    word      n
##    <chr> <int>
##  1 get     153
##  2 love    137
##  3 know    124
##  4 time    106
##  5 wanna   102
##  6 never   101
##  7 oh       88
##  8 yeah     88
##  9 la       85
## 10 got      82
## # ℹ 969 more rows

tidy_lyrics %>%
  count(song_name, word, sort = TRUE)

## # A tibble: 2,206 × 3
##    song_name                       word           n
##    <chr>                           <chr>      <int>
##  1 Saturday Night Divas            get           91
##  2 Spice Up Your Life              la            64
##  3 If U Can't Dance                dance         60
##  4 Holler                          holler        48
##  5 Never Give Up on the Good Times never         47
##  6 Move Over                       generation    41
##  7 Saturday Night Divas            deeper        41
##  8 Move Over                       yeah          39
##  9 Something Kinda Funny           got           39
## 10 Never Give Up on the Good Times give          38
## # ℹ 2,196 more rows

Train a topic model

lyrics_sparse <-
  tidy_lyrics %>%
  count(song_name, word) %>%
  cast_sparse(song_name, word, n)

dim(lyrics_sparse)

## [1]  31 979

library(stm)

## stm v1.3.6.1 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com

set.seed(123)
topic_model <- stm(lyrics_sparse, K = 4, verbose = FALSE)

summary(topic_model)

## A topic model with 4 topics, 31 documents and a 979 word dictionary.

## Topic 1 Top Words:
##       Highest Prob: get, wanna, time, night, right, deeper, come 
##       FREX: deeper, saturday, comin, get, lover, night, last 
##       Lift: achieve, saying, tonight, another, anyway, blameless, breaking 
##       Score: deeper, saturday, lover, get, wanna, night, comin 
## Topic 2 Top Words:
##       Highest Prob: dance, yeah, generation, know, next, love, naked 
##       FREX: next, naked, denying, foolin, nobody, wants, meant 
##       Lift: admit, bein, check, d'ya, defeat, else, foolin 
##       Score: next, naked, dance, generation, denying, foolin, nobody 
## Topic 3 Top Words:
##       Highest Prob: got, holler, make, love, oh, something, play 
##       FREX: holler, kinda, swing, funny, yay, use, trust 
##       Lift: anyone, bottom, driving, fantasy, follow, hoo, long 
##       Score: holler, swing, kinda, funny, yay, driving, loving 
## Topic 4 Top Words:
##       Highest Prob: la, never, love, give, time, know, way 
##       FREX: times, tried, swear, la, bring, promise, viva 
##       Lift: able, certain, love's, rely, affection, shy, replace 
##       Score: la, times, swear, shake, viva, chicas, front

Explore topic model results

word_topics <- tidy(topic_model, matrix = "beta")
word_topics

## # A tibble: 3,916 × 3
##    topic term        beta
##    <int> <chr>      <dbl>
##  1     1 achieve 1.66e- 3
##  2     2 achieve 2.14e-21
##  3     3 achieve 1.75e-49
##  4     4 achieve 5.18e-36
##  5     1 baby    1.20e- 2
##  6     2 baby    1.44e- 2
##  7     3 baby    1.29e-15
##  8     4 baby    5.04e- 3
##  9     1 back    1.94e- 2
## 10     2 back    5.49e- 4
## # ℹ 3,906 more rows

word_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  mutate(topic = paste("Topic", topic)) %>%
  ggplot(aes(beta, reorder_within(term, beta, topic), fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(topic), scales = "free_y") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_reordered() +
  labs(x = expression(beta), y = NULL)

song_topics <- tidy(topic_model,
  matrix = "gamma",
  document_names = rownames(lyrics_sparse)
)
song_topics

## # A tibble: 124 × 3
##    document                   topic    gamma
##    <chr>                      <int>    <dbl>
##  1 2 Become 1                     1 0.714   
##  2 Denying                        1 0.00163 
##  3 Do It                          1 0.996   
##  4 Get Down With Me               1 0.947   
##  5 Goodbye                        1 0.00106 
##  6 Holler                         1 0.00103 
##  7 If U Can't Dance               1 0.000942
##  8 If You Wanna Have Some Fun     1 0.00722 
##  9 Last Time Lover                1 0.998   
## 10 Let Love Lead the Way          1 0.00175 
## # ℹ 114 more rows

song_topics %>%
  mutate(
    song_name = fct_reorder(document, gamma),
    topic = factor(topic)
  ) %>%
  ggplot(aes(gamma, topic, fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(song_name), ncol = 4) +
  scale_x_continuous(expand = c(0, 0)) +
  labs(x = expression(gamma), y = "Topic")

effects <-
  estimateEffect(
    1:4 ~ album_name,
    topic_model,
    tidy_lyrics %>% distinct(song_name, album_name) %>% arrange(song_name)
  )

summary(effects)

## 
## Call:
## estimateEffect(formula = 1:4 ~ album_name, stmobj = topic_model, 
##     metadata = tidy_lyrics %>% distinct(song_name, album_name) %>% 
##         arrange(song_name))
## 
## 
## Topic 1:
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)            0.1787     0.1312   1.362    0.184
## album_nameSpice        0.1199     0.1892   0.634    0.531
## album_nameSpiceworld   0.1139     0.1862   0.612    0.546
## 
## 
## Topic 2:
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)            0.1444     0.1325   1.090    0.285
## album_nameSpice        0.1357     0.1879   0.722    0.476
## album_nameSpiceworld   0.1486     0.1846   0.805    0.427
## 
## 
## Topic 3:
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           0.27150    0.12085   2.247   0.0327 *
## album_nameSpice       0.01954    0.16752   0.117   0.9080  
## album_nameSpiceworld -0.25776    0.16700  -1.543   0.1339  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 4:
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           0.405559   0.140820   2.880  0.00754 **
## album_nameSpice      -0.273207   0.202200  -1.351  0.18746   
## album_nameSpiceworld -0.007134   0.194246  -0.037  0.97096   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tidy(effects)

## # A tibble: 12 × 6
##    topic term                 estimate std.error statistic p.value
##    <int> <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
##  1     1 (Intercept)           0.177       0.132    1.34   0.190  
##  2     1 album_nameSpice       0.120       0.189    0.633  0.532  
##  3     1 album_nameSpiceworld  0.115       0.188    0.608  0.548  
##  4     2 (Intercept)           0.145       0.133    1.09   0.283  
##  5     2 album_nameSpice       0.135       0.187    0.722  0.476  
##  6     2 album_nameSpiceworld  0.150       0.185    0.813  0.423  
##  7     3 (Intercept)           0.272       0.120    2.26   0.0316 
##  8     3 album_nameSpice       0.0167      0.167    0.100  0.921  
##  9     3 album_nameSpiceworld -0.259       0.166   -1.57   0.129  
## 10     4 (Intercept)           0.404       0.140    2.89   0.00739
## 11     4 album_nameSpice      -0.273       0.196   -1.39   0.175  
## 12     4 album_nameSpiceworld -0.00502     0.193   -0.0260 0.979

1.The goal of the analysis is to apply topic modeling to a collection of song lyrics to identify recurring themes across songs and understand the distribution of these themes within the albums.

The data contains 1,885 rows with song lyrics from three albums. It includes nine columns, detailing the artist name, album name, track number, song name, song ID, section name, line number, line of lyrics, and section artist. This dataset is used to examine the lyrical content of songs.

The key variables used in the analysis are ‘album_name’, ‘song_name’, and ‘word’. ‘Album_name’ and ‘song_name’ are categorical variables representing the albums and songs, respectively. ‘Word’ is derived from the ‘line’ of lyrics and represents individual words in the lyrics, which are used to detect and group topics.

2 The original data consists of lyrics with associated metadata. For modeling, this data is transformed by:

Removing duplicate entries for albums and songs. Replacing certain characters in song names for consistency. Tokenizing lyrics into individual words. Removing stopwords to focus on more meaningful words. The transformed data is then put into a sparse matrix format suitable for topic modeling, which is essential because the model requires numerical input to analyze text data.

3 The model used in the analysis is a Structural Topic Model (STM), a machine learning technique that is capable of discovering the hidden thematic structure within a large collection of documents.

The selection of the optimal value for ‘k’, the number of topics, is not explicitly described in the provided information. Typically, the choice of ‘k’ would be determined through methods like cross-validation, evaluating model fit, or expert judgment. In this analysis, ‘k’ is set to 4.

4 The analysis revealed four distinct topics within the song lyrics. The STM provided a list of words most associated with each topic, which can be interpreted to understand the themes each topic represents. It also examined the prevalence of these topics across different albums. The model’s findings could assist in understanding the thematic focus of the artists during different creative periods represented by each album.

CodeAlong10

Anton Jellvik

2023-11-07

Train a topic model

Explore topic model results