TidyTuesday

Load the weekly Data

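Download the weekly data directly from the TidyTuesday GitHub repository with readr::read_csv(). The later chunks assume the tidyverse, scales, lubridate, tidytext, and tidylo packages are loaded.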

beyonce_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/beyonce_lyrics.csv')
taylor_swift_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv')
sales <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/sales.csv')
charts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/charts.csv')

Commentary: Here, he pulled the data from the internet. Each CSV file is read into its own data frame, named for what it holds: each artist's lyrics, the sales, and the charts.

beyonce_lyrics %>%
  count(song_name, sort = TRUE)

## # A tibble: 391 × 2
##    song_name                                              n
##    <chr>                                              <int>
##  1 Lemonade Film (Script)                               336
##  2 Beyoncé VMA's 2014                                   253
##  3 Get Me Bodied (Beyoncé Experience Live)              245
##  4 Get Me Bodied (Extended Mix)                         195
##  5 Get Me Bodied (Live)                                 195
##  6 Destiny's Child Medley                               176
##  7 Get Me Bodied (Timbaland Remix) (Ft. Julio Voltio)   168
##  8 Run the World (Girls) (Dave Audé Club Remix)         131
##  9 That's How You Like It (Ft. JAY-Z)                   122
## 10 Me, Myself and I (Beyoncé Experience Live)           118
## # … with 381 more rows

#A count of rows per Beyoncé song, sorted from most to fewest. The data has one row per lyric line. 391 songs.

taylor_swift_lyrics %>%
  count(Title, sort = TRUE)

## # A tibble: 132 × 2
##    Title                          n
##    <chr>                      <int>
##  1 22                             1
##  2 A Perfectly Good Heart         1
##  3 A Place in This World          1
##  4 Afterglow                      1
##  5 All Too Well                   1
##  6 All You Had to Do Was Stay     1
##  7 august                         1
##  8 Back to December               1
##  9 Bad Blood                      1
## 10 Begin Again                    1
## # … with 122 more rows

#A count of rows per TS song title. The data has one row per song (132 songs), so every count is 1 and the titles stay in alphabetical order.

taylor_swift_lyrics %>%
  count(Album, sort = TRUE)

## # A tibble: 8 × 2
##   Album            n
##   <chr>        <int>
## 1 Red             19
## 2 Fearless        18
## 3 Lover           18
## 4 Speak Now       17
## 5 folklore        16
## 6 1989            15
## 7 reputation      15
## 8 Taylor Swift    14

#TS albums sorted by the number of songs on each album. 8 albums.

beyonce_lyrics %>%
  count(artist_name, sort = TRUE)

## # A tibble: 1 × 2
##   artist_name     n
##   <chr>       <int>
## 1 Beyoncé     22616

#A count of rows per artist. Every row is one Beyoncé lyric line, so this is the total number of lines (22,616), not the number of words.

charts

## # A tibble: 140 × 8
##    artist       title        released re_release label formats chart chart_position
##    <chr>        <chr>        <chr>    <chr>      <chr> <chr>   <chr> <chr>         
##  1 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… US    5             
##  2 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… AUS   33            
##  3 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… CAN   14            
##  4 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… FRA   —             
##  5 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… GER   —             
##  6 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… IRE   59            
##  7 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… JPN   53            
##  8 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… NZ    38            
##  9 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… SWE   —             
## 10 Taylor Swift Taylor Swift October… March 18,… Big … CD, CD… UK    81            
## # … with 130 more rows

#The chart position of each album in each country. One row per album per chart. 10 countries.

sales %>%
  filter(title == "1989")

## # A tibble: 6 × 8
##   artist       title country    sales released         re_release label  formats
##   <chr>        <chr> <chr>      <dbl> <chr>            <chr>      <chr>  <chr>  
## 1 Taylor Swift 1989  WW      10100000 October 27, 2014 <NA>       Big M… CD, CD…
## 2 Taylor Swift 1989  US       6215000 October 27, 2014 <NA>       Big M… CD, CD…
## 3 Taylor Swift 1989  CAN       542000 October 27, 2014 <NA>       Big M… CD, CD…
## 4 Taylor Swift 1989  FRA        70000 October 27, 2014 <NA>       Big M… CD, CD…
## 5 Taylor Swift 1989  JPN       268200 October 27, 2014 <NA>       Big M… CD, CD…
## 6 Taylor Swift 1989  UK       1250000 October 27, 2014 <NA>       Big M… CD, CD…

#Sales across the world for the TS album 1989. After filtering by title, one row per country shows up. The country labels will be cleaned up later.

Commentary: In this chunk he counted the rows by song, album, and artist to see how each lyrics data set is laid out: one row per line for Beyoncé and one row per song for Taylor Swift. He also looked at the charts and sales tables that are used to compare the artists.

Looking at sales and charts

sales %>%
  filter(country == "US") %>%
  mutate(title = fct_reorder(title, sales)) %>%
  ggplot(aes(sales, title, fill = artist)) +
  geom_col() +
  scale_x_continuous(labels = dollar) +
  labs(x = "Sales (US)",
       y = "")

#Here he used the sales data. He filtered to the US rows (13 albums). He used mutate with fct_reorder to order the album titles by sales. Then he created a ggplot with sales on x, title on y, and fill by artist, drawn as columns with geom_col. scale_x_continuous(labels = dollar) formats the x-axis labels as dollar amounts. He labeled x as "Sales (US)" and left the y label blank because the album titles explain themselves. Taylor outsold Beyoncé by 3 albums.

sales %>%
  filter(country %in% c("World", "WW")) %>%
  mutate(title = fct_reorder(title, sales)) %>%
  ggplot(aes(sales, title, fill = artist)) +
  geom_col() +
  scale_x_continuous(labels = dollar) +
  labs(x = "Sales (World)",
       y = "")

#He did the same as above but with the worldwide sales. The data labels worldwide sales as both "WW" and "World", so he filtered on both to make sure neither version was left out.

charts %>%
  filter(chart == "US")

## # A tibble: 14 × 8
##    artist       title  released  re_release label formats   chart chart_position
##    <chr>        <chr>  <chr>     <chr>      <chr> <chr>     <chr> <chr>         
##  1 Taylor Swift Taylo… October … March 18,… Big … CD, CD+G… US    5             
##  2 Taylor Swift Fearl… November… October 2… Big … CD, CD+G… US    1             
##  3 Taylor Swift Speak… October … <NA>       Big … CD, CD+G… US    1             
##  4 Taylor Swift Red    October … <NA>       Big … CD, CD+G… US    1             
##  5 Taylor Swift 1989   October … <NA>       Big … CD, CD+G… US    1             
##  6 Taylor Swift Reput… November… <NA>       Big … CD, CD+G… US    1             
##  7 Taylor Swift Lover  August 2… <NA>       Repu… CD, LP, … US    1             
##  8 Taylor Swift Folkl… July 24,… <NA>       Repu… CD, LP, … US    1             
##  9 Beyoncé      Dange… June 23,… <NA>       Colu… CD, LP, … US    1             
## 10 Beyoncé      B'Day  Septembe… <NA>       Sony… CD, CD/D… US    1             
## 11 Beyoncé      I Am.… November… <NA>       Musi… CD, CD/D… US    1             
## 12 Beyoncé      4      June 24,… <NA>       Park… CD, digi… US    1             
## 13 Beyoncé      Beyon… December… <NA>       Park… CD, CD/D… US    1             
## 14 Beyoncé      Lemon… April 23… <NA>       Park… CD/DVD, … US    1

#He looked at the charts for one country. The chart column holds the country, labeled US here. filter is used to keep only the US rows.

Commentary: In this chunk he compared the artists' sales, first in the US and then worldwide. He filtered on two versions of the world label because both appear in the data set. The new feature here is scale_x_continuous(labels = dollar), which formats the labels on a continuous x-axis as dollar amounts.
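A minimal sketch of what `labels = dollar` does, assuming the scales package is installed. `dollar()` is an ordinary function that turns numbers into currency strings, and ggplot2 applies it to the axis breaks. The two sales figures are copied from the 1989 table above; the variable names are only for illustration.

```r
library(scales)

# Sales figures taken from the 1989 rows shown earlier (US and UK).
sales_values <- c(6215000, 1250000)

# dollar() formats numbers as currency strings. Passing it as
# `labels = dollar` makes ggplot2 call it on the axis breaks.
formatted <- dollar(sales_values)
formatted
```

Any function that maps a numeric vector to a character vector can be passed to `labels` in the same way.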

Text analysis of Taylor Swift

release_dates <- charts %>%
  distinct(album = title, released) %>%
  mutate(album = fct_recode(album,
                            folklore = "Folklore",
                            reputation = "Reputation")) %>%
  mutate(released = str_remove(released, " \\(.*")) %>%
  mutate(released = mdy(released))
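# This builds a table with one row per album and its release date. fct_recode changes Folklore and Reputation to lowercase so they match the album names in the lyrics data. str_remove drops the note in parentheses after the date, and mdy turns the text into a real date.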
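The date cleaning can be sketched on a single value, assuming the stringr and lubridate packages are installed. The raw string below is hypothetical; it only imitates the shape of the `released` column (a date followed by a note in parentheses).

```r
library(stringr)
library(lubridate)

# Hypothetical raw value: a date followed by a parenthetical note.
raw <- "October 24, 2006 (US)"

# Remove everything from the first " (" to the end of the string.
cleaned <- str_remove(raw, " \\(.*")

# Parse the remaining "Month day, year" text into a Date.
parsed <- mdy(cleaned)
parsed
```

Parsing month names depends on the session locale, so this sketch assumes an English locale.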

taylor_swift_words <- taylor_swift_lyrics %>%
  rename_all(str_to_lower) %>%
  select(-artist) %>%
  unnest_tokens(word, lyrics) %>%
  anti_join(stop_words, by = "word") %>%
  inner_join(release_dates, by = "album") %>%
  mutate(album = fct_reorder(album, released))
# Cleaning the names: rename_all with str_to_lower makes every column name lowercase.
# unnest_tokens splits the lyrics into words, one row per word. anti_join removes the stop words (common words like "the" and "and"). inner_join adds the release dates, and mutate orders the albums by release date.

Commentary: This chunk cleans the column names and splits the lyrics so that each word is in its own row. The splitting is done by unnest_tokens; anti_join is then used to remove the stop words.
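The two steps can be sketched on a toy example, assuming the dplyr and tidytext packages are installed. The one-row data frame and its names (`toy_lyrics`, `toy_words`, `toy_kept`) are made up for illustration; `stop_words` is the data set that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-song data frame with the same column names as the cleaned data.
toy_lyrics <- tibble(title = "Toy Song",
                     lyrics = "We are walking out of the woods")

# unnest_tokens() splits the lyrics into one lowercase word per row.
toy_words <- toy_lyrics %>%
  unnest_tokens(word, lyrics)

# anti_join() drops every row whose word appears in stop_words.
toy_kept <- toy_words %>%
  anti_join(stop_words, by = "word")

toy_kept
```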

taylor_swift_words %>%
  count(word, sort = TRUE) %>%
  head(25) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col()

#Counts how many times each word appears. head keeps the first 25. mutate with fct_reorder orders the words by their counts. The ggplot shows the most used words in Taylor Swift songs as columns.

Commentary: This chunk shows the 25 most used words in Taylor's lyrics across all albums and songs.

ts_tf_idf <- taylor_swift_words %>%
  count(album, word) %>%
  bind_tf_idf(word, album, n) %>%
  arrange(desc(tf_idf))
#He wants to find the words that set each album apart. TF-IDF is term frequency times inverse document frequency. Term frequency is how much of the album is this word. Inverse document frequency looks at how many albums contain the word: the fewer albums, the higher the value. Together they score how specific a word is to one album compared to the others. Example from the screencast: "woods" (as in "Out of the Woods") is about 2% of the words on its album, so it stands out for that album.
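The formula can be worked through by hand in base R. This is only a sketch on made-up counts for two hypothetical albums, not the real lyrics; it uses the natural log for the inverse document frequency, which is what `bind_tf_idf` does as well.

```r
# Toy word counts: two hypothetical albums, one row per album and word.
counts <- data.frame(
  album = c("A", "A", "B", "B"),
  word  = c("love", "woods", "love", "rain"),
  n     = c(8, 2, 5, 5)
)

# Term frequency: the share of the album's words made up by this word.
counts$tf <- counts$n / ave(counts$n, counts$album, FUN = sum)

# Inverse document frequency: ln(number of albums / albums containing the word).
# Each album-word pair appears once, so counting rows per word counts albums.
n_albums <- length(unique(counts$album))
albums_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_albums / albums_with_word)

counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

In this toy data "love" appears on both albums, so its idf and tf_idf are 0, while "rain" and "woods" each appear on one album and get positive scores.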

ts_tf_idf %>%
  group_by(album) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, album)) %>%
  ggplot(aes(tf_idf, word)) +
  geom_col() +
  facet_wrap(~ album, scales = "free_y") +
  scale_y_reordered()

#group_by album, then slice_max keeps the 10 rows with the highest tf_idf in each album. with_ties = FALSE means exactly 10 rows are kept per album even when words tie on tf_idf. ungroup removes the album grouping. mutate with reorder_within orders the words by tf_idf inside each album, and scale_y_reordered cleans up the axis labels. facet_wrap makes one panel per album. scales = "free_y" gives each panel its own y-axis, which is needed because each album has a different set of words.

Commentary: This chunk shows the words that are most specific to each album. slice_max is used so that each facet shows the 10 words with the highest tf_idf. tf_idf scores how frequent a word is in one album relative to how many albums use it.
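A small sketch of what `with_ties` changes in `slice_max`, assuming dplyr is installed. The `scores` table is made up and has a tie for the highest value.

```r
library(dplyr)

# Hypothetical scores with a tie for first place.
scores <- tibble(word = c("a", "b", "c", "d"),
                 score = c(3, 3, 2, 1))

# Default (with_ties = TRUE): tied rows are all kept,
# so asking for n = 1 can return more than one row.
with_ties <- scores %>% slice_max(score, n = 1)

# with_ties = FALSE: exactly n rows are returned.
without_ties <- scores %>% slice_max(score, n = 1, with_ties = FALSE)

nrow(with_ties)
nrow(without_ties)
```

This is why the plots use `with_ties = FALSE`: every album panel gets the same number of bars.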

ts_lo <- taylor_swift_words %>%
  count(album, word) %>%
  bind_log_odds(album, word, n) %>%
  arrange(desc(log_odds_weighted))
#bind_log_odds from tidylo. The set is album and the feature is word. Ex: how much more common is the word woods in this album compared to the others.

ts_lo %>%
  group_by(album) %>%
  slice_max(log_odds_weighted, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, log_odds_weighted, album)) %>%
  ggplot(aes(log_odds_weighted, word)) +
  geom_col() +
  facet_wrap(~ album, scales = "free_y") +
  scale_y_reordered()

#By changing this one to log_odds_weighted, we see the words that are more common in one album than in the other albums. (The data set is the same as in the chunk above, so most of the results are the same.)
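To make the idea of log odds concrete, here is a simplified sketch in base R with made-up counts. It computes a plain, unweighted log odds ratio for one word in one album against all other albums. This is not what `bind_log_odds` computes: the weighted version in tidylo also adjusts for how uncertain the estimate is, so its numbers will differ.

```r
# Made-up counts for one word (for example "woods").
n_album     <- 12     # times the word appears in the album of interest
total_album <- 3000   # all words in that album
n_other     <- 3      # times the word appears in all other albums
total_other <- 21000  # all words in the other albums

# Odds of drawing this word rather than any other word.
odds_album <- n_album / (total_album - n_album)
odds_other <- n_other / (total_other - n_other)

# Above 0: more common in this album. Below 0: more common elsewhere.
log_odds_ratio <- log(odds_album / odds_other)
log_odds_ratio
```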

filler <- c("ah", "uh", "ha", "ey", "eh", "eeh", "huh")
#A vector of common filler words to look for across all albums.

ts_lo %>%
  filter(word %in% filler) %>%
  mutate(word = reorder_within(word, n, album)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  facet_wrap(~ album, scales = "free_y") +
  scale_y_reordered() +
  labs(title = "The filler words in Taylor Swift lyrics have changed across albums",
       x = "# of appearances in album",
       y = "")

#In this one, he keeps only the filler words from the previous data set (filter with %in%). The x-axis is the number of appearances per album, which makes them easier to read. He uses geom_col and facet_wrap to separate the albums.

Commentary: The log odds chunks showed which words were most distinctive for each album. He saw there were some interesting filler words, so he filtered the data down to just the filler words used by TS and facet wrapped the counts by album.

ts <- taylor_swift_lyrics %>%
  rename_all(str_to_lower) %>%
  rename(song = title) %>%
  select(-album)
#For this set and the one below, he makes both data sets fit the same format so they can be compared. He lowercases the TS column names, renames title to song, and drops album.

beyonce <- beyonce_lyrics %>%
  select(artist = artist_name, song = song_name, lyrics = line)
#He renames artist_name to artist, song_name to song, and line to lyrics. This is because he will combine the TS and Beyoncé data.

artist_song_words_raw <- bind_rows(ts, beyonce) %>%
  unnest_tokens(word, lyrics) %>%
  count(artist, song, word)
#In this table, he combines the TS and Beyoncé lyrics with bind_rows. unnest_tokens splits the lyrics column into a word column, one row per word. count then gives one row per artist, song, and word, with n as the number of times the word appears in that song.

artist_song_words <- artist_song_words_raw %>%
  anti_join(stop_words, by = "word")
#anti_join removes the stop words, matching on word. Still one row per artist per song per word.

Commentary: This chunk combined Beyoncé's and Taylor's songs and words. First, the two data sets had to be put in the same format by selecting and renaming variables. unnest_tokens split the lyrics into words, and anti_join was used again to remove the stop words.
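The renaming step matters because `bind_rows` matches columns by name. A minimal sketch, assuming dplyr is installed; the two one-row tables are hypothetical stand-ins that only copy the column names of the real lyrics data.

```r
library(dplyr)

# Hypothetical stand-ins for the two lyric tables, with different column names.
ts_toy  <- tibble(artist = "Taylor Swift", title = "Song A", lyrics = "first line")
bey_toy <- tibble(artist_name = "Beyoncé", song_name = "Song B", line = "second line")

# Give both tables the same three column names.
ts_std  <- ts_toy %>% rename(song = title)
bey_std <- bey_toy %>% select(artist = artist_name, song = song_name, lyrics = line)

# With matching names, bind_rows() stacks the rows into three columns.
combined <- bind_rows(ts_std, bey_std)
combined
```

Without the renaming, `bind_rows` would keep all six columns and fill the gaps with NA.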

Compare overall, not per song

#lollipop graphs

by_artist_word <- artist_song_words %>%
  group_by(artist, word) %>%
  summarize(num_songs = n(),
            num_words = sum(n)) %>%
  mutate(pct_words = num_words / sum(num_words)) %>%
  group_by(word) %>%
  mutate(num_words_total = sum(num_words)) %>%
  ungroup()
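#Summarize by artist and word: num_songs is the number of songs the word appears in, and num_words is the total number of times it is used. After summarize the data is still grouped by artist, so pct_words is the number of words divided by the sum of that artist's words. group_by word then gives num_words_total, the count across both artists.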

word_differences <- by_artist_word %>%
  bind_log_odds(artist, word, num_words) %>%
  arrange(desc(abs(log_odds_weighted))) %>%
  filter(artist == "Beyoncé") %>%
  slice_max(num_words_total, n = 100, with_ties = FALSE) %>%
  slice_max(abs(log_odds_weighted), n = 25, with_ties = FALSE) %>%
  mutate(word = fct_reorder(word, log_odds_weighted)) %>%
  mutate(direction = ifelse(log_odds_weighted > 0, "Beyoncé", "Taylor Swift"))
#bind_log_odds gives the weighted log odds of how much more one artist uses a word than the other. The rows are arranged in descending order of the absolute weighted log odds. filter keeps only the Beyoncé rows, so each word appears once. The first slice_max keeps the 100 most used words across both artists, and the second (n = 25) keeps the 25 of those that separate the artists the most. mutate with fct_reorder orders the words by log odds. direction is Beyoncé when the weighted log odds is greater than 0 and Taylor Swift otherwise.

word_differences %>%
  ggplot(aes(log_odds_weighted, word, fill = direction)) +
  geom_col() +
  scale_x_continuous(breaks = log(2 ^ seq(-6, 9, 3)),
                     labels = paste0(2 ^ abs(seq(-6, 9, 3)), "X")) +
  labs(x = "Relative use in Beyoncé vs Taylor Swift (weighted)",
       y = "",
       title = "Which words most distinguish Beyoncé and Taylor Swift songs?",
       subtitle = "Among the 100 words most used by the artists (combined)",
       fill = "")

#He used the previous table to make a bar graph of the weighted log odds by word, filled by the direction from the last step. He sets the title and subtitle but leaves the y and fill labels blank to
#make it cleaner. scale_x_continuous places the breaks at log(2^k) for k = -6, -3, 0, 3, 6, 9 and labels them 64X, 8X, 1X, 8X, 64X, 512X.

x_labels <- paste0(2 ^ abs(seq(-6, 9, 3)), "X")
x_labels <- ifelse(x_labels == "1X", "Same", x_labels)
#Builds the same labels as above (64X, 8X, 1X, and so on) to make the graph neater, and replaces "1X" with "Same", which marks the midline.
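The break and label arithmetic can be checked on its own in base R. The `exponents` and `breaks` names are only for illustration; the expressions are the ones used in the plots.

```r
# Exponents used for the axis: every third power of two from 2^-6 to 2^9.
exponents <- seq(-6, 9, 3)

# Break positions on the log scale; log(2^k) equals k * log(2).
breaks <- log(2 ^ exponents)

# abs() makes the labels show the size of the ratio in either direction.
x_labels <- paste0(2 ^ abs(exponents), "X")

# The middle label, 2^0 = 1, is replaced with "Same".
x_labels <- ifelse(x_labels == "1X", "Same", x_labels)
x_labels
```

The result is "64X", "8X", "Same", "8X", "64X", "512X": the labels are symmetric around the midline because of `abs()`, and the fill or color tells the reader which artist each side belongs to.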

word_differences %>%
  ggplot(aes(log_odds_weighted, word)) +
  geom_col(width = .1) +
  geom_point(aes(size = num_words_total, color = direction)) +
  geom_vline(lty = 2, xintercept = 0) +
  scale_x_continuous(breaks = log(2 ^ seq(-6, 9, 3)),
                     labels = x_labels) +
  labs(x = "Relative use in Beyoncé vs Taylor Swift (weighted)",
       y = "",
       title = "Which words most distinguish Beyoncé and Taylor Swift songs?",
       subtitle = "Among the 100 words most used by the artists (combined)",
       color = "",
       size = "# of words\n(both artists)")

#Same data as above. Color shows direction, and point size shows the total number of times the word is used by both artists. This graph shows not only which words are more typical of one artist but also how common each word is overall, using point size to tell the difference. This again uses the weighted log odds. geom_vline draws the dashed middle line at 0.

Commentary: This chunk made lollipop graphs to compare the most distinctive of the popular words from Beyoncé and TS. bind_log_odds gives the weighted log odds of the words. geom_vline draws the midline that separates the two artists.

comparison <- by_artist_word %>%
  select(artist, word, pct_words, num_words_total) %>%
  pivot_wider(names_from = artist,
              values_from = pct_words,
              values_fill = list(pct_words = 0)) %>%
  janitor::clean_names() %>%
  slice_max(num_words_total, n = 200, with_ties = FALSE)
#Selects artist, word, pct_words, and num_words_total. pivot_wider makes one column per artist, filled with pct_words, and fills missing values with 0. janitor::clean_names cleans the column names. slice_max keeps the 200 words with the highest num_words_total, with no ties.
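A toy version of the reshaping step, assuming dplyr and tidyr are installed. The `long` table and its percentages are made up; the point is only to show how `pivot_wider` handles a word that one artist never uses.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table: "halo" appears for only one artist.
long <- tibble(artist = c("Taylor Swift", "Beyoncé", "Beyoncé"),
               word = c("love", "love", "halo"),
               pct_words = c(0.02, 0.03, 0.01))

# One row per word and one column per artist.
# Missing artist-word combinations are filled with 0 instead of NA.
wide <- long %>%
  pivot_wider(names_from = artist,
              values_from = pct_words,
              values_fill = 0)

wide
```

In the real chunk, `janitor::clean_names()` then turns the artist columns into `taylor_swift` and `beyonce`, which is why the plots refer to them by those names.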

comparison %>%
  ggplot(aes(taylor_swift, beyonce)) +
  geom_abline(color = "red") +
  geom_point() +
  geom_text(aes(label = word), vjust = 1, hjust = 1, check_overlap = TRUE) +
  scale_x_log10(labels = percent) +
  scale_y_log10(labels = percent)

#The same comparison as the lollipop graphs, but shown as a scatterplot of the 200 most used words for TS and Beyoncé. geom_abline draws the midline in red; words on the line are used equally by both artists. With this graph we also see the words they have in common.
#Ex: love, baby, wanna, and yeah are used the most by both TS and Beyoncé. We also see words that are more common for one artist, like halo and ah.

comparison %>%
  ggplot(aes(num_words_total, beyonce / taylor_swift)) +
  geom_hline(yintercept = 1) +
  geom_point() +
  geom_text(aes(label = word), vjust = 1, hjust = 1, check_overlap = TRUE) +
  scale_x_log10(labels = comma) +  # x is a raw word count, so comma labels fit better than percent
  scale_y_log10()

#The y-axis is now Beyoncé divided by TS. geom_abline is no longer what is needed; geom_hline at y = 1 is now the dividing line. Words above the line are used more by Beyoncé, and words below the line are used more by TS. The further from the line, the bigger the difference.

Commentary: This chunk shows another way to compare the most frequently used words per artist. In this one, the words closer to the line are used about equally by both artists, and the words further from the midline are used mostly by one artist. geom_hline was used to draw the reference line at a ratio of 1 on the y-axis.

We could have done:

Sentiment analysis: who is happier, what albums are happier?
Supervised learning: can you distinguish a Taylor Swift song from a Beyoncé song?

Tidy Tuesday: Taylor Swift and Beyonce Lyrics

2020-09-29
