Download the weekly data and make it available in the tt object.
# Packages used throughout this analysis
library(tidytuesdayR)
library(tidyverse)
library(scales)
library(lubridate)
tt <- tt_load("2020-09-29")
## --- Compiling #TidyTuesday Information for 2020-09-29 ----
## --- There are 4 files available ---
## --- Starting Download ---
##
## Downloading file 1 of 4: `beyonce_lyrics.csv`
## Downloading file 2 of 4: `taylor_swift_lyrics.csv`
## Downloading file 3 of 4: `sales.csv`
## Downloading file 4 of 4: `charts.csv`
## --- Download complete ---
beyonce_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/beyonce_lyrics.csv')
## Parsed with column specification:
## cols(
## line = col_character(),
## song_id = col_double(),
## song_name = col_character(),
## artist_id = col_double(),
## artist_name = col_character(),
## song_line = col_double()
## )
taylor_swift_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv')
## Parsed with column specification:
## cols(
## Artist = col_character(),
## Album = col_character(),
## Title = col_character(),
## Lyrics = col_character()
## )
sales <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/sales.csv')
## Parsed with column specification:
## cols(
## artist = col_character(),
## title = col_character(),
## country = col_character(),
## sales = col_double(),
## released = col_character(),
## re_release = col_character(),
## label = col_character(),
## formats = col_character()
## )
charts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/charts.csv')
## Parsed with column specification:
## cols(
## artist = col_character(),
## title = col_character(),
## released = col_character(),
## re_release = col_character(),
## label = col_character(),
## formats = col_character(),
## chart = col_character(),
## chart_position = col_character()
## )
There was a bug, so we had to read the data in manually with readr::read_csv().
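When tt_load() works, the same data frames can also be pulled straight from the tt object instead of re-downloading the raw CSVs; a minimal sketch, assuming the download above succeeded:
# Each file listed in the download log is available by name on the tt object
beyonce_lyrics <- tt$beyonce_lyrics
taylor_swift_lyrics <- tt$taylor_swift_lyrics
sales <- tt$sales
charts <- tt$charts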
beyonce_lyrics %>%
count(song_name, sort = TRUE)
taylor_swift_lyrics %>%
count(Title, sort = TRUE)
taylor_swift_lyrics %>%
count(Album, sort = TRUE)
beyonce_lyrics %>%
count(artist_name, sort = TRUE)
#charts %>%
#View()
sales %>%
count(artist, title, sort = TRUE)
sales %>%
filter(title == "1989")
On David Robinson's end, the formatting looked different when comparing Beyoncé and Taylor Swift: Beyoncé's output had two columns while Taylor's had four. On my end it was the same, with two columns for each artist. When looking at Taylor's albums I was given a tibble, while David was given more of a chart. When looking at charts we see one row per album, so we adjusted the code to look at the data more closely, which gave me a long list with release dates and the countries as well. When looking at sales, David was given a more in-depth look than I could get on my end.
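A quick way to compare the structure of the two lyrics tables directly is dplyr's glimpse(); a small sketch, not part of the original screencast:
# Column names and types for each lyrics table
glimpse(beyonce_lyrics)
glimpse(taylor_swift_lyrics)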
sales %>%
count(country, sort =TRUE, wt = sales) %>%
filter(country == "US")
sales %>%
filter(country == "US") %>%
mutate(title = fct_reorder(title, sales)) %>%
ggplot(aes(sales, title, fill = artist)) +
geom_col() +
scale_x_continuous(labels = dollar) +
labs(x = "Sales (US)")
sales %>%
filter(country %in% c("World", "WW")) %>%
mutate(title = fct_reorder(title, sales)) %>%
ggplot(aes(sales, title, fill = artist)) +
geom_col() +
scale_x_continuous(labels = dollar) +
labs(x = "Sales (World)", y = "")
#charts %>%
#filter(chart == "US") %>%
#View()
Here we looked at sales in the US, filtered to that country and colored by artist, and we saw that Taylor Swift outsold Beyoncé a little in US dollar sales. Looking worldwide, Fearless was the top-selling album of the two artists, followed by Dangerously in Love. With charts we look at the US only.
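To put a single number on the US comparison, one option (a sketch, not from the screencast) is to total the US sales for each artist:
# Total reported US sales per artist, formatted as dollars
sales %>%
  filter(country == "US") %>%
  group_by(artist) %>%
  summarize(total_sales = sum(sales, na.rm = TRUE)) %>%
  mutate(total_sales = dollar(total_sales))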
library(tidytext)
taylor_swift_lyrics %>%
rename_all(str_to_lower) %>%
select(-artist) %>%
unnest_tokens(word, lyrics) %>%
anti_join(stop_words, by = "word") %>%
count(word, sort = TRUE)
We started by tokenizing the lyrics into individual words and removing stop words; the most common remaining words were love, time, wanna, baby, etc. Now we are going to compare the words used by the artists.
released_dates <- charts %>%
distinct(album = title, released) %>%
mutate(album = fct_recode(album, folklore = "Folklore", reputation = "Reputation")) %>%
mutate(released = str_remove(released, " \\(.*")) %>%
mutate(released = mdy(released))
taylor_swift_words <- taylor_swift_lyrics %>%
rename_all(str_to_lower) %>%
select(-artist) %>%
unnest_tokens(word, lyrics) %>%
anti_join(stop_words, by = "word") %>%
inner_join(released_dates, by = "album") %>%
mutate(album = fct_reorder(album, released))
I came back afterwards to add the charts data, which gave me the release dates of the albums for both artists.
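Since mdy() returns NA for any date string it cannot parse, a quick sanity check on released_dates is worth running (a sketch, not from the screencast):
# Any album whose release string failed to parse shows up here
released_dates %>%
  filter(is.na(released))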
taylor_swift_words %>%
count(word, sort = TRUE) %>%
head(25) %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(n, word)) +
geom_col()
Here are the most common words for Taylor Swift.
ts_tf_idf <- taylor_swift_words %>%
count(album, word) %>%
bind_tf_idf(word, album, n) %>%
arrange(desc(tf_idf))
ts_tf_idf %>%
group_by(album) %>%
slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = reorder_within(word, tf_idf, album)) %>%
ggplot(aes(tf_idf, word)) +
geom_col() +
facet_wrap(~ album, scales = "free_y") +
scale_y_reordered()
Here we counted how often each word appears in each album, computed the tf-idf of each word within each album with bind_tf_idf(), and arranged the results in descending order of tf_idf.
We then faceted the plot so that every album gets its own panel showing its most distinctive words and how strongly they stand out. I also worked with slice_max(), which was new to me. We lost "folklore" after reworking and renaming things, so we had to go back and try to fix the code.
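Since slice_max() was new here, a tiny standalone example helps show what it does (a sketch with made-up numbers): it keeps the n rows with the largest values of the ordering column, and it does so within each group when the data is grouped.
# Toy data: three scores in each of two groups
toy <- tibble(group = c("a", "a", "a", "b", "b", "b"),
  score = c(5, 9, 1, 4, 8, 6))
# Keep the two highest scores within each group
toy %>%
  group_by(group) %>%
  slice_max(score, n = 2, with_ties = FALSE) %>%
  ungroup()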
library(tidylo)
## Warning: package 'tidylo' was built under R version 4.0.3
ts_lo <- taylor_swift_words %>%
count(album, word) %>%
bind_log_odds(album, word, n) %>%
arrange(desc(log_odds_weighted))
ts_lo %>%
group_by(album) %>%
slice_max(log_odds_weighted, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = reorder_within(word, log_odds_weighted, album)) %>%
ggplot(aes(log_odds_weighted, word)) +
geom_col() +
facet_wrap(~ album, scales = "free_y") +
scale_y_reordered()
filler <- c("ah", "uh", "ha", "ey", "eh", "eeh", "huh")
ts_lo %>%
filter(word %in% filler) %>%
group_by(album) %>%
slice_max(log_odds_weighted, n = 5, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = reorder_within(word, log_odds_weighted, album)) %>%
ggplot(aes(log_odds_weighted, word)) +
geom_col() +
facet_wrap(~ album, scales = "free_y") +
scale_y_reordered()
ts_lo %>%
filter(word %in% filler) %>%
mutate(word = reorder_within(word, n, album)) %>%
ggplot(aes(n, word)) +
geom_col() +
facet_wrap(~ album, scales = "free_y") +
scale_y_reordered() +
labs(title = "The filler words in Taylor Swift have changed across albums", x = "# of appearances in albums", y = "")
These graphs were similar to the ones above; the difference is not too noticeable, but the words here are a lot more common than in the other data sets above, since these songs have repeated motifs. At this point the album column had been lost on my end and could not be found, so I had to go back and play with the code again until I was able to get album to read. I liked the graph where we used the filler words, as we were able to see how many times each filler word was used per album. After going back and working with our previous data set I ended up losing the album reputation.
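The recurring "lost album" problem comes from the inner_join with released_dates: any album whose name does not match exactly (folklore vs. Folklore, reputation vs. Reputation) is silently dropped. A quick diagnostic sketch (not from the screencast) lists the albums that would be lost so their names can be recoded:
# Albums in the lyrics data with no exact match in released_dates are the ones
# the inner_join above drops silently
taylor_swift_lyrics %>%
  rename_all(str_to_lower) %>%
  distinct(album) %>%
  anti_join(released_dates, by = "album")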
TODO: Compare words used by Taylor Swift and Beyoncé
# keep the song title column named title so it matches the Beyoncé table below
ts <- taylor_swift_lyrics %>%
rename_all(str_to_lower) %>%
select(-album)
beyonce <- beyonce_lyrics %>%
select(artist = artist_name, title = song_name, lyrics = line)
artist_song_words_raw <- bind_rows(ts, beyonce) %>%
unnest_tokens(word, lyrics) %>%
count(artist, title, word)
artist_song_words <- artist_song_words_raw %>%
anti_join(stop_words, by = "word")
What was interesting with the Beyoncé data was that there were no albums; it is organized by song, with one row per lyric line. So we removed the albums from the Taylor Swift data in order to compare across songs, shortened the two tables' names to "ts" and "beyonce", and then combined the two artists.
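Because the Beyoncé table is one row per lyric line while the Taylor Swift table is one row per song, another way to line them up (a sketch, not what we ran above; unnest_tokens() works either way) is to collapse Beyoncé's lines into one lyrics string per song first:
# Collapse Beyoncé's line-per-row lyrics into a single string per song
beyonce_by_song <- beyonce_lyrics %>%
  group_by(artist = artist_name, title = song_name) %>%
  summarize(lyrics = paste(line, collapse = "\n"), .groups = "drop")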
by_artist_word <- artist_song_words %>%
group_by(artist, word) %>%
summarize(num_songs = n(), num_words = sum(n)) %>%
mutate(pct_words = num_words / sum(num_words)) %>%
group_by(word) %>%
mutate(num_words_total = sum(num_words)) %>%
bind_log_odds(artist, word, num_words) %>%
ungroup()
## `summarise()` regrouping output by 'artist' (override with `.groups` argument)
word_differences <- by_artist_word %>%
arrange(desc(abs(log_odds_weighted))) %>%
filter(artist == "Beyoncé") %>%
slice_max(num_words_total, n = 100, with_ties = FALSE) %>%
slice_max(abs(log_odds_weighted), n = 25, with_ties = FALSE) %>%
mutate(word = fct_reorder(word, log_odds_weighted)) %>%
mutate(direction = ifelse(log_odds_weighted > 0, "Beyoncé", "Taylor Swift"))
word_differences %>%
ggplot(aes(log_odds_weighted, word, fill = direction)) +
geom_col() +
scale_x_continuous(breaks = log(2 ^ seq(-6, 9, 3)) , labels = paste0(2 ^ abs(seq(-6, 9, 3)), "X")) +
labs(x = "Relative use in Beyoncé vs Taylor Swift (weighted)", y = "", title = "Which words most distinguish Beyoncé and Taylor Swift songs?", subtitle = "Among the 100 words most used by the artists (combined)", fill = "")
x_labels <- paste0(2 ^ abs(seq(-6, 9, 3)), "X")
x_labels <- ifelse(x_labels == "1X", "Same", x_labels)
word_differences %>%
ggplot(aes(log_odds_weighted, word)) +
geom_col(width = .1) +
geom_point(aes(size = num_words_total, color = direction)) +
geom_vline(lty = 2, xintercept = 0) +
scale_x_continuous(breaks = log(2 ^ seq(-6, 9, 3)), labels = x_labels) +
labs(x = "Relative use in Beyoncé vs Taylor Swift (weighted)", y = "", title = "Which words most distinguish Beyoncé and Taylor Swift songs?", subtitle = "Among the 100 words most used by the artists (combined)", color = "", size = "# of words\n(both artists)")
I was getting really weird song names, almost as if the songs were now in a different language, and I tried to fix the code but could not figure it out. I was getting the correct numbers and all; the words just would not compute correctly for the first dataset. I was also having trouble after I used slice_max(), and my graph would not pop up. I finally got a graph after playing with the code, but it did look quite different from David Robinson's graph.
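The "weird song names" symptom is usually a sign that the two tables fed into bind_rows() did not share a column name, so one artist's titles came through as NA. A small check for that (a sketch, not from the screencast):
# A count of NA titles per artist after binding reveals a column-name mismatch
bind_rows(ts, beyonce) %>%
  count(artist, missing_title = is.na(title))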
by_artist_word %>%
select(artist, word, num_words_total, pct_words) %>%
pivot_wider(names_from = artist, values_from = pct_words, values_fill = list(pct_words = 0)) %>%
janitor::clean_names() %>%
slice_max(num_words_total, n = 200, with_ties = FALSE) %>%
ggplot(aes(taylor_swift, beyonce)) +
geom_point() +
scale_x_log10(labels = percent) +
scale_y_log10(labels = percent)
#excluded this chunk last, as again I was running into errors
Overall, I really enjoyed using this dataset with Beyoncé and Taylor Swift, as they are two of my favorite artists. I ran into a bit of trouble along the way, as David Robinson works really quickly! But I was able to fix the problems and enjoyed looking at these fun graphs.