Installing needed packages

For this homework assignment, you’ll need to install the tidytext, taylor, and ggtext packages. You won’t need to load them, but they’ll need to be installed for the code chunk below to work.

## # A tibble: 25,698 × 6
##    album        album_simple track_name  line element word   
##    <fct>        <fct>        <chr>      <int> <chr>   <chr>  
##  1 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 blue   
##  2 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 eyes   
##  3 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 shined 
##  4 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 georgia
##  5 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 stars  
##  6 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 shame  
##  7 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 night  
##  8 Taylor Swift Taylor Swift Tim McGraw     3 Verse 1 lie    
##  9 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 boy    
## 10 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 chevy  
## # ℹ 25,688 more rows

Data Description

The swift data set contains the lyrics from all eleven of Taylor Swift’s albums, converted so that each row represents one word in her lyrics, with the stop words removed.

The columns in the data are:

  1. album: the album the word is in
  2. album_simple: a simplified version of the album name
  3. track_name: the name of the song the word is in
  4. line: the line of the song the word is in
  5. element: the part of the song the word is in (chorus, a particular verse, etc.)
  6. word: the word used in the song

Question 1: Scatterplot for Zipf’s law

Let’s see whether Taylor Swift’s top 100 words in her lyrics (without stop words) follow Zipf’s law. According to Zipf’s law, how frequently each word appears (relative to the most common word) should be approximately inversely proportional to its rank:

\[\textrm{word frequency} \propto \frac{1}{\textrm{word rank}}\]

Part 1a) Creating the data

Create the data set of Taylor Swift’s top 100 words with the following columns:

  1. word: The word
  2. freq: How often the word occurs
  3. rank: The word rank (most common word = 1, 50th most common word = 50)
     • This can be done with rank = row_number() as long as you arrange the rows correctly!
  4. exp_freq: The expected frequency of each word if Zipf’s law is true
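As a minimal sketch of the row_number() hint, the toy example below uses made-up words and counts (the names toy_counts and toy_ranked are hypothetical and not part of the swift data). It illustrates that row_number() simply numbers rows in their current order, so the rows need to be sorted from most to least frequent first.

```r
library(dplyr)
library(tibble)

# Hypothetical toy counts, invented for illustration only
toy_counts <- tibble(
  word = c("sun", "moon", "star"),
  freq = c(5L, 12L, 8L)
)

toy_ranked <-
  toy_counts |>
  # Sort from most to least frequent so row 1 is the most common word
  arrange(desc(freq)) |>
  # row_number() with no arguments numbers the rows in their current order
  mutate(rank = row_number())

toy_ranked
```

Without the arrange() step, "sun" would be given rank 1 even though it is the least common word in the toy data.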

The expected frequency is

\[\text{expected frequency} = \frac{\max(\text{frequency})}{\text{rank}}\]

where \(\max(\text{frequency})\) is the frequency of the most common word.

swift_top_words <- 
  swift |> 
  # Counting how often each word occurs
  count(word, name = 'freq') |> 
  # Arranging them in descending order
  arrange(-freq) |> 
  # Keeping the top 100 (and ordering them descendingly)
  slice(1:100) |> 
  # Adding the rank using row_number() 
  mutate(
    rank = row_number(),
    exp_freq = 1/rank * max(freq)
  )

tibble(swift_top_words)
## # A tibble: 100 × 4
##    word   freq  rank exp_freq
##    <chr> <int> <int>    <dbl>
##  1 love    427     1    427  
##  2 time    390     2    214. 
##  3 ooh     286     3    142. 
##  4 baby    237     4    107. 
##  5 gonna   224     5     85.4
##  6 ah      222     6     71.2
##  7 wanna   199     7     61  
##  8 yeah    192     8     53.4
##  9 night   158     9     47.4
## 10 stay    139    10     42.7
## # ℹ 90 more rows
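As a quick sanity check on the exp_freq column, the expected-frequency formula can be applied by hand to the first few ranks. The sketch below only uses the frequency of the most common word from the output above (427 for "love"); the variable names are chosen for illustration.

```r
# Frequency of the most common word ("love") in the output above
max_freq <- 427

# Zipf's law: expected frequency = max(frequency) / rank
ranks <- 1:4
exp_freq <- max_freq / ranks

# Compare with the exp_freq column for ranks 1 to 4
exp_freq
```

The tibble printout rounds these values for display, so they should agree with the first four rows of exp_freq up to rounding.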

Part 1b) Scatterplot of Zipf’s Law

Create the scatterplot seen in Brightspace. Text labels for the 10 most common words have been included.

Note: Both methods of adding text to a graph have been used. The labels for the 10 most common words are positioned randomly, so your graph’s text won’t be placed identically, but the words and the positions of the points should be the same.

RNGversion('4.1.0'); set.seed(2870)
ggplot(
  data = swift_top_words,
  mapping = aes(
    x = exp_freq,
    y = freq
  )
) + 
  geom_point() + 
  # The line the points should follow if Zipf's law is true
  geom_abline(
    intercept = 0,
    slope = 1,
    color = 'blue',
    linetype = 'dashed'
  ) +
  # Adding the text of the 10 most common words
  ggrepel::geom_text_repel(
    # Just displaying the 10 most common words
    data = swift_top_words |> slice_max(freq, n = 10),
    mapping = aes(label = word),
    nudge_y = .1
  ) + 
  # Adding a note that the axes are in log10 scale
  annotate(
    geom = "text",
    x = 200,
    y = 30,
    label = "x and y-axes are in log10 scale"
    #fontface = "bold"
  ) + 
  # Adding labels and titles
  labs(
    x = "Expected Frequency by Zipf's Law",
    y = "Actual Frequency",
    title = "Does Zipf's Law Explain the Frequency of Taylor Swift Lyrics?",
    subtitle = "Blue line indicates the trend if Zipf's Law holds",
    caption = 'Stop words excluded'
  ) + 
  theme_classic() + 
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  # Putting both axes on a log10 scale
  scale_x_log10() + 
  scale_y_log10()  

Does it appear that Taylor Swift’s 100 most commonly used words follow Zipf’s law? If yes, explain why. If no, do the words occur more frequently or less frequently than they should?

Question 2: Heat Map of Swift Lyrics by Album

Create the graph seen in Brightspace of how often Taylor Swift’s 10 most common words are used in each album. You’ll need to use the swift data set, not the data set you created for Question 1. Use the album_simple column instead of album.

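Note: At the end of the pipe chain that creates the data, you’ll need to use complete(album_simple, word, fill = list(n = 0)), where n is the column of word counts. This adds a row with a count of 0 for any album and word combination that never occurs.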
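To see what complete() does, here is a minimal sketch on a made-up set of counts (the albums "A" and "B" and the object names are hypothetical). In the toy data, "love" never appears in album "B", so count() would have produced no row for that combination.

```r
library(tibble)
library(tidyr)

# Hypothetical counts: there is no row for the word "love" in album "B"
toy_counts <- tibble(
  album_simple = c("A", "A", "B"),
  word         = c("love", "time", "time"),
  n            = c(3L, 2L, 4L)
)

toy_complete <-
  toy_counts |>
  # Add every album/word combination; fill the missing counts with 0 instead of NA
  complete(album_simple, word, fill = list(n = 0))

toy_complete
```

Without this step, the heat map would have an empty gap rather than a tile showing 0 for any word that an album never uses.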

swift |> 
  # Top 10 most common words
  filter(word %in% (swift_top_words |> slice_max(freq, n = 10) |> pull(word))) |> 
  # Counting how often they occur per album
  count(album_simple, word) |> 
  complete(album_simple, word, fill = list(n = 0)) |> 
  # Creating the graph
  ggplot(
    mapping = aes(
      y = album_simple,
      x = word,
      fill = n
    )
  ) + 
  geom_tile(
    color = 'white',
    show.legend = FALSE
  ) + 
  # Adding the text for the word frequency
  ggfittext::geom_fit_text(
    mapping = aes(label = n),
    contrast = TRUE,
    show.legend = FALSE
  ) + 
  labs(
    x = NULL,
    y = NULL,
    title = "Frequency of Taylor Swift's 10 Most Common Words by Album"
  ) + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  # Removing the buffer space
  coord_cartesian(expand = FALSE) + 
  # Changing the color gradient
  #scale_fill_viridis_c() + 
  taylor::scale_fill_taylor_c() +
  # The function below will wrap THE TORTURED POETS DEPARTMENT into 2 lines
  scale_y_discrete(labels = ~ if_else(nchar(.) > 13, 
                                      paste0(substr(., 1, 13), '\n', substr(., 14, nchar(.))),
                                      .))
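The label function passed to scale_y_discrete() can be tried on its own to see how the wrapping works. The sketch below restates the same rule as a named base R function (wrap_after_13 is a hypothetical name, and ifelse() stands in for if_else()): labels longer than 13 characters get a line break inserted after the 13th character.

```r
# Same rule as the label function above, written as a named base R function
wrap_after_13 <- function(x) {
  ifelse(
    nchar(x) > 13,
    # Characters 1-13, a line break, then the rest of the label
    paste0(substr(x, 1, 13), "\n", substr(x, 14, nchar(x))),
    # Short labels are returned unchanged
    x
  )
}

wrapped <- wrap_after_13(c("Lover", "THE TORTURED POETS DEPARTMENT"))

# cat() prints the line break so the two-line label is visible
cat(wrapped, sep = "\n---\n")
```

Because the break is placed at a fixed character position, it only falls between words when the 13th character happens to be a space, as it does in "THE TORTURED POETS DEPARTMENT".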

Question 3: Dumbbell Plot Comparing 10 Common Words for First Album (Taylor Swift) vs Most Recent Album (TTPD)

Use the data created in the code chunk below for this question

swift_q3 <- 
  swift |> 
  filter(
    word %in% c('baby',   'bad', 'eyes', 'feel', 'heard', 'home', 'life', 
                'night', 'stay', 'time', 'call',   'run', 'red')
  ) |> 
  count(album, word) |> 
  pivot_wider(
    id_cols = word,
    names_from = album,
    values_from = n,
    values_fill = 0
  ) 

swift_q3
## # A tibble: 13 × 12
##    word  `Taylor Swift` `Fearless (Taylor's Version)` Speak Now (Taylor's Vers…¹
##    <chr>          <int>                         <int>                      <int>
##  1 baby              15                            52                         14
##  2 bad                5                             1                          5
##  3 eyes              12                            13                         16
##  4 feel               5                            45                         10
##  5 heard              2                             3                          2
##  6 home              10                             8                          8
##  7 life               6                             5                         25
##  8 night              8                            24                         13
##  9 stay               5                            11                          7
## 10 time              15                            39                         47
## 11 call               0                             5                          7
## 12 run                0                            12                         12
## 13 red                0                             0                          0
## # ℹ abbreviated name: ¹​`Speak Now (Taylor's Version)`
## # ℹ 8 more variables: `Red (Taylor's Version)` <int>,
## #   `1989 (Taylor's Version)` <int>, reputation <int>, Lover <int>,
## #   folklore <int>, evermore <int>, Midnights <int>,
## #   `THE TORTURED POETS DEPARTMENT` <int>

Create the graph seen in Brightspace by adding to the code chunk below. The code to create the data will be very similar to the code for Question 2, but you’ll only want two albums: Taylor Swift and THE TORTURED POETS DEPARTMENT.

Note: To get the code chunk to work, you’ll want to use album, not album_simple

swift |> 
  # Counting how often they occur per album
  count(album, word) |> 
  # Including word album combos with a count of 0
  complete(album, word, fill = list(n = 0)) |> 
  filter(
    # The first and last albums
    album %in% c('Taylor Swift', 'THE TORTURED POETS DEPARTMENT'),
    word %in% c('baby',   'bad', 'eyes', 'feel', 'heard', 
                'night', 'time', 'call',  'life', 'home')
  ) |> 
  # Creating the graph
  ggplot(
    mapping = aes(
      x = n,
      y = fct_reorder(word, n, max)
    )
  ) +
  geom_line() + 
  geom_point(
    mapping = aes(color = album),
    size = 3
  ) + 
  labs(
    title = "Word Counts for Taylor's <span style='color:lightblue;'>Taylor Swift</span> vs <span style='color:grey70;'>TTPD</span> Albums",
    x = NULL,
    y = NULL
  ) +
  theme_bw() + 
  theme(
    legend.position = 'none',
    plot.title = ggtext::element_markdown(hjust = 0.5)
  ) +
  scale_color_manual(
    values = c('lightblue', 'grey70')
  )