For this homework assignment, you’ll need to install the
tidytext, taylor, and ggtext
packages. You won’t need to load them, but they’ll need to be installed
for the code chunk below to work.
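If any of them are missing, a one-time install from CRAN looks like this:
install.packages(c('tidytext', 'taylor', 'ggtext'))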
## # A tibble: 25,698 × 6
##    album        album_simple track_name  line element word   
##    <fct>        <fct>        <chr>      <int> <chr>   <chr>  
##  1 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 blue   
##  2 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 eyes   
##  3 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 shined 
##  4 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 georgia
##  5 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 stars  
##  6 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 shame  
##  7 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 night  
##  8 Taylor Swift Taylor Swift Tim McGraw     3 Verse 1 lie    
##  9 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 boy    
## 10 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 chevy  
## # ℹ 25,688 more rows
The swift data set contains the lyrics from all eleven of
Taylor Swift’s albums in tidy text form: each row represents a single
word from her lyrics, with stop words removed.
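The setup chunk that creates swift isn’t shown. As a rough sketch only, a tidy one-word-per-row data set like this could be built with tidytext, assuming the lyrics come from the taylor package’s taylor_album_songs data set (the lyrics and lyric column names below are assumptions about that data set, and the assignment’s actual code may differ):
library(dplyr)
library(tidytext)

# Sketch only: un-nest each song's lyric tibble, split every lyric line
# into one word per row, then drop common stop words
swift_sketch <- 
  taylor::taylor_album_songs |> 
  tidyr::unnest(lyrics) |> 
  unnest_tokens(word, lyric) |> 
  anti_join(stop_words, by = 'word')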
The columns in the data are:

- album: the album the word appears on
- album_simple: a simplified version of the album title
- track_name: the name of the song the word is in
- line: the line of the song the word is in
- element: whether the word is in the chorus, a verse, etc.
- word: the word used in the song

Let’s check whether Taylor Swift’s top 100 words in her lyrics (without stop words) follow Zipf’s law. According to Zipf’s law, how frequently each word appears (relative to the most common word) should be about
\[\textrm{word frequency} \propto \frac{1}{\textrm{word rank}}\]
Create the data set of Taylor Swift’s top 100 words with the following columns:

- word: the word
- freq: how often the word occurs
- rank: the word rank (most common word = 1, 50th most common word = 50). Hint: rank = row_number() works as long as you arrange the rows correctly!
- exp_freq: the expected frequency of each word if Zipf’s law is true

The expected frequency is
\[\text{expected frequency} = \frac{\max(\text{frequency})}{\text{rank}}\]
where \(\max(\text{frequency})\) is the frequency of the most common word. For example, the most common word, love, occurs 427 times, so the expected frequency of the rank 5 word is \(427 / 5 = 85.4\).
swift_top_words <- 
  swift |> 
  # Counting how often each word occurs
  count(word, name = 'freq') |> 
  # Arranging them in descending order of frequency
  arrange(desc(freq)) |> 
  # Keeping the top 100 words
  slice(1:100) |> 
  # Adding the rank using row_number() 
  mutate(
    rank = row_number(),
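    # The expected frequency under Zipf's law: max(freq) / rank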
    exp_freq = 1/rank * max(freq)
  )
swift_top_words
## # A tibble: 100 × 4
##    word   freq  rank exp_freq
##    <chr> <int> <int>    <dbl>
##  1 love    427     1    427  
##  2 time    390     2    214. 
##  3 ooh     286     3    142. 
##  4 baby    237     4    107. 
##  5 gonna   224     5     85.4
##  6 ah      222     6     71.2
##  7 wanna   199     7     61  
##  8 yeah    192     8     53.4
##  9 night   158     9     47.4
## 10 stay    139    10     42.7
## # ℹ 90 more rows
Create the scatterplot seen in Brightspace. The text for the 10 most common words has been included.
Note: Both methods of adding text to a graph (annotate() and ggrepel::geom_text_repel()) have been used. The placement of the labels for the 10 most common words is random, so your graph’s text won’t be identical, but the words should be the same and the positions of the points should be the same.
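# Fixing the random seed so ggrepel's label placement is reproducible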
RNGversion('4.1.0'); set.seed(2870)
ggplot(
  data = swift_top_words,
  mapping = aes(
    x = exp_freq,
    y = freq
  )
) + 
  geom_point() + 
  # The line the points should follow if Zipf's law is true
  geom_abline(
    intercept = 0,
    slope = 1,
    color = 'blue',
    linetype = 'dashed'
  ) +
  # Adding the text of the 10 most common words
  ggrepel::geom_text_repel(
    # Just displaying the 10 most common words
    data = swift_top_words |> slice_max(freq, n = 10),
    mapping = aes(label = word),
    nudge_y = .1
  ) + 
  # Adding a note that the axes are in log10 scale
  annotate(
    geom = "text",
    x = 200,
    y = 30,
    label = "x and y-axes are in log10 scale"
    #fontface = "bold"
  ) + 
  # Adding labels and titles
  labs(
    x = "Expected Frequency by Zipf's Law",
    y = "Actual Frequency",
    title = "Does Zipf's Law Explain the Frequency of Taylor Swift Lyrics?",
    subtitle = "Blue line indicates the trend if Zipf's Law holds",
    caption = 'Stop words excluded'
  ) + 
  theme_classic() + 
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  # Transforming both axes to the log10 scale
  scale_x_log10() + 
  scale_y_log10()  
Does it appear that Taylor Swift’s 100 most commonly used words follow Zipf’s Law? If yes, explain why. If no, do the words occur more frequently than they should or less frequently than they should?
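As an optional check that goes beyond the graph (not part of the assignment), Zipf’s law implies \(\log_{10}(\text{freq}) \approx \log_{10}(\max(\text{frequency})) - \log_{10}(\text{rank})\), so fitting a line on the log10 scale should give a slope near \(-1\):
# Sketch: under Zipf's law the slope on log10(rank) should be close to -1
lm(log10(freq) ~ log10(rank), data = swift_top_words)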
Create the graph seen in Brightspace of how often Taylor
Swift’s 10 most common words are used in each album. You’ll need to use
the swift data set, not the data set you created for
Question 1. Use the album_simple column instead of
album.
Note: You’ll need to use
complete(album_simple, word, fill = list(n = 0)) at the end
of the pipe chain, where n represents the column of word
counts.
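To see what complete() does, here is a minimal toy example with made-up data: it fills in the missing album/word combination with a count of 0.
library(tidyr)

# A toy count table that is missing the ('B', 'night') combination
toy <- tibble::tibble(
  album = c('A', 'A', 'B'),
  word  = c('love', 'night', 'love'),
  n     = c(3L, 1L, 2L)
)

# complete() adds the missing combination, filling n with 0
toy |> complete(album, word, fill = list(n = 0))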
swift |> 
  # Top 10 most common words
  filter(word %in% (swift_top_words |> slice_max(freq, n = 10) |> pull(word))) |> 
  # Counting how often they occur per album
  count(album_simple, word) |> 
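  # Including album/word combos that never occur (count of 0)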
  complete(album_simple, word, fill = list(n = 0)) |> 
  # Creating the graph
  ggplot(
    mapping = aes(
      y = album_simple,
      x = word,
      fill = n
    )
  ) + 
  geom_tile(
    color = 'white',
    show.legend = FALSE
  ) + 
  # Adding the text for the word frequency
  ggfittext::geom_fit_text(
    mapping = aes(label = n),
    contrast = TRUE,
    show.legend = FALSE
  ) + 
  labs(
    x = NULL,
    y = NULL,
    title = "Frequency of Taylor Swift's 10 Most Common Words by Album"
  ) + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  # Removing the buffer space
  coord_cartesian(expand = FALSE) + 
  # Changing the color gradient
  #scale_fill_viridis_c() + 
  taylor::scale_fill_taylor_c() +
  # The function below will wrap THE TORTURED POETS DEPARTMENT into 2 lines
  scale_y_discrete(labels = ~ if_else(nchar(.) > 13, 
                                      paste0(substr(., 1, 13), '\n', substr(., 14, nchar(.))),
                                      .))
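An alternative to the manual substr() split is scales::label_wrap(), which wraps at word boundaries (it may produce a different number of lines). You could swap the scale_y_discrete() call above for:
# Equivalent idea: wrap long album names at whitespace instead of a fixed cut
scale_y_discrete(labels = scales::label_wrap(13))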
Use the data created in the code chunk below for this question.
swift_q3 <- 
  swift |> 
  filter(
    word %in% c('baby',   'bad', 'eyes', 'feel', 'heard', 'home', 'life', 
                'night', 'stay', 'time', 'call',   'run', 'red')
  ) |> 
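  # Counting how often each word occurs on each album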
  count(album, word) |> 
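  # Reshaping to one row per word and one column per album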
  pivot_wider(
    id_cols = word,
    names_from = album,
    values_from = n,
    values_fill = 0
  ) 
swift_q3
## # A tibble: 13 × 12
##    word  `Taylor Swift` `Fearless (Taylor's Version)` Speak Now (Taylor's Vers…¹
##    <chr>          <int>                         <int>                      <int>
##  1 baby              15                            52                         14
##  2 bad                5                             1                          5
##  3 eyes              12                            13                         16
##  4 feel               5                            45                         10
##  5 heard              2                             3                          2
##  6 home              10                             8                          8
##  7 life               6                             5                         25
##  8 night              8                            24                         13
##  9 stay               5                            11                          7
## 10 time              15                            39                         47
## 11 call               0                             5                          7
## 12 run                0                            12                         12
## 13 red                0                             0                          0
## # ℹ abbreviated name: ¹`Speak Now (Taylor's Version)`
## # ℹ 8 more variables: `Red (Taylor's Version)` <int>,
## #   `1989 (Taylor's Version)` <int>, reputation <int>, Lover <int>,
## #   folklore <int>, evermore <int>, Midnights <int>,
## #   `THE TORTURED POETS DEPARTMENT` <int>
Create the graph seen in Brightspace by adding to the code chunk below. The code to create the data will be very similar to the code for question 2, but you’ll only want two albums: Taylor Swift and THE TORTURED POETS DEPARTMENT.
Note: To get the code chunk to work, you’ll want to use
album, not album_simple.
swift |> 
  # Counting how often each word occurs per album
  count(album, word) |> 
  # Including word/album combos with a count of 0
  complete(album, word, fill = list(n = 0)) |> 
  filter(
    # Keeping only the first and last albums
    album %in% c('Taylor Swift', 'THE TORTURED POETS DEPARTMENT'),
    # and only the selected words
    word %in% c('baby', 'bad', 'eyes', 'feel', 'heard', 
                'night', 'time', 'call', 'life', 'home')
  ) |> 
  # Creating the graph
  ggplot(
    mapping = aes(
      x = n,
      y = fct_reorder(word, n, max)
    )
  ) +
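  # With a discrete y, geom_line() groups by word and draws one segment per word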
  geom_line() + 
  geom_point(
    mapping = aes(color = album),
    size = 3
  ) + 
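  # The <span> tags color the album names in the title to match the points;
  # they're rendered by ggtext::element_markdown() in theme() below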
  labs(
    title = "Word Counts for Taylor's <span style='color:lightblue;'>Taylor Swift</span> vs <span style='color:grey70;'>TTDP</span> Albums",
    x = NULL,
    y = NULL
  ) +
  theme_bw() + 
  theme(
    legend.position = 'none',
    plot.title = ggtext::element_markdown(hjust = 0.5)
  ) +
  scale_color_manual(
    values = c('lightblue', 'grey70')
  )