For this homework assignment, you’ll need to install the
tidytext, taylor, and ggtext
packages. You won’t need to load them, but they’ll need to be installed
for the code chunk below to work.
## # A tibble: 25,698 × 6
##    album        album_simple track_name  line element word   
##    <fct>        <fct>        <chr>      <int> <chr>   <chr>  
##  1 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 blue   
##  2 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 eyes   
##  3 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 shined 
##  4 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 georgia
##  5 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 stars  
##  6 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 shame  
##  7 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 night  
##  8 Taylor Swift Taylor Swift Tim McGraw     3 Verse 1 lie    
##  9 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 boy    
## 10 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 chevy  
## # ℹ 25,688 more rows
The swift data set contains the lyrics from all eleven of
Taylor Swift’s albums, converted so that each row represents one word in
her lyrics, with the stop
words removed.
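As a rough illustration of what "one row per word, stop words removed" means, the sketch below tokenizes two made-up lyric lines. It is not the code that built swift: toy_lyrics and toy_words are hypothetical names and the lyric text is invented, while unnest_tokens() and stop_words come from the tidytext package.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy lyrics (invented for illustration, not the real data)
toy_lyrics <- tibble(
  track_name = "Toy Song",
  line = 1:2,
  lyric = c("The stars at night", "My eyes on the chevy")
)

toy_words <- toy_lyrics |>
  unnest_tokens(word, lyric) |>        # one row per lowercased word
  anti_join(stop_words, by = "word")   # drop stop words such as "the"

toy_words
```

Each remaining row keeps its track_name and line, which is the same shape as the swift output shown above.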
The columns in the data are:
- album: the album the word is in
- track_name: the name of the song the word is in
- line: the line the word is in
- element: the part of the song the word is in (the chorus, which verse, etc.)
- word: the word used in the song

Let’s look to see if Taylor Swift’s top 100 words in her lyrics (without stop words) follow Zipf’s law. According to Zipf’s law, how frequently each word appears (relative to the most common word) should be about
\[\textrm{word frequency} \propto \frac{1}{\textrm{word rank}}\]
Create the data set of Taylor Swift’s top 100 words with the following columns:
- word: The word
- freq: How often the word occurs
- rank: The word rank (most common word = 1, 50th most common word = 50)
  - You can use rank = row_number(), as long as you arrange the rows correctly!
- exp_freq: The expected frequency of each word if Zipf’s law is true

The expected frequency is
\[\text{expected frequency} = \frac{\max(\text{frequency})}{\text{rank}}\]
where \(\max(\text{frequency})\) is the frequency of the most common word.
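The formula above can be sketched on a tiny made-up word list. This is a minimal illustration, assuming dplyr is loaded; toy_words and toy_zipf are hypothetical names, and the data are invented rather than taken from swift.

```r
library(dplyr)

# Hypothetical toy data: one row per word occurrence
toy_words <- tibble(
  word = c("love", "love", "love", "love", "time", "time", "night")
)

toy_zipf <- toy_words |>
  count(word, name = "freq", sort = TRUE) |>  # most common word first
  mutate(
    rank = row_number(),                      # valid because rows are sorted by freq
    exp_freq = max(freq) / rank               # Zipf's expected frequency
  )

toy_zipf
```

Here the most common word has freq = 4, so the expected frequencies for ranks 1, 2, and 3 are 4, 2, and 4/3. In the output below, the same arithmetic gives 427 / 2 = 213.5 for the rank 2 word, which the tibble displays as 214.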
## # A tibble: 100 × 4
##    word   freq  rank exp_freq
##    <chr> <int> <int>    <dbl>
##  1 love    427     1    427  
##  2 time    390     2    214. 
##  3 ooh     286     3    142. 
##  4 baby    237     4    107. 
##  5 gonna   224     5     85.4
##  6 ah      222     6     71.2
##  7 wanna   199     7     61  
##  8 yeah    192     8     53.4
##  9 night   158     9     47.4
## 10 stay    139    10     42.7
## # ℹ 90 more rows
Create the scatterplot seen in Brightspace. The text for the 10 most common words has been included.
Note: Both methods of adding text to a graph have been used. The position of the text for the 10 most common words is random, so your graph’s text won’t be identical, but the words should be the same and the positions of the points should be the same.
Does it appear that Taylor Swift’s 100 most commonly used words follow Zipf’s law? If yes, explain why. If no, do the words occur more frequently than they should or less frequently than they should?
Create the graph seen in Brightspace of how often Taylor
Swift’s 10 most common words are used in each album. You’ll need to use
the swift data set, not the data set you created for
Question 1. Use the album_simple column instead of
album.
Note: You’ll need to use
complete(album_simple, word, fill = list(n = 0)), where
n represents the column of word counts at the end of the
pipe chain.
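To see why complete() is needed, consider a toy example in which one word never appears on one album. This is a sketch only: toy and toy_counts are hypothetical names and the data are made up, while count() comes from dplyr and complete() from tidyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: "time" never appears on album "B"
toy <- tibble(
  album_simple = c("A", "A", "B"),
  word = c("love", "time", "love")
)

toy_counts <- toy |>
  count(album_simple, word) |>                       # only observed pairs get a row
  complete(album_simple, word, fill = list(n = 0))   # add missing pairs with n = 0

toy_counts
```

Without complete(), the album "B" and word "time" combination would be missing from the counts entirely, rather than appearing with a count of 0.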
Use the data created in the code chunk below for this question.
## # A tibble: 13 × 12
##    word  `Taylor Swift` `Fearless (Taylor's Version)` Speak Now (Taylor's Vers…¹
##    <chr>          <int>                         <int>                      <int>
##  1 baby              15                            52                         14
##  2 bad                5                             1                          5
##  3 eyes              12                            13                         16
##  4 feel               5                            45                         10
##  5 heard              2                             3                          2
##  6 home              10                             8                          8
##  7 life               6                             5                         25
##  8 night              8                            24                         13
##  9 stay               5                            11                          7
## 10 time              15                            39                         47
## 11 call               0                             5                          7
## 12 run                0                            12                         12
## 13 red                0                             0                          0
## # ℹ abbreviated name: ¹`Speak Now (Taylor's Version)`
## # ℹ 8 more variables: `Red (Taylor's Version)` <int>,
## #   `1989 (Taylor's Version)` <int>, reputation <int>, Lover <int>,
## #   folklore <int>, evermore <int>, Midnights <int>,
## #   `THE TORTURED POETS DEPARTMENT` <int>
Create the graph seen in Brightspace by adding to the code chunk below. The code to create the data will be very similar to the code for Question 2, but you’ll only want two albums: Taylor Swift and THE TORTURED POETS DEPARTMENT.
Note: To get the code chunk to work, you’ll want to use
album, not album_simple.