Installing needed packages

For this homework assignment, you’ll need to install the tidytext, taylor, and ggtext packages. You won’t need to load them, but they’ll need to be installed for the code chunk below to work.
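One possible way to install the three packages is sketched below. It assumes an internet connection and a working CRAN mirror; the variable names `needed` and `to_install` are only illustrative.

```r
# Packages named in this assignment
needed <- c("tidytext", "taylor", "ggtext")

# Keep only the packages that are not already installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]

# Install whatever is missing (requires an internet connection)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Running this once in the console, rather than inside the homework file, avoids reinstalling the packages every time the document is knit.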

## # A tibble: 25,698 × 6
##    album        album_simple track_name  line element word   
##    <fct>        <fct>        <chr>      <int> <chr>   <chr>  
##  1 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 blue   
##  2 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 eyes   
##  3 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 shined 
##  4 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 georgia
##  5 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 stars  
##  6 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 shame  
##  7 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 night  
##  8 Taylor Swift Taylor Swift Tim McGraw     3 Verse 1 lie    
##  9 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 boy    
## 10 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 chevy  
## # ℹ 25,688 more rows

Data Description

The swift data set contains the lyrics from all eleven of Taylor Swift’s albums, converted so that each row represents one word in her lyrics, with stop words removed.

The columns in the data are:

  1. album: the album the word is in
  2. track_name: the name of the song the word is in
  3. line: the line the word is in
  4. element: the part of the song the word is in (chorus, which verse, etc.)
  5. word: the word used in the song

Question 1: Scatterplot for Zipf’s law

Let’s see whether the 100 most common words in Taylor Swift’s lyrics (without stop words) follow Zipf’s law. According to Zipf’s law, how frequently each word appears (relative to the most common word) should be approximately inversely proportional to its rank:

\[\textrm{word frequency} \propto \frac{1}{\textrm{word rank}}\]

Part 1a) Creating the data

Create the data set of Taylor Swift’s top 100 words with the following columns:

  1. word: The word
  2. freq: How often the word occurs
  3. rank: The word rank (most common word = 1, 50th most common word = 50)
     • This can be done with rank = row_number() as long as you arrange the rows correctly!
  4. exp_freq: The expected frequency of each word if Zipf’s law is true

The expected frequency is

\[\text{expected frequency} = \frac{\max(\text{frequency})}{\text{rank}}\]

where \(\max(\text{frequency})\) is the frequency of the most common word.
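As a rough illustration of the rank and expected-frequency calculation, here is a minimal sketch on a small made-up table of word counts (`toy_counts` and its values are invented for this example and are not taken from the swift data). It assumes the dplyr package is installed and R 4.1 or later for the `|>` pipe.

```r
library(dplyr)

# Toy word counts (made up for illustration; not from the swift data)
toy_counts <- tibble(
  word = c("sun", "moon", "star", "sky"),
  freq = c(40L, 100L, 20L, 55L)
)

toy_zipf <- toy_counts |>
  arrange(desc(freq)) |>          # most common word first
  mutate(
    rank = row_number(),          # 1 = most common word
    exp_freq = max(freq) / rank   # expected frequency under Zipf's law
  )

toy_zipf
```

The ranks are only correct because the rows are sorted from most to least frequent before row_number() is called. For the assignment itself, the word counts would first have to be computed from the swift data and limited to the top 100 words.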

## # A tibble: 100 × 4
##    word   freq  rank exp_freq
##    <chr> <int> <int>    <dbl>
##  1 love    427     1    427  
##  2 time    390     2    214. 
##  3 ooh     286     3    142. 
##  4 baby    237     4    107. 
##  5 gonna   224     5     85.4
##  6 ah      222     6     71.2
##  7 wanna   199     7     61  
##  8 yeah    192     8     53.4
##  9 night   158     9     47.4
## 10 stay    139    10     42.7
## # ℹ 90 more rows

Part 1b) Scatterplot of Zipf’s Law

Create the scatterplot seen in Brightspace. Text labels for the 10 most common words have been included.

Note: Both methods of adding text to a graph have been used. The positions of the labels for the 10 most common words are random, so your graph’s text won’t be identical, but the words and the positions of the points should be the same.

Does it appear that Taylor Swift’s 100 most commonly used words follow Zipf’s law? If yes, explain why. If no, do the words occur more frequently or less frequently than they should?

Question 2: Heat Map of Swift Lyrics by Album

Create the graph seen in Brightspace of how often Taylor Swift’s 10 most common words are used in each album. You’ll need to use the swift data set, not the data set you created for Question 1. Use the album_simple column instead of album.

Note: At the end of the pipe chain, you’ll need to use complete(album_simple, word, fill = list(n = 0)), where n is the column of word counts. This adds a row with a count of 0 for any word that never appears on an album.
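To show what complete() does to a table of counts, here is a small sketch on made-up data (`toy_words` and its album labels "A" and "B" are hypothetical). It assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Toy data (made up): one row per word occurrence
toy_words <- tibble(
  album_simple = c("A", "A", "A", "B"),
  word         = c("love", "love", "time", "love")
)

toy_grid <- toy_words |>
  count(album_simple, word) |>                      # n = occurrences per album/word pair
  complete(album_simple, word, fill = list(n = 0))  # add missing pairs with n = 0

toy_grid
```

Without complete(), the pair of album "B" and the word "time" would simply be absent from the counts, which would leave a blank tile in a heat map instead of a tile representing zero.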

Question 3: Dumbbell plot comparing the 10 most common words for the first album (Taylor Swift) vs. the most recent album (TTPD)

Use the data created in the code chunk below for this question.

## # A tibble: 13 × 12
##    word  `Taylor Swift` `Fearless (Taylor's Version)` Speak Now (Taylor's Vers…¹
##    <chr>          <int>                         <int>                      <int>
##  1 baby              15                            52                         14
##  2 bad                5                             1                          5
##  3 eyes              12                            13                         16
##  4 feel               5                            45                         10
##  5 heard              2                             3                          2
##  6 home              10                             8                          8
##  7 life               6                             5                         25
##  8 night              8                            24                         13
##  9 stay               5                            11                          7
## 10 time              15                            39                         47
## 11 call               0                             5                          7
## 12 run                0                            12                         12
## 13 red                0                             0                          0
## # ℹ abbreviated name: ¹​`Speak Now (Taylor's Version)`
## # ℹ 8 more variables: `Red (Taylor's Version)` <int>,
## #   `1989 (Taylor's Version)` <int>, reputation <int>, Lover <int>,
## #   folklore <int>, evermore <int>, Midnights <int>,
## #   `THE TORTURED POETS DEPARTMENT` <int>

Create the graph seen in Brightspace by adding to the code chunk below. The code to create the data will be very similar to the code for Question 2, but you’ll only want two albums: Taylor Swift and THE TORTURED POETS DEPARTMENT.

Note: To get the code chunk to work, you’ll want to use album, not album_simple.
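For reference, a dumbbell plot can be built from a segment joining two points per row. The sketch below shows one common approach on made-up data; it is not necessarily how the graph in Brightspace was produced. The data frame `toy_dumbbell`, its words, its counts, and the column names `first_album` and `last_album` are all hypothetical. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Toy data (made up): one row per word, one count column per album
toy_dumbbell <- data.frame(
  word        = c("alpha", "beta", "gamma"),
  first_album = c(10, 4, 7),
  last_album  = c(3, 9, 7)
)

dumbbell_plot <- ggplot(toy_dumbbell, aes(y = word)) +
  # Bar of the dumbbell: a segment joining the two counts for each word
  geom_segment(aes(x = first_album, xend = last_album, yend = word)) +
  # Ends of the dumbbell: one point per album, colored by album
  geom_point(aes(x = first_album, color = "First album"), size = 3) +
  geom_point(aes(x = last_album, color = "Last album"), size = 3) +
  labs(x = "Times used", y = NULL, color = NULL)

dumbbell_plot
```

In the assignment data the count columns are named after the albums, so names containing spaces, such as `Taylor Swift`, need to be wrapped in backticks inside aes().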