For this homework assignment, you'll need to install the tidytext, taylor, and ggtext packages. You won't need to load them, but they do need to be installed for the code chunk below to work.
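If any of these are missing, a one-time install from the console (not inside the document, since it only needs to happen once) will take care of it:

install.packages(c('tidytext', 'taylor', 'ggtext'))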
## # A tibble: 25,698 × 6
##    album        album_simple track_name  line element word
##    <fct>        <fct>        <chr>      <int> <chr>   <chr>
##  1 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 blue
##  2 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 eyes
##  3 Taylor Swift Taylor Swift Tim McGraw     1 Verse 1 shined
##  4 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 georgia
##  5 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 stars
##  6 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 shame
##  7 Taylor Swift Taylor Swift Tim McGraw     2 Verse 1 night
##  8 Taylor Swift Taylor Swift Tim McGraw     3 Verse 1 lie
##  9 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 boy
## 10 Taylor Swift Taylor Swift Tim McGraw     4 Verse 1 chevy
## # ℹ 25,688 more rows
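The code chunk that builds swift isn't echoed in the document. Purely as a reference sketch, and assuming the tidyverse is loaded, something like the following could assemble a data set of this shape from the taylor package; the exact cleaning steps, and the definition of album_simple, are assumptions rather than the assignment's actual code.

# A plausible sketch, not the assignment's actual (hidden) code
swift_sketch <-
  taylor::taylor_album_songs |>
  select(album = album_name, track_name, lyrics) |>
  unnest(lyrics) |>                                # one row per lyric line
  tidytext::unnest_tokens(word, lyric) |>          # one row per word
  anti_join(tidytext::stop_words, by = 'word') |>  # remove stop words
  select(album, track_name, line, element, word)
# album_simple (a simplified album label) would be derived from album;
# how exactly isn't shown in the document.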
The swift data set has the lyrics from all eleven of Taylor Swift's albums, tokenized so that each row represents one word of her lyrics, with the stop words removed.

The columns in the data are:

- album: the album the word is in
- album_simple: a simplified version of the album name
- track_name: the name of the song the word is in
- line: the line the word is in
- element: whether the word is in the chorus, a verse, etc.
- word: the word used in the song

Let's look to see if Taylor Swift's top 100 lyric words (without stop words) follow Zipf's law. According to Zipf's law, how frequently each word appears (relative to the most common word) should be about
\[\textrm{word frequency} \propto \frac{1}{\textrm{word rank}}\]
Create the data set of Taylor Swift's top 100 words with the following columns:

- word: the word
- freq: how often the word occurs
- rank: the word rank (most common word = 1, 50th most common word = 50); rank = row_number() works as long as you arrange the rows correctly!
- exp_freq: the expected frequency of each word if Zipf's law is true

The expected frequency is

\[\text{expected frequency} = \frac{\max(\text{frequency})}{\text{rank}}\]

where \(\max(\text{frequency})\) is the frequency of the most common word.
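For example, anticipating the output below: the most common word, "love", appears 427 times, so the word ranked 5th has an expected frequency of

\[\text{expected frequency} = \frac{427}{5} = 85.4,\]

which matches the exp_freq value shown for gonna.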
swift_top_words <-
  swift |>
  # Counting how often each word occurs
  count(word, name = 'freq') |>
  # Arranging the words in descending order of frequency
  arrange(desc(freq)) |>
  # Keeping only the top 100 words
  slice(1:100) |>
  # Adding the rank and the frequency expected under Zipf's law
  mutate(
    rank = row_number(),
    exp_freq = max(freq) / rank
  )

swift_top_words
## # A tibble: 100 × 4
##    word   freq  rank exp_freq
##    <chr> <int> <int>    <dbl>
##  1 love    427     1    427
##  2 time    390     2    214.
##  3 ooh     286     3    142.
##  4 baby    237     4    107.
##  5 gonna   224     5     85.4
##  6 ah      222     6     71.2
##  7 wanna   199     7     61
##  8 yeah    192     8     53.4
##  9 night   158     9     47.4
## 10 stay    139    10     42.7
## # ℹ 90 more rows
Create the scatterplot seen in Brightspace. The text for the 10 most common words has been included.

Note: Both methods of adding text to a graph (here, ggrepel::geom_text_repel() and annotate()) have been used. The placement of the labels for the 10 most common words is random, so your graph's text won't be positioned identically, but the words should be the same and the positions of the points should be the same.
RNGversion('4.1.0'); set.seed(2870)

ggplot(
  data = swift_top_words,
  mapping = aes(
    x = exp_freq,
    y = freq
  )
) +
  geom_point() +
  # The line the points should follow if Zipf's law is true
  geom_abline(
    intercept = 0,
    slope = 1,
    color = 'blue',
    linetype = 'dashed'
  ) +
  # Adding the text of the 10 most common words
  ggrepel::geom_text_repel(
    # Just displaying the 10 most common words
    data = swift_top_words |> slice_max(freq, n = 10),
    mapping = aes(label = word),
    nudge_y = .1
  ) +
  # Adding a note that the axes are in log10 scale
  annotate(
    geom = "text",
    x = 200,
    y = 30,
    label = "x and y-axes are in log10 scale"
    #fontface = "bold"
  ) +
  # Adding labels and titles
  labs(
    x = "Expected Frequency by Zipf's Law",
    y = "Actual Frequency",
    title = "Does Zipf's Law Explain the Frequency of Taylor Swift Lyrics?",
    subtitle = "Blue line indicates the trend if Zipf's Law holds",
    caption = 'Stop words excluded'
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  # Changing the axes to log10 scales
  scale_x_log10() +
  scale_y_log10()
Does it appear that Taylor Swift's 100 most commonly used words follow Zipf's Law? If yes, explain why. If no, do the words occur more frequently than they should or less frequently than they should?
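If you'd like a numeric check to go with the plot (optional, and not part of the prompt), a quick sketch like this compares each actual count to its Zipf prediction; ratio is just an illustrative column name:

swift_top_words |>
  # ratio > 1: the word occurs more often than Zipf's law predicts;
  # ratio < 1: less often
  mutate(ratio = freq / exp_freq) |>
  summarize(
    mean_ratio = mean(ratio),
    prop_above = mean(ratio > 1)  # share of words above the Zipf line
  )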
Create the graph seen in Brightspace of how often Taylor Swift's 10 most common words are used in each album. You'll need to use the swift data set, not the data set you created for Question 1, and the album_simple column instead of album.

Note: You'll need to use complete(album_simple, word, fill = list(n = 0)), where n represents the column of word counts, at the end of the pipe chain.
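If complete() is new to you, here's a minimal toy example (with made-up data) showing how it adds the missing album/word combinations with a count of 0:

# Toy data: 'red' never appears on Album B, so count() would drop that combo
toy_counts <- tibble::tribble(
  ~album_simple, ~word,  ~n,
  'Album A',     'red',   3,
  'Album A',     'blue',  1,
  'Album B',     'blue',  2
)

toy_counts |>
  complete(album_simple, word, fill = list(n = 0))
# The result now includes an Album B / 'red' row with n = 0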
swift |>
  # Top 10 most common words
  filter(word %in% (swift_top_words |> slice_max(freq, n = 10) |> pull(word))) |>
  # Counting how often they occur per album
  count(album_simple, word) |>
  # Including album/word combos that never occur with a count of 0
  complete(album_simple, word, fill = list(n = 0)) |>
  # Creating the graph
  ggplot(
    mapping = aes(
      y = album_simple,
      x = word,
      fill = n
    )
  ) +
  geom_tile(
    color = 'white',
    show.legend = FALSE
  ) +
  # Adding the text for the word frequency
  ggfittext::geom_fit_text(
    mapping = aes(label = n),
    contrast = TRUE,
    show.legend = FALSE
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Frequency of Taylor Swift's 10 Most Common Words by Album"
  ) +
  theme(plot.title = element_text(hjust = 0.5)) +
  # Removing the buffer space around the tiles
  coord_cartesian(expand = FALSE) +
  # Changing the color gradient
  #scale_fill_viridis_c() +
  taylor::scale_fill_taylor_c() +
  # The labeller below wraps THE TORTURED POETS DEPARTMENT onto 2 lines
  scale_y_discrete(
    labels = ~ if_else(
      nchar(.x) > 13,
      paste0(substr(.x, 1, 13), '\n', substr(.x, 14, nchar(.x))),
      .x
    )
  )
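As a side note, if you'd rather wrap the long album name at a word boundary than at a fixed character position, stringr::str_wrap() can stand in for the if_else() labeller; the width of 18 here is a guess chosen so that THE TORTURED POETS DEPARTMENT breaks into two lines:

# Drop-in replacement for the scale_y_discrete() call above
scale_y_discrete(labels = ~ stringr::str_wrap(.x, width = 18))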
Use the data created in the code chunk below for this question.
swift_q3 <-
  swift |>
  filter(
    word %in% c('baby', 'bad', 'eyes', 'feel', 'heard', 'home', 'life',
                'night', 'stay', 'time', 'call', 'run', 'red')
  ) |>
  count(album, word) |>
  pivot_wider(
    id_cols = word,
    names_from = album,
    values_from = n,
    values_fill = 0
  )

swift_q3
## # A tibble: 13 × 12
##    word  `Taylor Swift` `Fearless (Taylor's Version)` Speak Now (Taylor's Vers…¹
##    <chr>          <int>                         <int>                      <int>
##  1 baby              15                            52                         14
##  2 bad                5                             1                          5
##  3 eyes              12                            13                         16
##  4 feel               5                            45                         10
##  5 heard              2                             3                          2
##  6 home              10                             8                          8
##  7 life               6                             5                         25
##  8 night              8                            24                         13
##  9 stay               5                            11                          7
## 10 time              15                            39                         47
## 11 call               0                             5                          7
## 12 run                0                            12                         12
## 13 red                0                             0                          0
## # ℹ abbreviated name: ¹`Speak Now (Taylor's Version)`
## # ℹ 8 more variables: `Red (Taylor's Version)` <int>,
## #   `1989 (Taylor's Version)` <int>, reputation <int>, Lover <int>,
## #   folklore <int>, evermore <int>, Midnights <int>,
## #   `THE TORTURED POETS DEPARTMENT` <int>
Create the graph seen in Brightspace by adding to the code chunk below. The code to create the data will be very similar to the code for Question 2, but you'll only want two albums: Taylor Swift and THE TORTURED POETS DEPARTMENT.

Note: To get the code chunk to work, you'll want to use album, not album_simple.
swift |>
  # Counting how often each word occurs per album
  count(album, word) |>
  # Including word/album combos with a count of 0
  complete(album, word, fill = list(n = 0)) |>
  filter(
    # The first and last albums
    album %in% c('Taylor Swift', 'THE TORTURED POETS DEPARTMENT'),
    word %in% c('baby', 'bad', 'eyes', 'feel', 'heard',
                'night', 'time', 'call', 'life', 'home')
  ) |>
  # Creating the graph
  ggplot(
    mapping = aes(
      x = n,
      y = fct_reorder(word, n, max)
    )
  ) +
  # Line segment connecting the two albums' counts for each word
  geom_line() +
  geom_point(
    mapping = aes(color = album),
    size = 3
  ) +
  labs(
    title = "Word Counts for Taylor's <span style='color:lightblue;'>Taylor Swift</span> vs <span style='color:grey70;'>TTPD</span> Albums",
    x = NULL,
    y = NULL
  ) +
  theme_bw() +
  theme(
    legend.position = 'none',
    # element_markdown() renders the HTML spans in the title
    plot.title = ggtext::element_markdown(hjust = 0.5)
  ) +
  # Colors are assigned in the order of the album factor levels:
  # Taylor Swift first (lightblue), TTPD last (grey70)
  scale_color_manual(
    values = c('lightblue', 'grey70')
  )