Hi all–
Great lab! Hope we figured out some of the flexibility and power of the dplyr and tidyverse approach.
The first question asks us to generate a new variable, called d1$decade, which will allow us to group together all the separate charting results within decade-long groups, and do calculations inside those groups.
We’ll start by loading the packages and the data:
library(plyr)
library(tidyverse)
library(magrittr)
library(lubridate)
d1 <- "https://github.com/thomasjwood/ps4160/raw/master/billboard_58_21.rds" %>%
  url %>%
  gzcon %>%
  readRDS
Then we’ll generate our decade indicator. Let’s step through this process piece by piece.
d1$decade <- d1$week_id %>%
  floor_date(years(10)) %>%
  year %>%
  as.numeric %>%
  str_c("s") %>%
  fct_inorder
The first bit
d1$week_id %>%
  floor_date(years(10))
takes the week indicator and rounds each date down to the start of its decade. We could have said years(1) to round down to the year, or years(100) to round down to the century. (To round up, we would use ceiling_date.)
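To make the rounding concrete, here is a minimal sketch on three made-up dates (they stand in for d1$week_id and are not taken from the Billboard data). It assumes lubridate's string units such as "10 years", an alternative spelling to the years(10) Period used above.

```r
library(lubridate)

# Made-up dates standing in for d1$week_id (an assumption for illustration)
week_id <- as.Date(c("1958-08-02", "1987-06-13", "2021-05-29"))

by_year    <- floor_date(week_id, "year")        # start of each year
by_decade  <- floor_date(week_id, "10 years")    # start of each decade
by_century <- floor_date(week_id, "100 years")   # start of each century
up_decade  <- ceiling_date(week_id, "10 years")  # start of the following decade

year(by_decade)  # 1950 1980 2020
```

Only the year part survives the next step of the pipeline, which is why the month and day that floor_date leaves behind do not matter here.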
The next bit
d1$week_id %>%
  floor_date(years(10)) %>%
  year
takes the new date vector and keeps only the year.
The final bit
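d1$decade <- d1$week_id %>%
  floor_date(years(10)) %>%
  year %>%
  as.numeric %>%
  str_c("s") %>%
  fct_inorder
replaces the year with a nicely labelled factor (for example, 1980 becomes 1980s). Because fct_inorder orders the levels by first appearance, the levels come out in chronological order as long as the rows of d1 are sorted by week_id.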
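A quick way to see what fct_inorder does with row order is to try it on a tiny made-up vector of decade labels (the labels below are illustrative, not pulled from d1). This is only a sketch of the behaviour, under the assumption that the labels are plain character strings.

```r
library(forcats)

# Made-up decade labels in a deliberately non-chronological order
shuffled <- c("1980s", "1960s", "1980s", "1970s")

# Levels follow first appearance, so they mirror the row order
lv_as_is  <- levels(fct_inorder(shuffled))        # "1980s" "1960s" "1970s"

# Sorting first puts the levels in chronological order
lv_sorted <- levels(fct_inorder(sort(shuffled)))  # "1960s" "1970s" "1980s"
```

If the rows were not already in date order, sorting by week_id before building the factor would be one way to keep the decades chronological.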
Armed with this variable, we want to:
For every artist-decade, compute the number of unique charted songs. That is, if an artist has charted in several decades, count their songs separately within each decade (Taylor Swift demonstrates this possibility in our answer!)
Sort the table by decade, and then by the number of unique charted songs
By decade, return the top 3 rows (each row is an artist-decade)
which should be
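d1 %>%
  group_by(
    decade, performer
  ) %>%
  # summarize() drops the performer grouping, so the result stays grouped by decade
  summarize(
    tracks = song_id %>% unique %>% length
  ) %>%
  # arrange() ignores groups by default, so this sorts the whole table by tracks
  arrange(desc(tracks)) %>%
  # slice() works within each decade, keeping that decade's first three rows
  slice(1:3) %>%
  arrange(
    desc(decade)
  ) %>%
  print(n = 24)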
which should return
Whoa, Rascal Flatts, Toby Keith. Country music fans really kept buying singles longer than anyone else, in the face of downloading. Gosh.
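If the group-then-slice pattern in the answer above feels opaque, a small made-up table can help. The sketch below uses invented performers and song ids, swaps in n_distinct() for unique %>% length, and keeps only the top row per decade, so it illustrates the idea and is not a copy of the lab answer.

```r
library(dplyr)

# Invented chart rows: one row per song-week (an assumption for illustration)
toy <- tibble(
  decade    = c("1990s", "1990s", "1990s", "1990s", "2000s", "2000s", "2000s"),
  performer = c("A", "A", "A", "B", "A", "B", "B"),
  song_id   = c("a1", "a1", "a2", "b1", "a3", "b2", "b3")
)

top1 <- toy %>%
  group_by(decade, performer) %>%
  # count unique songs per artist-decade; keep the decade grouping afterwards
  summarize(tracks = n_distinct(song_id), .groups = "drop_last") %>%
  # sort within each decade, largest count first
  arrange(desc(tracks), .by_group = TRUE) %>%
  # keep the first row of each decade
  slice(1) %>%
  ungroup()

top1
```

Here performer A has two unique songs in the 1990s (a1 charts twice but counts once), and performer B has two in the 2000s, so each tops its decade.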
Ok, the cursed question, withdrawn as soon as it was issued. What did it ask?
Compare songs that debuted in the top 40 to those which debuted outside the top 40. By decade, which group of songs spends longer on the charts?
I’ll show you how I did this:
t1 <- d1 %>%
  left_join(
    # one row per song: its first charting week, flagged by debut position
    d1 %>%
      select(
        song_id, week_id, week_position
      ) %>%
      arrange(song_id, week_id) %>%
      group_by(song_id) %>%
      slice(1) %>%
      mutate(
        debut = case_when(
          week_position %>%
            is_weakly_less_than(40) ~ "debut_in_top40",
          TRUE ~ "debut_out_top40"
        )
      ) %>%
      select(song_id, debut)
  ) %>%
  select(
    song_id, week_id, song, performer, decade, debut, weeks_on_chart
  ) %>%
  arrange(
    song_id, week_id
  ) %>%
  # keep each song's last charting week, used here for its total weeks on chart
  group_by(song_id) %>%
  slice(n()) %>%
  group_by(decade, debut) %>%
  summarize(
    week_mu = weeks_on_chart %>% mean
  ) %>%
  # one column per debut group, then the difference between the two
  spread(debut, week_mu) %>%
  ungroup %>%
  mutate(
    time = debut_in_top40 - debut_out_top40
  )
Which should return
You can maybe see why I needed to use left_join: I needed a separate indicator, for each song, of whether that song debuted in the top 40 (i.e., what its chart position was in its first week). Then I used that indicator to compare the total number of charting weeks.
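The join logic is easier to follow on a couple of invented songs. This sketch assumes a toy table with the same column names as d1 (song_id, week_id, week_position, weeks_on_chart) and uses a plain <= comparison with if_else in place of the magrittr alias and case_when; the values are made up.

```r
library(dplyr)

# Two invented songs: "x" debuts at 35 (inside the top 40), "y" at 80 (outside)
toy <- tibble(
  song_id        = c("x", "x", "x", "y", "y"),
  week_id        = as.Date(c("2001-01-06", "2001-01-13", "2001-01-20",
                             "2001-02-03", "2001-02-10")),
  week_position  = c(35, 20, 50, 80, 60),
  weeks_on_chart = c(1, 2, 3, 1, 2)
)

# One row per song: the first charting week decides the debut label
debuts <- toy %>%
  arrange(song_id, week_id) %>%
  group_by(song_id) %>%
  slice(1) %>%
  ungroup() %>%
  mutate(debut = if_else(week_position <= 40,
                         "debut_in_top40", "debut_out_top40")) %>%
  select(song_id, debut)

# Attach the label to every weekly row of the same song
joined <- toy %>% left_join(debuts, by = "song_id")

# Keep each song's last week, whose weeks_on_chart is its total here
totals <- joined %>%
  arrange(song_id, week_id) %>%
  group_by(song_id) %>%
  slice(n()) %>%
  ungroup()
```

Song x keeps its top-40 debut label even in the week it slips to position 50, which is the point of computing the label once per song and joining it back.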
This question asks:
By my estimation, there are six artists who’ve had at least one top ten charting single in four (or more) separate decades. Who are they?
So we need to:
Look at only those songs which have charted in the top 10 for at least 1 week
Among this subgroup of songs, by performer, report the number of unique decades
which is given by
d1 %>%
  filter(
    week_position %>%
      is_in(1:10)
  ) %>%
  group_by(performer) %>%
  summarize(
    nd = decade %>% unique %>% length
  ) %>%
  arrange(desc(nd)) %>%
  filter(
    nd >= 4
  )
which should return
  performer          nd
  <chr>           <int>
1 Andy Williams       5
2 Aerosmith           4
3 Cher                4
4 Mariah Carey        4
5 Michael Jackson     4
6 Whitney Houston     4
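For anyone who wants to check the filter-then-count logic by hand, here is a minimal sketch on invented rows. The performers, decades and positions are made up, the threshold is lowered to three decades to suit the toy data, and n_distinct() stands in for unique %>% length.

```r
library(dplyr)

# Invented weekly chart rows for two hypothetical performers
toy <- tibble(
  performer     = c("A", "A", "A", "B", "B", "B"),
  decade        = c("1980s", "1990s", "2000s", "1980s", "1990s", "2000s"),
  week_position = c(3, 8, 10, 5, 40, 12)
)

res <- toy %>%
  # keep only weeks spent in the top 10
  filter(week_position <= 10) %>%
  group_by(performer) %>%
  # count the distinct decades among those weeks
  summarize(nd = n_distinct(decade)) %>%
  filter(nd >= 3)

res
```

Performer B charts in three decades but reaches the top 10 in only one of them, so only performer A survives the final filter.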