For the second untidy data set, I chose the Rolling Stone Album Rankings. This data set compares album rankings from 2003, 2012, and, 2020.
This data set is untidy because each row represents an album for multiple years (each year being a unique variable). If we now wanted to add in the year 2024, we would need to add a column making the data frame wider. Instead, each observation should represent and album and year combination.
Source: https://github.com/rfordatascience/tidytuesday/tree/main/data/2024/2024-05-07
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
library(stringi)
library(tidyr)
df <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/refs/heads/main/data/2024/2024-05-07/rolling_stone.csv")
## Rows: 691 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): sort_name, clean_name, album, genre, type, spotify_url, artist_gen...
## dbl (13): rank_2003, rank_2012, rank_2020, differential, release_year, weeks...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df
## # A tibble: 691 × 21
## sort_name clean_name album rank_2003 rank_2012 rank_2020 differential
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Sinatra, Frank Frank Sinatra "In … 100 101 282 -182
## 2 Diddley, Bo Bo Diddley "Bo … 214 216 455 -241
## 3 Presley, Elvis Elvis Presley "Elv… 55 56 332 -277
## 4 Sinatra, Frank Frank Sinatra "Son… 306 308 NA -195
## 5 Little Richard Little Richa… "Her… 50 50 227 -177
## 6 Beyonce Beyonce "Lem… NA NA 32 469
## 7 Winehouse, Amy Amy Winehouse "Bac… NA 451 33 468
## 8 Crickets Buddy Holly "The… 421 420 NA -80
## 9 Bush, Kate Kate Bush "Hou… NA NA 68 433
## 10 Davis, Miles Miles Davis "Kin… 12 12 31 -19
## # ℹ 681 more rows
## # ℹ 14 more variables: release_year <dbl>, genre <chr>, type <chr>,
## # weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## # spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## # artist_gender <chr>, artist_birth_year_sum <dbl>,
## # debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## # years_between <dbl>, album_id <chr>
We now have the data frame from Excel, so let’s tidy up the data.
The individual year variables are actually values that should be in a general year column:
df2 <- df %>%
pivot_longer(cols = rank_2003:rank_2020, names_to = "year", values_to = "rank")
head(df2, 10)
## # A tibble: 10 × 20
## sort_name clean_name album differential release_year genre type
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 Sinatra, Frank Frank Sinatra In the We… -182 1955 Big … Stud…
## 2 Sinatra, Frank Frank Sinatra In the We… -182 1955 Big … Stud…
## 3 Sinatra, Frank Frank Sinatra In the We… -182 1955 Big … Stud…
## 4 Diddley, Bo Bo Diddley Bo Diddle… -241 1955 Rock… Stud…
## 5 Diddley, Bo Bo Diddley Bo Diddle… -241 1955 Rock… Stud…
## 6 Diddley, Bo Bo Diddley Bo Diddle… -241 1955 Rock… Stud…
## 7 Presley, Elvis Elvis Presley Elvis Pre… -277 1956 Rock… Stud…
## 8 Presley, Elvis Elvis Presley Elvis Pre… -277 1956 Rock… Stud…
## 9 Presley, Elvis Elvis Presley Elvis Pre… -277 1956 Rock… Stud…
## 10 Sinatra, Frank Frank Sinatra Songs for… -195 1956 Big … Stud…
## # ℹ 13 more variables: weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## # spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## # artist_gender <chr>, artist_birth_year_sum <dbl>,
## # debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## # years_between <dbl>, album_id <chr>, year <chr>, rank <dbl>
We now have a tidy data frame where each row represent a year and album combination. Note, I did not change the meaning of any of the current variables such as “differential” because I would be changing the data.
Let’s also change the year
variable to exclude
rank_
from the values since that came from the
transformation.
df2$year <- str_sub(df2$year,6)
df2
## # A tibble: 2,073 × 20
## sort_name clean_name album differential release_year genre type
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 Sinatra, Frank Frank Sinatra In the We… -182 1955 Big … Stud…
## 2 Sinatra, Frank Frank Sinatra In the We… -182 1955 Big … Stud…
## 3 Sinatra, Frank Frank Sinatra In the We… -182 1955 Big … Stud…
## 4 Diddley, Bo Bo Diddley Bo Diddle… -241 1955 Rock… Stud…
## 5 Diddley, Bo Bo Diddley Bo Diddle… -241 1955 Rock… Stud…
## 6 Diddley, Bo Bo Diddley Bo Diddle… -241 1955 Rock… Stud…
## 7 Presley, Elvis Elvis Presley Elvis Pre… -277 1956 Rock… Stud…
## 8 Presley, Elvis Elvis Presley Elvis Pre… -277 1956 Rock… Stud…
## 9 Presley, Elvis Elvis Presley Elvis Pre… -277 1956 Rock… Stud…
## 10 Sinatra, Frank Frank Sinatra Songs for… -195 1956 Big … Stud…
## # ℹ 2,063 more rows
## # ℹ 13 more variables: weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## # spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## # artist_gender <chr>, artist_birth_year_sum <dbl>,
## # debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## # years_between <dbl>, album_id <chr>, year <chr>, rank <dbl>
We now have a tidy data frame that represents the same data as the original, but is in a tidy format.
For the analysis, let’s first see how the average Spotify popularity for each album release year changes over time. How do albums released a few decades ago compare to recent albums for a modern music platform like Spotify?
spotify_popularity_release_year <- df2 %>%
group_by(release_year) %>%
summarise(mean_spotify_popularity = mean(spotify_popularity))
spotify_popularity_release_year
## # A tibble: 65 × 2
## release_year mean_spotify_popularity
## <dbl> <dbl>
## 1 1955 49
## 2 1956 60
## 3 1957 55.5
## 4 1958 47
## 5 1959 43.5
## 6 1960 45.3
## 7 1961 NA
## 8 1962 NA
## 9 1963 59
## 10 1964 NA
## # ℹ 55 more rows
ggplot(data = spotify_popularity_release_year, aes(x = release_year, y = mean_spotify_popularity)) +
geom_line()
This is not the best graph since there isn’t a top Rolling Stone album released each year, but we can somewhat see the general trend. It does seem like older, more classic albums, have lower popularity than recent albums. This makes sense because younger people are more likely to use Spotify and might not be listening to older, more classic albums.
Now let’s add in genre to see if that has changed over time:
spotify_popularity_release_year <- df2 %>%
group_by(release_year, genre) %>%
summarise(mean_spotify_popularity = mean(spotify_popularity))
## `summarise()` has grouped output by 'release_year'. You can override using the
## `.groups` argument.
ggplot(data = spotify_popularity_release_year, aes(x = release_year, y = mean_spotify_popularity, color = genre)) +
geom_point()
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).
Since there are so many genres, it makes the graph a bit difficult to read. Looking at the colors, you can see there are a cluster of Soul/Gospel/R&B albums with a mean ranking of almost 50 that were released a little before 1970. 1960 - 1990 has the most Soul/Gospel/R&B albums. Additionally, Hip-Hop/Rap didn’t start becoming popular until around 1990.
Now let’s look at the average current rankings of Soul/Gospel/R&B for each year in this table (which is 2003, 2012, and 2020):
rank_genre <- df2 %>%
filter(genre == "Soul/Gospel/R&B", !is.na(rank)) %>%
group_by(year) %>%
summarise(mean_rank = mean(rank))
rank_genre
## # A tibble: 3 × 2
## year mean_rank
## <chr> <dbl>
## 1 2003 226.
## 2 2012 226.
## 3 2020 265.
ggplot(data = rank_genre, aes(x = year, y = mean_rank)) +
geom_point()
What’s interesting is that 2003 and 2012 have similar rankings, but 2020 there is a jump.
Now let’s focus on the year 2020 and average weeks on the billboard for each genre:
genre_2020 <- df2 %>%
filter(year == "2020", !is.na(weeks_on_billboard)) %>%
group_by(genre) %>%
summarise(mean_weeks_on_billboard = mean(weeks_on_billboard))
genre_2020
## # A tibble: 16 × 2
## genre mean_weeks_on_billboard
## <chr> <dbl>
## 1 Big Band/Jazz 36.8
## 2 Blues/Blues ROck 42
## 3 Blues/Blues Rock 46.7
## 4 Country/Folk/Country Rock/Folk Rock 58.4
## 5 Electronic 45.9
## 6 Funk/Disco 49.2
## 7 Hard Rock/Metal 89
## 8 Hip-Hop/Rap 85.6
## 9 Indie/Alternative Rock 74.3
## 10 Latin 54.6
## 11 Punk/Post-Punk/New Wave/Power Pop 42.8
## 12 Reggae 17.1
## 13 Rock n' Roll/Rhythm & Blues 47.8
## 14 Singer-Songwriter/Heartland Rock 86.8
## 15 Soul/Gospel/R&B 79.2
## 16 <NA> 62.1
ggplot(genre_2020, aes(x=genre, y=mean_weeks_on_billboard)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
As you can see, Hard Rock/Metal, Heartland Rock and Hip-Hop/Rap take the lead.
Now let’s look at the most popular artists based off of most weeks on the billboard for all albums:
artist_weeks_on_billboard <- df2 %>%
filter(year == "2020", !is.na(weeks_on_billboard)) %>%
group_by(clean_name) %>%
summarise(sum_weeks_on_billboard = sum(weeks_on_billboard)) %>%
arrange(desc(sum_weeks_on_billboard))
artist_weeks_on_billboard
## # A tibble: 321 × 2
## clean_name sum_weeks_on_billboard
## <chr> <dbl>
## 1 Kendrick Lamar 1042
## 2 Pink Floyd 906
## 3 The Beatles 845
## 4 Drake 758
## 5 Taylor Swift 649
## 6 Bruce Springsteen 629
## 7 Adele 588
## 8 Rolling Stones 525
## 9 U2 522
## 10 Frank Ocean 504
## # ℹ 311 more rows
As you can see, Kendrick Lamar takes the lead.
Let’s see if there is a particular album from Kendrick that is causing this:
kendrick_albums <- df2 %>%
filter(year == "2020", clean_name == "Kendrick Lamar") %>%
group_by(album) %>%
summarise(sum_weeks_on_billboard = sum(weeks_on_billboard))
kendrick_albums
## # A tibble: 3 × 2
## album sum_weeks_on_billboard
## <chr> <dbl>
## 1 DAMN 342
## 2 To Pimp a Butterfly 125
## 3 good kid, m.A.A.d city 575
kendrick_albums2 <- kendrick_albums %>%
mutate(weeks_on_billboard_perc = sum_weeks_on_billboard / sum(sum_weeks_on_billboard)) %>%
arrange(desc(weeks_on_billboard_perc))
kendrick_albums2
## # A tibble: 3 × 3
## album sum_weeks_on_billboard weeks_on_billboard_perc
## <chr> <dbl> <dbl>
## 1 good kid, m.A.A.d city 575 0.552
## 2 DAMN 342 0.328
## 3 To Pimp a Butterfly 125 0.120
ggplot(kendrick_albums2, aes(x=album, y=weeks_on_billboard_perc)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
good kid, m.A.A.d city has the most weeks on the billboard. This could
also mean that this album has been out for the longest amount of time
(earliest release year).
Finally let’s take a look at gender and Spotify popularity:
gender <- df2 %>%
filter(year == "2020", !is.na(spotify_popularity)) %>%
group_by(artist_gender) %>%
summarise(mean_spotify_popularity = mean(spotify_popularity))
gender
## # A tibble: 4 × 2
## artist_gender mean_spotify_popularity
## <chr> <dbl>
## 1 Female 58.1
## 2 Male 55.8
## 3 Male/Female 53.3
## 4 <NA> 25
ggplot(gender, aes(x=artist_gender, y=mean_spotify_popularity)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
As you can see, Females on average have a slightly more popular rating than Males and groups/bands (Male/Female).
Taking a look at 2020, let’s see which genres Males and Females are creating albums for:
female_solo <- df2 %>%
filter(year == "2020", !is.na(spotify_popularity), artist_gender == "Female") %>%
group_by(genre) %>%
summarise(mean_spotify_popularity = mean(spotify_popularity)) %>%
arrange(desc(mean_spotify_popularity))
female_solo
## # A tibble: 12 × 2
## genre mean_spotify_popularity
## <chr> <dbl>
## 1 Latin 73
## 2 Singer-Songwriter/Heartland Rock 68
## 3 Funk/Disco 64.5
## 4 Hip-Hop/Rap 64
## 5 <NA> 61.4
## 6 Soul/Gospel/R&B 60.2
## 7 Indie/Alternative Rock 57.3
## 8 Electronic 56
## 9 Country/Folk/Country Rock/Folk Rock 51.4
## 10 Punk/Post-Punk/New Wave/Power Pop 49.8
## 11 Big Band/Jazz 45
## 12 Blues/Blues ROck 27
ggplot(female_solo, aes(x=genre, y=mean_spotify_popularity)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
male_solo <- df2 %>%
filter(year == "2020", !is.na(spotify_popularity), artist_gender == "Male") %>%
group_by(genre) %>%
summarise(mean_spotify_popularity = mean(spotify_popularity)) %>%
arrange(desc(mean_spotify_popularity))
male_solo
## # A tibble: 16 × 2
## genre mean_spotify_popularity
## <chr> <dbl>
## 1 Latin 74
## 2 Indie/Alternative Rock 64.7
## 3 Hip-Hop/Rap 64.2
## 4 Hard Rock/Metal 60.8
## 5 Reggae 59.2
## 6 Country/Folk/Country Rock/Folk Rock 57.5
## 7 Electronic 56.8
## 8 Soul/Gospel/R&B 54.7
## 9 Singer-Songwriter/Heartland Rock 54.4
## 10 Punk/Post-Punk/New Wave/Power Pop 52.3
## 11 <NA> 52.0
## 12 Blues/Blues Rock 52.0
## 13 Funk/Disco 51.1
## 14 Rock n' Roll/Rhythm & Blues 51
## 15 Big Band/Jazz 44.3
## 16 Afrobeat 31.5
ggplot(male_solo, aes(x=genre, y=mean_spotify_popularity)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
group <- df2 %>%
filter(year == "2020", !is.na(spotify_popularity), artist_gender == "Male/Female") %>%
group_by(genre) %>%
summarise(mean_spotify_popularity = mean(spotify_popularity)) %>%
arrange(desc(mean_spotify_popularity))
group
## # A tibble: 10 × 2
## genre mean_spotify_popularity
## <chr> <dbl>
## 1 Blues/Blues Rock 68.3
## 2 Electronic 65
## 3 Singer-Songwriter/Heartland Rock 64.5
## 4 Indie/Alternative Rock 63
## 5 Reggae 61
## 6 <NA> 51.8
## 7 Soul/Gospel/R&B 51.7
## 8 Punk/Post-Punk/New Wave/Power Pop 49
## 9 Funk/Disco 48.4
## 10 Hip-Hop/Rap 43
ggplot(group, aes(x=genre, y=mean_spotify_popularity)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
For Males and Females, Latin takes the lead. For Females, Singer-Songwriter has a higher average ranking than Female. Males have a higher Indie/Alternative Rock.
For groups, Blues/Blues Rock takes the lead as the most popular genre on average.
I think this type of data set was a good learning experience for taking variables representing individual years and turning them into part of an observation as a value in a general year column.