Project 2 - Dataset 2

Dataset 1 - “Rolling Stone Album Rankings”

For the second untidy data set, I chose the Rolling Stone Album Rankings. This data set compares album rankings from 2003, 2012, and, 2020.

This data set is untidy because each row represents an album for multiple years (each year being a unique variable). If we now wanted to add in the year 2024, we would need to add a column making the data frame wider. Instead, each observation should represent and album and year combination.

Source: https://github.com/rfordatascience/tidytuesday/tree/main/data/2024/2024-05-07

Load the Libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rvest)

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(stringi)
library(tidyr)

Read the Data

df <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/refs/heads/main/data/2024/2024-05-07/rolling_stone.csv")

## Rows: 691 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): sort_name, clean_name, album, genre, type, spotify_url, artist_gen...
## dbl (13): rank_2003, rank_2012, rank_2020, differential, release_year, weeks...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df

## # A tibble: 691 × 21
##    sort_name      clean_name    album rank_2003 rank_2012 rank_2020 differential
##    <chr>          <chr>         <chr>     <dbl>     <dbl>     <dbl>        <dbl>
##  1 Sinatra, Frank Frank Sinatra "In …       100       101       282         -182
##  2 Diddley, Bo    Bo Diddley    "Bo …       214       216       455         -241
##  3 Presley, Elvis Elvis Presley "Elv…        55        56       332         -277
##  4 Sinatra, Frank Frank Sinatra "Son…       306       308        NA         -195
##  5 Little Richard Little Richa… "Her…        50        50       227         -177
##  6 Beyonce        Beyonce       "Lem…        NA        NA        32          469
##  7 Winehouse, Amy Amy Winehouse "Bac…        NA       451        33          468
##  8 Crickets       Buddy Holly   "The…       421       420        NA          -80
##  9 Bush, Kate     Kate Bush     "Hou…        NA        NA        68          433
## 10 Davis, Miles   Miles Davis   "Kin…        12        12        31          -19
## # ℹ 681 more rows
## # ℹ 14 more variables: release_year <dbl>, genre <chr>, type <chr>,
## #   weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## #   spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## #   artist_gender <chr>, artist_birth_year_sum <dbl>,
## #   debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## #   years_between <dbl>, album_id <chr>

We now have the data frame from Excel, so let’s tidy up the data.

Tidy the Data

The individual year variables are actually values that should be in a general year column:

df2 <- df %>%
  pivot_longer(cols = rank_2003:rank_2020, names_to = "year", values_to = "rank")
head(df2, 10)

## # A tibble: 10 × 20
##    sort_name      clean_name    album      differential release_year genre type 
##    <chr>          <chr>         <chr>             <dbl>        <dbl> <chr> <chr>
##  1 Sinatra, Frank Frank Sinatra In the We…         -182         1955 Big … Stud…
##  2 Sinatra, Frank Frank Sinatra In the We…         -182         1955 Big … Stud…
##  3 Sinatra, Frank Frank Sinatra In the We…         -182         1955 Big … Stud…
##  4 Diddley, Bo    Bo Diddley    Bo Diddle…         -241         1955 Rock… Stud…
##  5 Diddley, Bo    Bo Diddley    Bo Diddle…         -241         1955 Rock… Stud…
##  6 Diddley, Bo    Bo Diddley    Bo Diddle…         -241         1955 Rock… Stud…
##  7 Presley, Elvis Elvis Presley Elvis Pre…         -277         1956 Rock… Stud…
##  8 Presley, Elvis Elvis Presley Elvis Pre…         -277         1956 Rock… Stud…
##  9 Presley, Elvis Elvis Presley Elvis Pre…         -277         1956 Rock… Stud…
## 10 Sinatra, Frank Frank Sinatra Songs for…         -195         1956 Big … Stud…
## # ℹ 13 more variables: weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## #   spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## #   artist_gender <chr>, artist_birth_year_sum <dbl>,
## #   debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## #   years_between <dbl>, album_id <chr>, year <chr>, rank <dbl>

We now have a tidy data frame where each row represent a year and album combination. Note, I did not change the meaning of any of the current variables such as “differential” because I would be changing the data.

Let’s also change the year variable to exclude rank_ from the values since that came from the transformation.

df2$year <- str_sub(df2$year,6)
df2

## # A tibble: 2,073 × 20
##    sort_name      clean_name    album      differential release_year genre type 
##    <chr>          <chr>         <chr>             <dbl>        <dbl> <chr> <chr>
##  1 Sinatra, Frank Frank Sinatra In the We…         -182         1955 Big … Stud…
##  2 Sinatra, Frank Frank Sinatra In the We…         -182         1955 Big … Stud…
##  3 Sinatra, Frank Frank Sinatra In the We…         -182         1955 Big … Stud…
##  4 Diddley, Bo    Bo Diddley    Bo Diddle…         -241         1955 Rock… Stud…
##  5 Diddley, Bo    Bo Diddley    Bo Diddle…         -241         1955 Rock… Stud…
##  6 Diddley, Bo    Bo Diddley    Bo Diddle…         -241         1955 Rock… Stud…
##  7 Presley, Elvis Elvis Presley Elvis Pre…         -277         1956 Rock… Stud…
##  8 Presley, Elvis Elvis Presley Elvis Pre…         -277         1956 Rock… Stud…
##  9 Presley, Elvis Elvis Presley Elvis Pre…         -277         1956 Rock… Stud…
## 10 Sinatra, Frank Frank Sinatra Songs for…         -195         1956 Big … Stud…
## # ℹ 2,063 more rows
## # ℹ 13 more variables: weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## #   spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## #   artist_gender <chr>, artist_birth_year_sum <dbl>,
## #   debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## #   years_between <dbl>, album_id <chr>, year <chr>, rank <dbl>

We now have a tidy data frame that represents the same data as the original, but is in a tidy format.

Analysis

For the analysis, let’s first see how the average Spotify popularity for each album release year changes over time. How do albums released a few decades ago compare to recent albums for a modern music platform like Spotify?

spotify_popularity_release_year <- df2 %>%
  group_by(release_year) %>%
  summarise(mean_spotify_popularity = mean(spotify_popularity))
spotify_popularity_release_year

## # A tibble: 65 × 2
##    release_year mean_spotify_popularity
##           <dbl>                   <dbl>
##  1         1955                    49  
##  2         1956                    60  
##  3         1957                    55.5
##  4         1958                    47  
##  5         1959                    43.5
##  6         1960                    45.3
##  7         1961                    NA  
##  8         1962                    NA  
##  9         1963                    59  
## 10         1964                    NA  
## # ℹ 55 more rows

ggplot(data = spotify_popularity_release_year, aes(x = release_year, y = mean_spotify_popularity)) +
  geom_line()

This is not the best graph since there isn’t a top Rolling Stone album released each year, but we can somewhat see the general trend. It does seem like older, more classic albums, have lower popularity than recent albums. This makes sense because younger people are more likely to use Spotify and might not be listening to older, more classic albums.

Now let’s add in genre to see if that has changed over time:

spotify_popularity_release_year <- df2 %>%
  group_by(release_year, genre) %>%
  summarise(mean_spotify_popularity = mean(spotify_popularity))

## `summarise()` has grouped output by 'release_year'. You can override using the
## `.groups` argument.

ggplot(data = spotify_popularity_release_year, aes(x = release_year, y = mean_spotify_popularity, color = genre)) +
  geom_point()

## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).

Since there are so many genres, it makes the graph a bit difficult to read. Looking at the colors, you can see there are a cluster of Soul/Gospel/R&B albums with a mean ranking of almost 50 that were released a little before 1970. 1960 - 1990 has the most Soul/Gospel/R&B albums. Additionally, Hip-Hop/Rap didn’t start becoming popular until around 1990.

Now let’s look at the average current rankings of Soul/Gospel/R&B for each year in this table (which is 2003, 2012, and 2020):

rank_genre <- df2 %>%
  filter(genre == "Soul/Gospel/R&B", !is.na(rank)) %>%
  group_by(year) %>%
  summarise(mean_rank = mean(rank))
rank_genre

## # A tibble: 3 × 2
##   year  mean_rank
##   <chr>     <dbl>
## 1 2003       226.
## 2 2012       226.
## 3 2020       265.

ggplot(data = rank_genre, aes(x = year, y = mean_rank)) +
  geom_point()

What’s interesting is that 2003 and 2012 have similar rankings, but 2020 there is a jump.

Now let’s focus on the year 2020 and average weeks on the billboard for each genre:

genre_2020 <- df2 %>%
  filter(year == "2020", !is.na(weeks_on_billboard)) %>%
  group_by(genre) %>%
  summarise(mean_weeks_on_billboard = mean(weeks_on_billboard))
genre_2020

## # A tibble: 16 × 2
##    genre                               mean_weeks_on_billboard
##    <chr>                                                 <dbl>
##  1 Big Band/Jazz                                          36.8
##  2 Blues/Blues ROck                                       42  
##  3 Blues/Blues Rock                                       46.7
##  4 Country/Folk/Country Rock/Folk Rock                    58.4
##  5 Electronic                                             45.9
##  6 Funk/Disco                                             49.2
##  7 Hard Rock/Metal                                        89  
##  8 Hip-Hop/Rap                                            85.6
##  9 Indie/Alternative Rock                                 74.3
## 10 Latin                                                  54.6
## 11 Punk/Post-Punk/New Wave/Power Pop                      42.8
## 12 Reggae                                                 17.1
## 13 Rock n' Roll/Rhythm & Blues                            47.8
## 14 Singer-Songwriter/Heartland Rock                       86.8
## 15 Soul/Gospel/R&B                                        79.2
## 16 <NA>                                                   62.1

ggplot(genre_2020, aes(x=genre, y=mean_weeks_on_billboard)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

As you can see, Hard Rock/Metal, Heartland Rock and Hip-Hop/Rap take the lead.

Now let’s look at the most popular artists based off of most weeks on the billboard for all albums:

artist_weeks_on_billboard <- df2 %>%
  filter(year == "2020", !is.na(weeks_on_billboard)) %>%
  group_by(clean_name) %>%
  summarise(sum_weeks_on_billboard = sum(weeks_on_billboard)) %>%
  arrange(desc(sum_weeks_on_billboard))
artist_weeks_on_billboard

## # A tibble: 321 × 2
##    clean_name        sum_weeks_on_billboard
##    <chr>                              <dbl>
##  1 Kendrick Lamar                      1042
##  2 Pink Floyd                           906
##  3 The Beatles                          845
##  4 Drake                                758
##  5 Taylor Swift                         649
##  6 Bruce Springsteen                    629
##  7 Adele                                588
##  8 Rolling Stones                       525
##  9 U2                                   522
## 10 Frank Ocean                          504
## # ℹ 311 more rows

As you can see, Kendrick Lamar takes the lead.

Let’s see if there is a particular album from Kendrick that is causing this:

kendrick_albums <- df2 %>%
  filter(year == "2020", clean_name == "Kendrick Lamar") %>%
  group_by(album) %>%
  summarise(sum_weeks_on_billboard = sum(weeks_on_billboard))
kendrick_albums

## # A tibble: 3 × 2
##   album                  sum_weeks_on_billboard
##   <chr>                                   <dbl>
## 1 DAMN                                      342
## 2 To Pimp a Butterfly                       125
## 3 good kid, m.A.A.d city                    575

kendrick_albums2 <- kendrick_albums %>%
  mutate(weeks_on_billboard_perc = sum_weeks_on_billboard / sum(sum_weeks_on_billboard)) %>%
  arrange(desc(weeks_on_billboard_perc))
kendrick_albums2

## # A tibble: 3 × 3
##   album                  sum_weeks_on_billboard weeks_on_billboard_perc
##   <chr>                                   <dbl>                   <dbl>
## 1 good kid, m.A.A.d city                    575                   0.552
## 2 DAMN                                      342                   0.328
## 3 To Pimp a Butterfly                       125                   0.120

ggplot(kendrick_albums2, aes(x=album, y=weeks_on_billboard_perc)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

good kid, m.A.A.d city has the most weeks on the billboard. This could also mean that this album has been out for the longest amount of time (earliest release year).

Finally let’s take a look at gender and Spotify popularity:

gender <- df2 %>%
  filter(year == "2020", !is.na(spotify_popularity)) %>%
  group_by(artist_gender) %>%
  summarise(mean_spotify_popularity = mean(spotify_popularity))
gender

## # A tibble: 4 × 2
##   artist_gender mean_spotify_popularity
##   <chr>                           <dbl>
## 1 Female                           58.1
## 2 Male                             55.8
## 3 Male/Female                      53.3
## 4 <NA>                             25

ggplot(gender, aes(x=artist_gender, y=mean_spotify_popularity)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

As you can see, Females on average have a slightly more popular rating than Males and groups/bands (Male/Female).

Taking a look at 2020, let’s see which genres Males and Females are creating albums for:

female_solo <- df2 %>%
  filter(year == "2020", !is.na(spotify_popularity), artist_gender == "Female") %>%
  group_by(genre) %>%
  summarise(mean_spotify_popularity = mean(spotify_popularity)) %>%
  arrange(desc(mean_spotify_popularity))
female_solo

## # A tibble: 12 × 2
##    genre                               mean_spotify_popularity
##    <chr>                                                 <dbl>
##  1 Latin                                                  73  
##  2 Singer-Songwriter/Heartland Rock                       68  
##  3 Funk/Disco                                             64.5
##  4 Hip-Hop/Rap                                            64  
##  5 <NA>                                                   61.4
##  6 Soul/Gospel/R&B                                        60.2
##  7 Indie/Alternative Rock                                 57.3
##  8 Electronic                                             56  
##  9 Country/Folk/Country Rock/Folk Rock                    51.4
## 10 Punk/Post-Punk/New Wave/Power Pop                      49.8
## 11 Big Band/Jazz                                          45  
## 12 Blues/Blues ROck                                       27

ggplot(female_solo, aes(x=genre, y=mean_spotify_popularity)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

male_solo <- df2 %>%
  filter(year == "2020", !is.na(spotify_popularity), artist_gender == "Male") %>%
  group_by(genre) %>%
  summarise(mean_spotify_popularity = mean(spotify_popularity)) %>%
  arrange(desc(mean_spotify_popularity))
male_solo

## # A tibble: 16 × 2
##    genre                               mean_spotify_popularity
##    <chr>                                                 <dbl>
##  1 Latin                                                  74  
##  2 Indie/Alternative Rock                                 64.7
##  3 Hip-Hop/Rap                                            64.2
##  4 Hard Rock/Metal                                        60.8
##  5 Reggae                                                 59.2
##  6 Country/Folk/Country Rock/Folk Rock                    57.5
##  7 Electronic                                             56.8
##  8 Soul/Gospel/R&B                                        54.7
##  9 Singer-Songwriter/Heartland Rock                       54.4
## 10 Punk/Post-Punk/New Wave/Power Pop                      52.3
## 11 <NA>                                                   52.0
## 12 Blues/Blues Rock                                       52.0
## 13 Funk/Disco                                             51.1
## 14 Rock n' Roll/Rhythm & Blues                            51  
## 15 Big Band/Jazz                                          44.3
## 16 Afrobeat                                               31.5

ggplot(male_solo, aes(x=genre, y=mean_spotify_popularity)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

group <- df2 %>%
  filter(year == "2020", !is.na(spotify_popularity), artist_gender == "Male/Female") %>%
  group_by(genre) %>%
  summarise(mean_spotify_popularity = mean(spotify_popularity)) %>%
  arrange(desc(mean_spotify_popularity))
group

## # A tibble: 10 × 2
##    genre                             mean_spotify_popularity
##    <chr>                                               <dbl>
##  1 Blues/Blues Rock                                     68.3
##  2 Electronic                                           65  
##  3 Singer-Songwriter/Heartland Rock                     64.5
##  4 Indie/Alternative Rock                               63  
##  5 Reggae                                               61  
##  6 <NA>                                                 51.8
##  7 Soul/Gospel/R&B                                      51.7
##  8 Punk/Post-Punk/New Wave/Power Pop                    49  
##  9 Funk/Disco                                           48.4
## 10 Hip-Hop/Rap                                          43

ggplot(group, aes(x=genre, y=mean_spotify_popularity)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

For Males and Females, Latin takes the lead. For Females, Singer-Songwriter has a higher average ranking than Female. Males have a higher Indie/Alternative Rock.

For groups, Blues/Blues Rock takes the lead as the most popular genre on average.

Conclusions

I think this type of data set was a good learning experience for taking variables representing individual years and turning them into part of an observation as a value in a general year column.