The data this week comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. Make sure to check out the spotifyr package website to see how you can collect your own data!
Kaylin Pavlik recently published a blog post that uses the audio features to explore and classify songs. She used the spotifyr package to collect about 5000 songs from each of 6 main categories (EDM, Latin, Pop, R&B, Rap, & Rock).
library(tidyverse)
library(janitor)
library(knitr)
library(skimr)
spotify_songs_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
I used several packages for this analysis: tidyverse, janitor, skimr, and knitr.
# Recode mode, derive release year and era, and convert duration to minutes
spotify_songs <- spotify_songs_raw %>%
  mutate(
    mode = case_when(mode == 1 ~ "major",
                     mode == 0 ~ "minor"),
    track_album_release_date = as.Date(track_album_release_date),
    year = as.numeric(format(track_album_release_date, '%Y')),
    era = case_when(year <= 2000 ~ "under 00s era",
                    year > 2000 & year <= 2010 ~ "2010 era",
                    year > 2010 ~ "modern era"),
    duration_min = duration_ms / 60000) %>%
  # Drop identifier columns and the original millisecond duration
  select(-track_id, -track_album_id, -playlist_id, -duration_ms)
spotify_songs %>%
  select(track_name, track_artist, duration_min,
         track_popularity, playlist_genre,
         tempo) %>%
  head() %>%
  kable()
| track_name | track_artist | duration_min | track_popularity | playlist_genre | tempo |
|---|---|---|---|---|---|
| I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 3.2 | 66 | pop | 122 |
| Memories - Dillon Francis Remix | Maroon 5 | 2.7 | 67 | pop | 100 |
| All the Time - Don Diablo Remix | Zara Larsson | 2.9 | 70 | pop | 124 |
| Call You Mine - Keanu Silva Remix | The Chainsmokers | 2.8 | 60 | pop | 122 |
| Someone You Loved - Future Humans Remix | Lewis Capaldi | 3.1 | 69 | pop | 124 |
| Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 2.7 | 67 | pop | 125 |
spotify_songs %>%
  skim() %>%
  yank('numeric') %>%
  select(-p0, -p25, -p75, -p100) %>%
  kable()
| skim_variable | n_missing | complete_rate | mean | sd | p50 | hist |
|---|---|---|---|---|---|---|
| track_popularity | 0 | 1.00 | 42.48 | 24.98 | 45.00 | ▆▆▇▆▁ |
| danceability | 0 | 1.00 | 0.65 | 0.15 | 0.67 | ▁▁▃▇▃ |
| energy | 0 | 1.00 | 0.70 | 0.18 | 0.72 | ▁▁▅▇▇ |
| key | 0 | 1.00 | 5.37 | 3.61 | 6.00 | ▇▂▅▅▆ |
| loudness | 0 | 1.00 | -6.72 | 2.99 | -6.17 | ▁▁▁▂▇ |
| speechiness | 0 | 1.00 | 0.11 | 0.10 | 0.06 | ▇▂▁▁▁ |
| acousticness | 0 | 1.00 | 0.18 | 0.22 | 0.08 | ▇▂▁▁▁ |
| instrumentalness | 0 | 1.00 | 0.08 | 0.22 | 0.00 | ▇▁▁▁▁ |
| liveness | 0 | 1.00 | 0.19 | 0.15 | 0.13 | ▇▃▁▁▁ |
| valence | 0 | 1.00 | 0.51 | 0.23 | 0.51 | ▃▇▇▇▃ |
| tempo | 0 | 1.00 | 120.88 | 26.90 | 121.98 | ▁▂▇▂▁ |
| year | 1886 | 0.94 | 2012.20 | 10.40 | 2017.00 | ▁▁▁▁▇ |
| duration_min | 0 | 1.00 | 3.76 | 1.00 | 3.60 | ▁▇▇▁▁ |
spotify_songs %>%
  skim() %>%
  yank('character') %>%
  select(-whitespace, -empty) %>%
  kable()
| skim_variable | n_missing | complete_rate | min | max | n_unique |
|---|---|---|---|---|---|
| track_name | 5 | 1.00 | 1 | 144 | 23449 |
| track_artist | 5 | 1.00 | 2 | 69 | 10692 |
| track_album_name | 5 | 1.00 | 1 | 151 | 19743 |
| playlist_name | 0 | 1.00 | 6 | 120 | 449 |
| playlist_genre | 0 | 1.00 | 3 | 5 | 6 |
| playlist_subgenre | 0 | 1.00 | 4 | 25 | 24 |
| mode | 0 | 1.00 | 5 | 5 | 2 |
| era | 1886 | 0.94 | 8 | 13 | 3 |
spotify_songs %>%
  skim() %>%
  yank('Date') %>%
  kable()
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| track_album_release_date | 1886 | 0.94 | 1957-01-01 | 2020-01-29 | 2017-01-27 | 4453 |
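The 1886 missing release dates (and therefore the missing `year` and `era` values) may come from albums whose date is stored with only year or year-month precision, such as "2012", which cannot be parsed as a full date. That is an assumption about the raw data, not something checked in this post. Under that assumption, here is a minimal sketch that recovers the year from the first four characters; `release_year()` and the sample strings are hypothetical.

```r
# Hypothetical helper: take the release year from the leading four characters,
# so year-only ("2012") and year-month ("1998-03") strings still yield a year.
release_year <- function(x) {
  x <- as.character(x)
  # Only trust strings that begin with four digits; NA inputs give FALSE here
  ok <- grepl("^[0-9]{4}", x)
  out <- rep(NA_integer_, length(x))
  out[ok] <- as.integer(substr(x[ok], 1, 4))
  out
}

# Made-up sample values covering the three assumed formats plus a missing one
sample_dates <- c("2019-06-14", "2012", "1998-03", NA)
release_year(sample_dates)
```

If the assumption holds, applying such a helper to the raw `track_album_release_date` column before the `as.Date()` conversion would keep those songs in the era comparisons.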
# List the distinct tracks with a popularity score of 97 or higher
spotify_songs %>%
  select(track_name, track_artist, track_popularity) %>%
  arrange(-track_popularity) %>%
  distinct() %>%
  filter(track_popularity %in% c(97:100)) %>%
  kable()
| track_name | track_artist | track_popularity |
|---|---|---|
| Dance Monkey | Tones and I | 100 |
| ROXANNE | Arizona Zervas | 99 |
| Tusa | KAROL G | 98 |
| Memories | Maroon 5 | 98 |
| Blinding Lights | The Weeknd | 98 |
| Circles | Post Malone | 98 |
| The Box | Roddy Ricch | 98 |
| everything i wanted | Billie Eilish | 97 |
| Don’t Start Now | Dua Lipa | 97 |
| Falling | Trevor Daniel | 97 |
The most popular song in the dataset is Dance Monkey by Tones and I (popularity 100), followed by ROXANNE by Arizona Zervas (99) and then five songs tied at 98.
# Count songs per genre and mode, then draw one panel per genre
spotify_songs %>%
  count(playlist_genre, mode) %>%
  ggplot() +
  geom_bar(aes(mode, n, fill = mode), stat = "identity") +
  labs(x = 'Mode', y = 'Count') +
  facet_wrap(~playlist_genre, nrow = 2) +
  theme_minimal() +
  theme(legend.position = 'bottom')
In every genre in this dataset, major mode is more common than minor mode.
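Because the genres contain different numbers of songs, the comparison is easier to read as a share than as a raw count. Here is a minimal sketch of that calculation; `major_share()` is a hypothetical helper and the toy data is made up only to show the shape of the result.

```r
library(dplyr)

# Hypothetical helper: share of major-mode tracks within each genre.
# Expects columns `playlist_genre` and `mode` ("major" / "minor").
major_share <- function(songs) {
  songs %>%
    count(playlist_genre, mode) %>%
    group_by(playlist_genre) %>%
    mutate(prop = n / sum(n)) %>%   # proportion within the genre
    ungroup() %>%
    filter(mode == "major") %>%
    select(playlist_genre, prop)
}

# Made-up toy data, not taken from the Spotify dataset
toy <- tibble(
  playlist_genre = c("rock", "rock", "rock", "rap", "rap"),
  mode           = c("major", "major", "minor", "major", "minor")
)
major_share(toy)
```

Passing `spotify_songs` instead of `toy` should give one row per genre, with `prop` above 0.5 wherever major mode dominates.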
# Count songs per subgenre and mode, stacked by mode within each genre panel
spotify_songs %>%
  count(playlist_genre, mode, playlist_subgenre) %>%
  ggplot() +
  geom_bar(aes(reorder(playlist_subgenre, n), n, fill = mode),
           position = 'stack', stat = 'identity') +
  coord_flip() +
  facet_wrap(playlist_genre ~ ., scales = "free", nrow = 3) +
  labs(x = 'Sub Genre', y = 'Count') +
  theme_minimal() +
  theme(legend.position = 'bottom')
The rock subgenres are the most strongly dominated by major mode, while the rap subgenres are relatively balanced between major and minor.
# Count songs per era, leaving out songs with no release year
spotify_songs %>%
  count(era, sort = TRUE) %>%
  na.omit() %>%
  ggplot(aes(era, y = n, fill = era)) +
  geom_bar(stat = "identity") +
  labs(x = 'Era', y = 'Count') +
  theme_minimal()
Songs from the modern era (released after 2010) are the most numerous in this dataset.
# Tempo distribution per era, one panel per genre
spotify_songs %>%
  filter(!is.na(era)) %>%
  ggplot() +
  geom_boxplot(aes(era, tempo, fill = era), outlier.size = 0) +
  facet_grid(~playlist_genre) +
  coord_flip() +
  labs(x = 'Era', y = 'Tempo') +
  theme_minimal() +
  theme(legend.position = 'none')
Rap songs from the under 00s era have a slower tempo than rap songs from the later eras. EDM songs show little variation in tempo in every era.
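Reading medians and spreads off a boxplot is approximate, so a small numeric summary can back up the comparison. The sketch below assumes a data frame with `playlist_genre`, `era`, and `tempo` columns; `tempo_summary()` is a hypothetical helper and the toy values are made up.

```r
library(dplyr)

# Hypothetical helper: median and interquartile range of tempo
# for each genre and era, skipping songs without an era.
tempo_summary <- function(songs) {
  songs %>%
    filter(!is.na(era)) %>%
    group_by(playlist_genre, era) %>%
    summarise(
      n            = n(),
      median_tempo = median(tempo),
      iqr_tempo    = IQR(tempo),
      .groups      = "drop"
    )
}

# Made-up toy data, not taken from the Spotify dataset
toy <- tibble(
  playlist_genre = c("rap", "rap", "rap", "edm", "edm", "edm"),
  era            = c("under 00s era", "under 00s era", "modern era",
                     "modern era", "modern era", NA),
  tempo          = c(90, 96, 130, 126, 128, 125)
)
tempo_summary(toy)
```

A smaller `iqr_tempo` corresponds to a narrower box, which is what "little variation" means in the plot.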
# Duration density per genre and era; keep only songs with a known era
spotify_songs %>%
  filter(!is.na(era)) %>%
  ggplot() +
  geom_density(aes(duration_min, fill = playlist_genre)) +
  facet_grid(era ~ playlist_genre) +
  theme_minimal() +
  labs(x = 'Duration (min)') +
  theme(legend.position = 'none')
EDM songs from the 2010 era are relatively longer than EDM songs from the other eras, and rap songs from the modern era are relatively shorter than rap songs from the other eras. The remaining genres have roughly the same duration in each era.
# Reshape the audio features to long form and sum each feature within a genre
spotify_songs %>%
  select(playlist_genre, danceability, energy, speechiness, acousticness,
         instrumentalness, liveness, valence) %>%
  pivot_longer(-playlist_genre, names_to = 'status', values_to = 'value') %>%
  ggplot() +
  geom_bar(aes(status, value, fill = status), stat = 'identity') +
  coord_flip() +
  facet_wrap(~playlist_genre, nrow = 2) +
  theme_minimal() +
  labs(x = 'Audio feature', y = 'Summed value') +
  theme(legend.position = 'none')
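The chart above stacks one bar segment per song, so each bar shows the sum of a feature over all songs in a genre, and genres with more songs get longer bars. A per-genre mean may be easier to compare. Here is a minimal sketch of that alternative; `feature_means()` is a hypothetical helper and the toy values are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: mean of each audio feature within each genre.
# `features` is a character vector of column names to summarise.
feature_means <- function(songs, features) {
  songs %>%
    select(playlist_genre, all_of(features)) %>%
    pivot_longer(-playlist_genre, names_to = "feature", values_to = "value") %>%
    group_by(playlist_genre, feature) %>%
    summarise(mean_value = mean(value), .groups = "drop")
}

# Made-up toy data, not taken from the Spotify dataset
toy <- tibble(
  playlist_genre = c("pop", "pop", "rock"),
  danceability   = c(0.6, 0.8, 0.4),
  energy         = c(0.5, 0.7, 0.9)
)
feature_means(toy, c("danceability", "energy"))
```

The result has one row per genre and feature, and can be passed to the same `geom_bar(stat = 'identity')` call with `mean_value` in place of `value`.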
Amri Rohman.
Sidoarjo, East Java, ID