About Dataset


The data this week comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata about songs from Spotify’s API. Make sure to check out the spotifyr package website to see how you can collect your own data!

Kaylin Pavlik recently wrote a blog post that uses the audio features to explore and classify songs. She used the spotifyr package to collect about 5000 songs from 6 main categories (EDM, Latin, Pop, R&B, Rap, & Rock).

More Information About Dataset.

Loading Library and Data


library(tidyverse)
library(janitor)
library(knitr)
library(skimr)

spotify_songs_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

I used several packages for this task: tidyverse, janitor, skimr, and knitr.

Data Preparation


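spotify_songs <- spotify_songs_raw %>% 
  mutate(
    # Recode Spotify's 0/1 mode flag as a readable label
    mode = case_when(mode == 1 ~ "major",
                     mode == 0 ~ "minor"),
    # Parse the release date and derive the release year from it
    track_album_release_date = as.Date(track_album_release_date),
    year = as.numeric(format(track_album_release_date,'%Y')),
    # Bin release years into three eras (rows without a year get NA)
    era = case_when(year <= 2000 ~ "under 00s era",
                    year > 2000 & year <= 2010 ~ "2010 era",
                    year > 2010 ~ "modern era"),
    # Convert duration from milliseconds to minutes
    duration_min = duration_ms/60000) %>%
  # Drop identifier columns and the original millisecond duration
  select(-track_id, -track_album_id, -playlist_id, -duration_ms)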

spotify_songs %>% 
  select(track_name, track_artist, duration_min,
         track_popularity, playlist_genre, 
         tempo) %>%
  head() %>%
  kable()
| track_name | track_artist | duration_min | track_popularity | playlist_genre | tempo |
|:---|:---|---:|---:|:---|---:|
| I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 3.2 | 66 | pop | 122 |
| Memories - Dillon Francis Remix | Maroon 5 | 2.7 | 67 | pop | 100 |
| All the Time - Don Diablo Remix | Zara Larsson | 2.9 | 70 | pop | 124 |
| Call You Mine - Keanu Silva Remix | The Chainsmokers | 2.8 | 60 | pop | 122 |
| Someone You Loved - Future Humans Remix | Lewis Capaldi | 3.1 | 69 | pop | 124 |
| Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 2.7 | 67 | pop | 125 |
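
To make the era boundaries in the preparation step easier to check, here is a minimal sketch that applies the same case_when() rules to a handful of made-up release years. The toy_years and toy_eras objects are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical release years chosen to sit on the era boundaries
toy_years <- tibble(year = c(1999, 2000, 2001, 2010, 2011, NA))

# Same binning rules as in the data preparation step
toy_eras <- toy_years %>%
  mutate(era = case_when(year <= 2000 ~ "under 00s era",
                         year > 2000 & year <= 2010 ~ "2010 era",
                         year > 2010 ~ "modern era"))

print(toy_eras)
```

Under these rules 2000 still counts as "under 00s era" and 2010 still counts as "2010 era", and a missing year yields a missing era because no condition matches.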

Summary


Numeric Variable

spotify_songs %>% 
  skim() %>%
  yank('numeric') %>% 
  select(-p0, -p25, -p75, -p100) %>%
  kable() 
| skim_variable | n_missing | complete_rate | mean | sd | p50 | hist |
|:---|---:|---:|---:|---:|---:|:---|
| track_popularity | 0 | 1.00 | 42.48 | 24.98 | 45.00 | ▆▆▇▆▁ |
| danceability | 0 | 1.00 | 0.65 | 0.15 | 0.67 | ▁▁▃▇▃ |
| energy | 0 | 1.00 | 0.70 | 0.18 | 0.72 | ▁▁▅▇▇ |
| key | 0 | 1.00 | 5.37 | 3.61 | 6.00 | ▇▂▅▅▆ |
| loudness | 0 | 1.00 | -6.72 | 2.99 | -6.17 | ▁▁▁▂▇ |
| speechiness | 0 | 1.00 | 0.11 | 0.10 | 0.06 | ▇▂▁▁▁ |
| acousticness | 0 | 1.00 | 0.18 | 0.22 | 0.08 | ▇▂▁▁▁ |
| instrumentalness | 0 | 1.00 | 0.08 | 0.22 | 0.00 | ▇▁▁▁▁ |
| liveness | 0 | 1.00 | 0.19 | 0.15 | 0.13 | ▇▃▁▁▁ |
| valence | 0 | 1.00 | 0.51 | 0.23 | 0.51 | ▃▇▇▇▃ |
| tempo | 0 | 1.00 | 120.88 | 26.90 | 121.98 | ▁▂▇▂▁ |
| year | 1886 | 0.94 | 2012.20 | 10.40 | 2017.00 | ▁▁▁▁▇ |
| duration_min | 0 | 1.00 | 3.76 | 1.00 | 3.60 | ▁▇▇▁▁ |

Character Variable

spotify_songs %>% 
  skim() %>%
  yank('character') %>% 
  select(-whitespace, -empty) %>%
  kable() 
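| skim_variable | n_missing | complete_rate | min | max | n_unique |
|:---|---:|---:|---:|---:|---:|
| track_name | 5 | 1.00 | 1 | 144 | 23449 |
| track_artist | 5 | 1.00 | 2 | 69 | 10692 |
| track_album_name | 5 | 1.00 | 1 | 151 | 19743 |
| playlist_name | 0 | 1.00 | 6 | 120 | 449 |
| playlist_genre | 0 | 1.00 | 3 | 5 | 6 |
| playlist_subgenre | 0 | 1.00 | 4 | 25 | 24 |
| mode | 0 | 1.00 | 5 | 5 | 2 |
| era | 1886 | 0.94 | 8 | 13 | 3 |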

Date Variable

spotify_songs %>% 
  skim() %>%
  yank('Date') %>% 
  #select() %>%
  kable() 
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|:---|---:|---:|:---|:---|:---|---:|
| track_album_release_date | 1886 | 0.94 | 1957-01-01 | 2020-01-29 | 2017-01-27 | 4453 |
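
The 1886 missing release dates carry over to year and era. One plausible cause, which is an assumption here and not something checked against the raw file, is that some release dates are stored as a year or year-month only and therefore fail to parse as full dates. The sketch below uses made-up strings (release_raw is hypothetical) to show that behaviour, along with one possible way to keep the year anyway.

```r
# Hypothetical release-date strings: a full date, a year only, a year-month
release_raw <- c("2019-06-14", "2012", "1998-03")

# With an explicit format, anything that is not a full date becomes NA
release_date <- as.Date(release_raw, format = "%Y-%m-%d")

# The first four characters still carry the year for every entry
release_year <- as.numeric(substr(release_raw, 1, 4))

print(data.frame(release_raw, release_date, release_year))
```

Taking the year from the raw string would avoid losing those rows in the era plots, at the cost of assuming every entry starts with a four-digit year.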

Data Exploration


Which mode (major or minor) dominates in each genre?

spotify_songs %>%
  # Count songs for every genre and mode combination
  count(playlist_genre, mode) %>%
  ggplot() +
  geom_col(aes(mode, n, fill = mode)) +
  labs(x = 'Mode', y = 'Count') +
  facet_wrap(~playlist_genre, nrow = 2) +
  theme_minimal() +
  theme(legend.position = 'bottom')

The major mode is the more common mode in every genre in this dataset.
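
Raw counts make genres of different sizes hard to compare, so the share of each mode within a genre can be a useful complement to the bar chart. The following is a minimal sketch on made-up data (toy_songs is hypothetical), not part of the original analysis.

```r
library(dplyr)

# Hypothetical songs: four rock tracks and four rap tracks
toy_songs <- tibble(
  playlist_genre = c("rock", "rock", "rock", "rock", "rap", "rap", "rap", "rap"),
  mode           = c("major", "major", "major", "minor", "major", "major", "minor", "minor")
)

# Count each genre/mode pair, then turn counts into within-genre shares
mode_share <- toy_songs %>%
  count(playlist_genre, mode) %>%
  group_by(playlist_genre) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

print(mode_share)
```

The same pipeline should work on spotify_songs, since it only relies on the playlist_genre and mode columns.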

Which mode dominates in each subgenre?

spotify_songs %>%
  group_by(playlist_genre, mode) %>%
  count(playlist_subgenre) %>%
  ggplot() +
  geom_bar(aes(reorder(playlist_subgenre, n), n, fill=mode), position = 'stack', stat='identity') +
  coord_flip() +
  facet_wrap(playlist_genre~., scales="free", nrow=3) +
  labs(x='Sub Genre', y='Count') +
  theme_minimal() +
  theme(legend.position = 'bottom')

The rock subgenres show the strongest dominance of the major mode, while the rap subgenres are relatively balanced between major and minor.

Which era has the most songs on Spotify?

spotify_songs %>%
  group_by(era) %>%
  count(era, sort = TRUE) %>%
  na.omit() %>%
  ggplot(aes(era, y=n, fill=era)) +
  geom_bar(stat="identity") +
  labs(x='Era', y='Count') +
  theme_minimal()

Songs from the modern era (released after 2010) are the most numerous in the dataset.

How does tempo vary by genre and era?

spotify_songs %>%
  # Keep only songs whose era could be determined
  filter(!is.na(era)) %>%
  ggplot() +
  # outlier.shape = NA hides outlier points
  geom_boxplot(aes(era, tempo, fill = era), outlier.shape = NA) +
  facet_grid(~playlist_genre) + 
  coord_flip() +
  labs(x = 'Era', y = 'Tempo (BPM)') +
  theme_minimal() +
  theme(legend.position = 'none')

Rap songs from the under 00s era have a slower tempo than rap songs from the later eras. EDM songs show little variation in tempo in every era.
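
Readings from a boxplot can be backed with numbers by summarising the median and interquartile range of tempo per genre and era. This is a minimal sketch on made-up values (toy_tempo is hypothetical, and its numbers are not taken from the dataset).

```r
library(dplyr)

# Hypothetical tempos (BPM) for two genres and two eras
toy_tempo <- tibble(
  playlist_genre = rep(c("edm", "rap"), each = 4),
  era            = rep(c("under 00s era", "under 00s era",
                         "modern era", "modern era"), times = 2),
  tempo          = c(126, 128, 125, 129, 88, 96, 120, 140)
)

# Median shows the typical tempo, IQR shows how much it varies
tempo_summary <- toy_tempo %>%
  group_by(playlist_genre, era) %>%
  summarise(median_tempo = median(tempo),
            iqr_tempo    = IQR(tempo),
            .groups = "drop")

print(tempo_summary)
```

A small IQR in every era would correspond to the narrow boxes described for EDM, and a lower median would correspond to the slower tempo described for early rap.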

How is song duration distributed in each genre and era?

spotify_songs %>%
  # Keep only songs whose era could be determined
  filter(!is.na(era)) %>%
  ggplot() +
  geom_density(aes(duration_min, fill=playlist_genre)) +
  facet_grid(era~playlist_genre) +
  theme_minimal() +
  labs(x='Duration (min)') +
  theme(legend.position = 'none')

EDM songs from the 2010 era tend to run longer than EDM songs from the other eras, while rap songs from the modern era tend to be shorter than rap songs from the other eras. The remaining genres have roughly the same duration in every era.

How do the audio features compare across genres?

spotify_songs %>%
  # Keep the genre and the audio features that are on a 0 to 1 scale
  select(playlist_genre, danceability, energy, speechiness, acousticness,
         instrumentalness, liveness, valence) %>%
  pivot_longer(-playlist_genre, names_to = 'status', values_to = 'value') %>%
  ggplot() +
  # Bars stack one segment per song, so each bar shows the sum of a feature
  geom_col(aes(status, value, fill = status)) +
  coord_flip() +
  facet_wrap(~playlist_genre, nrow = 2) +
  theme_minimal() +
  labs(x = 'Audio feature', y = 'Sum of values') +
  theme(legend.position = 'none')
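
Because the bars add up the feature values of every song, their length also depends on how many songs a genre contains. One possible alternative, shown here as a sketch on made-up data and not as the method used above, is to compare the mean of each feature per genre. The toy_features table and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical feature values: three EDM songs and one rock song
toy_features <- tibble(
  playlist_genre = c("edm", "edm", "edm", "rock"),
  danceability   = c(0.6, 0.7, 0.8, 0.5),
  energy         = c(0.9, 0.8, 0.7, 0.6)
)

# Reshape to long format, then compute both the sum and the mean
feature_summary <- toy_features %>%
  pivot_longer(-playlist_genre, names_to = "status", values_to = "value") %>%
  group_by(playlist_genre, status) %>%
  summarise(total_value = sum(value),
            mean_value  = mean(value),
            .groups = "drop")

print(feature_summary)
```

In this toy example the EDM totals are inflated by the larger number of songs, while the means stay on the original 0 to 1 scale and can be compared directly between genres.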

Thank You

Amri Rohman.
Sidoarjo, East Java, ID