About Dataset

The data this week comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. Make sure to check out the spotifyr package website to see how you can collect your own data!

Kaylin Pavlik wrote a recent blog post that uses the audio features to explore and classify songs. She used the spotifyr package to collect about 5000 songs from each of 6 main categories (EDM, Latin, Pop, R&B, Rap, & Rock).

More Information About Dataset.

Loading Library and Data

library(tidyverse)
library(janitor)
library(knitr)
library(skimr)

spotify_songs_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

I used several packages for this analysis: tidyverse, janitor, skimr, and knitr.

Data Preparation

spotify_songs <- spotify_songs_raw %>% 
  mutate(
    mode = case_when(mode == 1 ~ "major",
                     mode == 0 ~ "minor"),
    track_album_release_date = as.Date(track_album_release_date),
    year = as.numeric(format(track_album_release_date,'%Y')),
    era = case_when(year <= 2000 ~ "under 00s era",
                    year > 2000 & year <= 2010 ~ "2010 era",
                    year > 2010 ~ "modern era"),
    duration_min = duration_ms/60000) %>%
  select(-track_id, -track_album_id, -playlist_id, -duration_ms)

spotify_songs %>% 
  select(track_name, track_artist, duration_min,
         track_popularity, playlist_genre, 
         tempo) %>%
  head() %>%
  kable()

track_name	track_artist	duration_min	track_popularity	playlist_genre	tempo
I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	3.2	66	pop	122
Memories - Dillon Francis Remix	Maroon 5	2.7	67	pop	100
All the Time - Don Diablo Remix	Zara Larsson	2.9	70	pop	124
Call You Mine - Keanu Silva Remix	The Chainsmokers	2.8	60	pop	122
Someone You Loved - Future Humans Remix	Lewis Capaldi	3.1	69	pop	124
Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	2.7	67	pop	125
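
To make the era boundaries explicit, the sketch below applies the same case_when() rules to a few hand-picked years. The years vector is illustrative and not taken from the dataset. Note that a release in 2000 counts as "under 00s era", a release in 2010 counts as "2010 era", and a missing year stays missing.

```r
library(dplyr)

# Illustrative years only, chosen to sit on the boundaries used above
years <- c(1999, 2000, 2001, 2010, 2011, NA)

# Same rules as in the data preparation step
era <- case_when(years <= 2000 ~ "under 00s era",
                 years > 2000 & years <= 2010 ~ "2010 era",
                 years > 2010 ~ "modern era")

# Side-by-side view of each year and the era it falls into
tibble(year = years, era = era)
```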

Summary

Numeric Variable

spotify_songs %>% 
  skim() %>%
  yank('numeric') %>% 
  select(-p0, -p25, -p75, -p100) %>%
  kable()

skim_variable	n_missing	complete_rate	mean	sd	p50	hist
track_popularity	0	1.00	42.48	24.98	45.00	▆▆▇▆▁
danceability	0	1.00	0.65	0.15	0.67	▁▁▃▇▃
energy	0	1.00	0.70	0.18	0.72	▁▁▅▇▇
key	0	1.00	5.37	3.61	6.00	▇▂▅▅▆
loudness	0	1.00	-6.72	2.99	-6.17	▁▁▁▂▇
speechiness	0	1.00	0.11	0.10	0.06	▇▂▁▁▁
acousticness	0	1.00	0.18	0.22	0.08	▇▂▁▁▁
instrumentalness	0	1.00	0.08	0.22	0.00	▇▁▁▁▁
liveness	0	1.00	0.19	0.15	0.13	▇▃▁▁▁
valence	0	1.00	0.51	0.23	0.51	▃▇▇▇▃
tempo	0	1.00	120.88	26.90	121.98	▁▂▇▂▁
year	1886	0.94	2012.20	10.40	2017.00	▁▁▁▁▇
duration_min	0	1.00	3.76	1.00	3.60	▁▇▇▁▁

Character Variable

spotify_songs %>% 
  skim() %>%
  yank('character') %>% 
  select(-whitespace, -empty) %>%
  kable()

skim_variable	n_missing	complete_rate	min	max	n_unique
track_name	5	1.00	1	144	23449
track_artist	5	1.00	2	69	10692
track_album_name	5	1.00	1	151	19743
playlist_name	0	1.00	6	120	449
playlist_genre	0	1.00	3	5	6
playlist_subgenre	0	1.00	4	25	24
mode	0	1.00	5	5	2
era	1886	0.94	8	13	3

Date Variable

spotify_songs %>% 
  skim() %>%
  yank('Date') %>% 
  #select() %>%
  kable()

skim_variable	n_missing	complete_rate	min	max	median	n_unique
track_album_release_date	1886	0.94	1957-01-01	2020-01-29	2017-01-27	4453
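
The 1886 missing values in track_album_release_date, year, and era share one cause: the date conversion returned NA for those rows. A plausible explanation, which is an assumption here and not something the source states, is that some release dates are stored at year or year-month precision. The sketch below uses made-up date strings to show the effect and one possible way to recover the year from the first four characters.

```r
# Made-up release dates at day, year, and month precision (an assumption
# about how the raw column may look, not values taken from the dataset)
release_date <- c("2019-06-14", "2012", "1999-03")

# Strings that are not full year-month-day dates become NA
parsed <- as.Date(release_date, format = "%Y-%m-%d")
is.na(parsed)

# The first four characters still carry the year in every case
year <- as.numeric(substr(release_date, 1, 4))
year
```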

Data Exploration

What is the most popular song?

spotify_songs %>%
  select(track_name, track_artist, track_popularity) %>%
  arrange(-track_popularity) %>%
  distinct() %>%
  filter(track_popularity %in% c(97:100)) %>% 
  kable()

track_name	track_artist	track_popularity
Dance Monkey	Tones and I	100
ROXANNE	Arizona Zervas	99
Tusa	KAROL G	98
Memories	Maroon 5	98
Blinding Lights	The Weeknd	98
Circles	Post Malone	98
The Box	Roddy Ricch	98
everything i wanted	Billie Eilish	97
Don’t Start Now	Dua Lipa	97
Falling	Trevor Daniel	97

The most popular song on Spotify in this dataset is Dance Monkey by Tones and I, followed by ROXANNE by Arizona Zervas and the others listed above.
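
Because the same track can appear in several playlists, the table relies on distinct() to collapse repeats before filtering. A minimal sketch of the same idea on a toy tibble (the rows and scores are invented), using slice_max() from dplyr 1.0.0 or later in place of the hard-coded 97:100 range:

```r
library(dplyr)

# Toy data: "Song A" appears twice, as it would if it sat in two playlists
toy_songs <- tibble(
  track_name       = c("Song A", "Song A", "Song B", "Song C", "Song D"),
  track_artist     = c("Artist 1", "Artist 1", "Artist 2", "Artist 3", "Artist 4"),
  track_popularity = c(90, 90, 95, 70, 80)
)

top_songs <- toy_songs %>%
  # collapse repeated rows for the same track first
  distinct(track_name, track_artist, track_popularity) %>%
  # then keep the three highest popularity scores
  slice_max(track_popularity, n = 3)

top_songs
```

With ties at the cutoff, slice_max() keeps all tied rows by default, so the result can hold more than n rows.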

Which mode/modality dominates in each genre?

spotify_songs %>%
  group_by(mode, playlist_genre) %>%
  count(mode, sort = TRUE) %>%
  ggplot() +
  geom_bar(aes(mode, n, fill=mode), stat="identity") +
  labs(x='Mode', y='Count') +
  facet_wrap(~playlist_genre, nrow = 2)+
  theme_minimal() +
  theme(legend.position = 'bottom')

The major mode/modality is the more common one in every genre on Spotify.
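
Raw counts depend on how many songs each genre has, so shares can be easier to compare across panels. A sketch on invented data, computing the share of each mode within a genre (the toy_songs rows are hypothetical):

```r
library(dplyr)

# Invented rows: four pop songs and two rap songs
toy_songs <- tibble(
  playlist_genre = c("pop", "pop", "pop", "pop", "rap", "rap"),
  mode           = c("major", "major", "major", "minor", "major", "minor")
)

mode_share <- toy_songs %>%
  # one row per genre and mode with its song count
  count(playlist_genre, mode) %>%
  # divide each count by the total for its genre
  group_by(playlist_genre) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

mode_share
```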

Which mode/modality dominates in each subgenre?

spotify_songs %>%
  group_by(playlist_genre, mode) %>%
  count(playlist_subgenre) %>%
  ggplot() +
  geom_bar(aes(reorder(playlist_subgenre, n), n, fill=mode), position = 'stack', stat='identity') +
  coord_flip() +
  facet_wrap(playlist_genre~., scales="free", nrow=3) +
  labs(x='Sub Genre', y='Count') +
  theme_minimal() +
  theme(legend.position = 'bottom')

The rock subgenres show a stronger dominance of the major mode/modality than the other subgenres. The rap subgenres have a relatively balanced split between the two modes.

Which era has the most songs on Spotify?

spotify_songs %>%
  group_by(era) %>%
  count(era, sort = TRUE) %>%
  na.omit() %>%
  ggplot(aes(era, y=n, fill=era)) +
  geom_bar(stat="identity") +
  labs(x='Era', y='Count') +
  theme_minimal()

Songs from the modern era (released after 2010) are the most numerous on Spotify.

How does tempo vary across genres and eras?

spotify_songs %>%
  group_by(playlist_genre, era) %>%
  na.omit() %>%
  ggplot() +
  geom_boxplot(aes(era, tempo, fill=era), outlier.size=0) +
  facet_grid(~playlist_genre) + 
  coord_flip() +
  labs(x='Era', y='Tempo') +
  theme_minimal() +
  theme(legend.position = 'none')

Rap songs from the under 00s era have a slower tempo than rap songs from the later eras. EDM songs show little variation in tempo in every era.
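
The boxplot reading can be backed by numbers: the median gives the typical tempo and the interquartile range gives the spread. A minimal sketch on invented tempo values (not taken from the dataset):

```r
library(dplyr)

# Invented tempos: a tight EDM group and a widely spread rap group
toy_songs <- tibble(
  playlist_genre = c("edm", "edm", "edm", "rap", "rap", "rap"),
  era            = rep("modern era", 6),
  tempo          = c(124, 126, 128, 80, 100, 140)
)

tempo_summary <- toy_songs %>%
  group_by(playlist_genre, era) %>%
  summarise(median_tempo = median(tempo),  # typical tempo
            iqr_tempo    = IQR(tempo),     # spread of the middle half
            .groups = "drop")

tempo_summary
```

A small iqr_tempo for a genre in every era would correspond to the narrow boxes seen for EDM.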

How is duration distributed across genres and eras?

spotify_songs %>%
  # drop only the rows without an era; na.omit(era) removed rows with an NA in any column
  filter(!is.na(era)) %>%
  ggplot() +
  geom_density(aes(duration_min, fill=playlist_genre)) +
  facet_grid(era~playlist_genre) +
  theme_minimal() +
  labs(x='Duration (min)') +
  theme(legend.position = 'none')

EDM songs from the 2010 era are relatively longer than EDM songs from the other eras. Rap songs from the modern era are relatively shorter than rap songs from the other eras. The other genres have roughly the same duration in each era.

How do the audio features compare across genres?

spotify_songs %>%
  # keep the genre and the audio features that share a 0-1 scale
  select(playlist_genre, danceability, energy, speechiness, acousticness,
         instrumentalness, liveness, valence) %>%
  gather(key=status, value=value, -playlist_genre) %>%
  ggplot() +
  # stat='identity' stacks every song, so each bar is the sum over the genre
  geom_bar(aes(status, value, fill=status), stat='identity') +
  coord_flip() +
  facet_wrap(~playlist_genre, nrow=2) +
  theme_minimal() +
  labs(x='Audio Feature', y='Total Value') +
  theme(legend.position = 'none')
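
The chart above stacks every song's value, so each bar is a sum and grows with the number of songs in the genre. A sketch of an alternative that averages each feature per genre, using pivot_longer() (the successor to gather()) on invented values:

```r
library(dplyr)
library(tidyr)

# Invented feature values for two songs per genre
toy_songs <- tibble(
  playlist_genre = c("pop", "pop", "rock", "rock"),
  danceability   = c(0.6, 0.8, 0.4, 0.6),
  energy         = c(0.7, 0.9, 0.8, 0.6)
)

feature_means <- toy_songs %>%
  # one row per song and feature
  pivot_longer(-playlist_genre, names_to = "feature", values_to = "value") %>%
  # average each feature within a genre
  group_by(playlist_genre, feature) %>%
  summarise(mean_value = mean(value), .groups = "drop")

feature_means
```

The result could be plotted with geom_col(aes(feature, mean_value)) so that genres with more songs do not get longer bars.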

Thank You

Amri Rohman.
Sidoarjo, East Java, ID

Tidytuesday - Spotify Song

21/5/2020