After recently completing DS4B 101: Business Analysis With R from Business Science, I’m feeling like a certified Tidyverse ninja and decided to take on a mini project of my own, analyzing my Spotify streaming history. As a Brazilian Zouk DJ and avid music junkie, I spend a lot of time discovering and listening to new music. Although Spotify is not the only music streaming service I use, it is still my favorite music platform for discovering new music and building personalized playlists.
While this is not an article about Brazilian Zouk, if you would like to learn about it, you can start here. This video also offers a short glimpse of what Brazilian Zouk partner dancing looks like. Notice the variety of music danced to in the short video.
While COVID-19 may have put a dent in Zouk activities recently, the scene here in the US and around the world is strong and growing.
This article was also inspired by a similar article I came across on Towards Data Science. You can check it out here.
To get my Spotify data, I had to request it via the privacy settings of my personal Spotify account. Check out this link for detailed instructions. When I requested my data, Spotify said to wait a few days, but luckily I received it much earlier than expected. Spotify sends you a zipped file with 9 JSON files containing data on everything from people you follow to details about your music libraries. For this analysis I’m only interested in my streaming history.
I’ll start by loading the packages I’ll need to wrangle and visualize the data.
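The package-loading step is not shown here. Judging only from the functions used later in the article (`fromJSON`, `mutate`, `ggplot`, `as_factor`, `ymd_hm`, `floor_date`), a plausible setup would look like the sketch below; the exact package list is an assumption.

```r
# Assumed package list, inferred from the functions used in this article
library(jsonlite)   # fromJSON() for reading the Spotify export
library(tidyverse)  # dplyr, ggplot2, tibble, forcats and the %>% pipe
library(lubridate)  # ymd_hm(), with_tz(), floor_date(), hour(), wday()
```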
Prior to this, I had never worked with JSON files. Luckily the jsonlite package makes it really easy.
My streaming history came in 2 separate files, so I’ll import both files separately and combine them.
# Load first file
stream_hist_file_0 <- fromJSON(txt = "Raw_Data/StreamingHistory0.json", flatten = TRUE)
# Load second file
stream_hist_file_1 <- fromJSON(txt = "Raw_Data/StreamingHistory1.json", flatten = TRUE)
# Combine both files
stream_hist_tbl <-
  rbind(stream_hist_file_1, stream_hist_file_0) %>%
  as_tibble()

| endTime | artistName | trackName | msPlayed |
|---|---|---|---|
| 2020-10-29 13:11 | J Balvin | LA CANCIÓN | 5610 |
| 2020-10-29 13:11 | J Balvin | LA CANCIÓN | 6421 |
| 2020-10-29 13:11 | Architrackz | Gitaren Body | 34901 |
| 2020-10-29 13:11 | Afoba Boyz | Gangsta | 2432 |
| 2020-10-29 13:11 | Noah Lunsi | Mamacita | 960 |
| 2020-10-29 13:11 | Noah Lunsi | Mamacita | 1514 |
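If your export contains more than two history files, the same import can be generalised. Here is a minimal sketch, assuming the files follow the `StreamingHistory<N>.json` naming seen above; `read_streaming_history` is a hypothetical helper and not part of the original analysis.

```r
library(jsonlite)

# Hypothetical helper: read every StreamingHistory<N>.json file in a folder
# and stack the results into a single data frame.
read_streaming_history <- function(dir) {
  files <- list.files(dir,
                      pattern = "^StreamingHistory[0-9]+\\.json$",
                      full.names = TRUE)
  # Each file is parsed into a data frame, then all of them are row-bound
  do.call(rbind, lapply(files, fromJSON, flatten = TRUE))
}
```

Calling `read_streaming_history("Raw_Data")` would then stand in for the two separate `fromJSON()` calls.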
The data contains 4 features:
endTime: date and time of when the stream ended.
artistName: name of the artist…duh.
trackName: title of music track or name of video.
msPlayed: how many milliseconds the track was listened to.
To work with endTime, I’ll need to convert the data type from chr to a date-time. I’ll also create new date, minutes and seconds columns.
stream_hist_tbl_2 <-
  stream_hist_tbl %>%
  # Convert endTime to date-time format, then shift to US Eastern time
  mutate_at("endTime", ymd_hm) %>% with_tz("US/Eastern") %>%
  # Create date, seconds and minutes columns from endTime and msPlayed
  mutate(date = floor_date(endTime, "day") %>% as_date(),
         seconds = round(msPlayed / 1000, 2),
         minutes = round(seconds / 60, 2)) %>%
  # Censor one track title
  mutate(trackName = case_when(
    trackName == "Don't Fuck With Me (feat. Era Wadi & CANCUN?)" ~
      "Don't F*$k With Me (feat. Era Wadi & CANCUN?)",
    TRUE ~ trackName
  ))
head(stream_hist_tbl_2)

| endTime | artistName | trackName | msPlayed | date | seconds | minutes |
|---|---|---|---|---|---|---|
| 2020-10-29 09:11:00 | J Balvin | LA CANCIÓN | 5610 | 2020-10-29 | 5.61 | 0.09 |
| 2020-10-29 09:11:00 | J Balvin | LA CANCIÓN | 6421 | 2020-10-29 | 6.42 | 0.11 |
| 2020-10-29 09:11:00 | Architrackz | Gitaren Body | 34901 | 2020-10-29 | 34.90 | 0.58 |
| 2020-10-29 09:11:00 | Afoba Boyz | Gangsta | 2432 | 2020-10-29 | 2.43 | 0.04 |
| 2020-10-29 09:11:00 | Noah Lunsi | Mamacita | 960 | 2020-10-29 | 0.96 | 0.02 |
| 2020-10-29 09:11:00 | Noah Lunsi | Mamacita | 1514 | 2020-10-29 | 1.51 | 0.03 |
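To see why 13:11 in the raw data shows up as 09:11 in the converted table, here is the same conversion applied to a single value. This is a small sketch: `ymd_hm()` parses into UTC by default, and treating the exported timestamps as UTC is the assumption the pipeline above relies on.

```r
library(lubridate)

# One raw endTime value, parsed as UTC (the default time zone of ymd_hm)
end_time_utc <- ymd_hm("2020-10-29 13:11")

# The same instant displayed on a US Eastern clock
end_time_local <- with_tz(end_time_utc, "US/Eastern")
format(end_time_local, "%Y-%m-%d %H:%M")

# msPlayed converted to seconds and minutes, rounded as in the pipeline above
ms_played <- 34901
seconds <- round(ms_played / 1000, 2)
minutes <- round(seconds / 60, 2)
```

Note that `with_tz()` only changes how the instant is displayed, not the instant itself.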
Now I have the data in the format I want. I can now begin to explore.
# First day
first_day <- min(stream_hist_tbl_2$endTime)
# Last day
last_day <- max(stream_hist_tbl_2$endTime)
last_day - first_day
## Time difference of 366.3014 days
I have streaming history for an entire year from February 27, 2020 to February 27, 2021.
# Count the number of unique artists
stream_hist_tbl_2 %>%
  select(artistName) %>%
  distinct() %>%
  count()

| n |
|---|
| 2740 |
This may seem like a huge number, but because of COVID-19 there have been no in-person Zouk activities since March 2020. As a result, I have not spent as much time as I usually do discovering new music and creating new playlists.
stream_hist_tbl_2 %>%
  group_by(artistName) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  arrange(desc(count)) %>%
  mutate(artistName = artistName %>% as_factor() %>% fct_rev()) %>%
  slice(1:20) %>%
  # Aesthetics
  ggplot(aes(count, artistName))+
  geom_point(size = 2, color = "black")+
  geom_segment(aes(x = 0, xend = count,
                   y = artistName, yend = artistName))+
  # Labels
  geom_label(aes(label = count), hjust = "inward", size = 3)+
  # Themes (theme_classic() goes first so it does not reset the axis text size)
  theme_classic()+
  theme(axis.text.y = element_text(size = 8))+
  # Formatting
  scale_x_continuous(expand = c(0.02, 0.02), limits = c(0, 450))+
  labs(title = "Top 20 Artists Streamed",
       subtitle = "Number of Times Streamed",
       y = "",
       x = "\nNumber of Times Streamed")

If you didn’t already know, I do love Dalex. This was interesting to see. Before looking at this data, if you had asked me who my most streamed artist was, I probably would have said Sickick.
Now I want to take a different approach. Recall that msPlayed stands for how many milliseconds the track was listened to, meaning that in some cases I may not have listened to the entire track. Sometimes I start playing a track, then click “Go to radio” so that Spotify uses the track currently playing to create a playlist of similar tracks, then I skip through the songs in the new playlist to find what sounds good to me.
Now I’m going to filter for songs I streamed for more than 3 minutes, meaning I listened to most of the song, if not the whole thing.
stream_hist_tbl_2 %>%
  filter(minutes > 3.0) %>%
  group_by(artistName) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  arrange(desc(count)) %>%
  mutate(artistName = artistName %>% as_factor() %>% fct_rev()) %>%
  slice(1:20) %>%
  # Aesthetics
  ggplot(aes(count, artistName))+
  geom_point(size = 2, color = "black")+
  geom_segment(aes(x = 0, xend = count,
                   y = artistName, yend = artistName))+
  # Labels
  geom_label(aes(label = count), hjust = "inward", size = 3)+
  # Themes (theme_classic() goes first so it does not reset the axis text size)
  theme_classic()+
  theme(axis.text.y = element_text(size = 8))+
  # Formatting
  scale_x_continuous(expand = c(0.02, 0.02), limits = c(0, 115))+
  labs(title = "Top 20 Artists Streamed",
       subtitle = "Number of Times Streamed Longer Than 3 Minutes",
       y = "",
       x = "\nNumber of Times Streamed")

Not surprisingly, Dalex, Darell and Justin Quiles are still the top 3.
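The two charts above differ only in the minimum stream length, so the shared steps could be wrapped in a function. This is a sketch only: `plot_top_artists`, `min_minutes` and `n_top` are hypothetical names that do not appear in the original analysis, and the function expects a data frame with `artistName` and `minutes` columns.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical helper: lollipop chart of the most-streamed artists,
# counting only streams longer than `min_minutes`.
plot_top_artists <- function(data, min_minutes = 0, n_top = 20) {
  data %>%
    filter(minutes > min_minutes) %>%
    # Number of streams per artist, largest first
    count(artistName, name = "count", sort = TRUE) %>%
    slice_head(n = n_top) %>%
    # Reverse the factor order so the top artist is drawn at the top
    mutate(artistName = artistName %>% as_factor() %>% fct_rev()) %>%
    ggplot(aes(count, artistName)) +
    geom_segment(aes(x = 0, xend = count,
                     y = artistName, yend = artistName)) +
    geom_point(size = 2, color = "black") +
    geom_label(aes(label = count), hjust = "inward", size = 3) +
    theme_classic() +
    theme(axis.text.y = element_text(size = 8)) +
    labs(title = paste("Top", n_top, "Artists Streamed"),
         y = "",
         x = "\nNumber of Times Streamed")
}
```

With this in place, `plot_top_artists(stream_hist_tbl_2)` and `plot_top_artists(stream_hist_tbl_2, min_minutes = 3)` would correspond to the two charts, minus the hand-tuned axis limits and subtitles.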
# Dataframe of streams lasting at least 3 minutes
stream_hist_3_mins <-
  stream_hist_tbl_2 %>%
  filter(minutes >= 3.0)
# Function to filter for an artist and return their 5 most streamed tracks
func_artist_filter <- function(dataframe, artist_name){
  artist_name_quosure <- rlang::enquo(artist_name)
  dataframe %>%
    filter(artistName == !! artist_name_quosure) %>%
    group_by(artistName, trackName) %>%
    summarise(count = n()) %>%
    ungroup() %>%
    arrange(desc(count)) %>%
    slice(1:5) %>%
    mutate(trackName = trackName %>% as_factor() %>% fct_rev())
}
# Filter for each of the top 3 artists
dalex_top_songs <- func_artist_filter(dataframe = stream_hist_3_mins,
                                      artist_name = "Dalex")
darell_top_songs <- func_artist_filter(dataframe = stream_hist_3_mins,
                                       artist_name = "Darell")
justin_top_songs <- func_artist_filter(dataframe = stream_hist_3_mins,
                                       artist_name = "Justin Quiles")
# Combine top songs
rbind(justin_top_songs, darell_top_songs, dalex_top_songs) %>%
  # Aesthetics
  ggplot(aes(count, trackName, color = artistName))+
  geom_point(size = 2)+
  geom_segment(aes(x = 0, xend = count,
                   y = trackName, yend = trackName))+
  # Labels
  geom_label(aes(label = count), hjust = "inward", size = 4)+
  # Theme & formatting
  theme_classic()+
  labs(title = "Top Songs by Top 3 Artists",
       subtitle = "Number of Times Streamed",
       x = "\nNumber of Times Streamed",
       y = "",
       # The legend belongs to the color aesthetic, so title it via color
       color = "Artist Name")

Clearly I’ve been on a roll with Velitas, listening to the song 58 times since I first discovered it on September 20, 2020. In reality, you could say I’ve listened to the entire song about 115 times: my total minutes streamed for the song is 396 and the song is 3.45 minutes long, so 396/3.45 ≈ 115. Additionally, it’s been 166 days since I first heard the song, and 115/166 ≈ 0.7, so on average I’ve listened to Velitas about two days out of every three since September 20, 2020. If you’re reading this and you haven’t already, yeah, you should probably listen to Velitas.
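The back-of-the-envelope numbers for Velitas can be checked in a few lines. The three inputs are the figures quoted in the paragraph above (396 total minutes, a 3.45-minute track, 166 days since the first listen); the variable names are just for illustration.

```r
total_minutes <- 396    # total minutes streamed for the song
track_minutes <- 3.45   # length of the song in minutes
days_elapsed  <- 166    # days since the first listen

# Equivalent number of complete plays
full_plays <- round(total_minutes / track_minutes)

# Average complete plays per day over the period
plays_per_day <- round(full_plays / days_elapsed, 2)

full_plays      # 115
plays_per_day   # 0.69
```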
# Get weekly streaming data
hours_streamed_tbl <-
  stream_hist_tbl_2 %>%
  # Group by the start of the week each date falls in
  group_by(date = floor_date(date, "week")) %>%
  summarise(hours_streamed = sum(minutes)/60) %>%
  arrange(date)
# Plot
hours_streamed_tbl %>%
  ggplot(aes(date, hours_streamed))+
  geom_col(aes(fill = hours_streamed))+
  geom_hline(yintercept = 6.0, size = 0.2)+
  theme_classic()+
  scale_fill_viridis_c(option = "magma", direction = -1)+
  labs(title = "Hours Streamed - Weekly",
       x = "\nDate (Weekly)",
       y = "Hours Streamed",
       fill = "Hours Streamed")

There are about 7 weeks that stand out, with more than 6 hours streamed. Usually when I discover a new song, I’ll probably be listening to it for the entire week. For the weeks of September 20, 2020 and September 27, 2020, we already know what I was listening to: Velitas! Let’s see what other songs I was streaming in the other weeks.
Week of April 26, 2020:
# Create a function to filter stream history for a top week and return the top 5 songs streamed
func_top_songs_top_weeks <- function(dataframe, week_of){
  dataframe %>%
    # Keep rows whose date equals week_of (an exact-day match)
    filter(date == week_of) %>%
    group_by(date = floor_date(date, "week"), artistName, trackName) %>%
    summarise(hours_streamed = round(sum(minutes)/60, 2)) %>%
    ungroup() %>%
    # slice_max() keeps ties, so more than 5 rows can come back
    slice_max(hours_streamed, n = 5) %>%
    select(-date)
}
# Top songs, week of 2020-04-26
func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-04-26")

| artistName | trackName | hours_streamed |
|---|---|---|
| Justin Quiles | DJ No Pare | 0.40 |
| Dalex | Bellaquita | 0.30 |
| Dalex | Vuelva A Ver - Remix | 0.23 |
| Justin Quiles | Comerte A Besos | 0.15 |
| J Balvin | COMO UN BEBÉ | 0.12 |
| Piso 21 | Mami | 0.12 |
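One thing to keep in mind: `filter(date == week_of)` matches a single calendar day, so the table reflects streams on the date passed in. If the goal is every stream in that week, a variant along these lines might be closer. This is a sketch under that assumption; `top_songs_for_week` and `n_top` are hypothetical names, and the tables in this article were not produced with it.

```r
library(dplyr)
library(lubridate)

# Hypothetical variant: keep every stream in the week containing `week_of`
# (weeks start on Sunday, lubridate's default), then rank tracks by hours.
top_songs_for_week <- function(dataframe, week_of, n_top = 5) {
  week_start <- floor_date(as_date(week_of), "week")
  dataframe %>%
    filter(floor_date(date, "week") == week_start) %>%
    group_by(artistName, trackName) %>%
    summarise(hours_streamed = round(sum(minutes) / 60, 2), .groups = "drop") %>%
    slice_max(hours_streamed, n = n_top)
}
```

Because the week is derived with `floor_date()`, any day inside the week can be passed as `week_of`.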
Week of June 14, 2020:
# Top songs, week of 2020-06-14
func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-06-14")

| artistName | trackName | hours_streamed |
|---|---|---|
| LAYNE | Midnight | 0.21 |
| Michael Kiwanuka | Love & Hate | 0.12 |
| Jessie Ware | First Time | 0.11 |
| Ya Levis | Ça ne me touche pas | 0.10 |
| Ne-Yo | U 2 Luv | 0.08 |
| RHODES | Let It All Go | 0.08 |
Week of July 12, 2020:
# Top songs, week of 2020-07-12
func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-07-12")

| artistName | trackName | hours_streamed |
|---|---|---|
| Busy Signal | Jamaica Jamaica | 0.52 |
| Ne-Yo | U 2 Luv | 0.27 |
| Rationale | Hurts the Most | 0.26 |
| HEDEGAARD | JUMANJI (feat. CANCUN?) | 0.20 |
| Foy | Give Me the Night | 0.17 |
| HEDEGAARD | Don’t F*$k With Me (feat. Era Wadi & CANCUN?) | 0.17 |
Week of September 20, 2020:
# Top songs, week of 2020-09-20
func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-09-20")

| artistName | trackName | hours_streamed |
|---|---|---|
| Darell | Velitas | 0.49 |
| Sech | Boomerang | 0.42 |
| Justin Quiles | DJ No Pare | 0.17 |
| INNA | Read My Lips | 0.12 |
| Mika Mendes | Tell Me Baby (feat. Chachi Carvalho) | 0.10 |
Week of September 27, 2020:
# Top songs, week of 2020-09-27
func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-09-27")

| artistName | trackName | hours_streamed |
|---|---|---|
| Justin Quiles | DJ No Pare | 0.49 |
| J Balvin | ODIO | 0.35 |
| Sofco | KNFKKRFK | 0.28 |
| Darell | Velitas | 0.26 |
| Mau y Ricky | BOTA FUEGO | 0.20 |
Week of October 18, 2020:
# Top songs, week of 2020-10-18
func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-10-18")

| artistName | trackName | hours_streamed |
|---|---|---|
| Llane | Será | 0.36 |
| Darell | Velitas | 0.22 |
| Maluma | Parce (feat. Justin Quiles) | 0.13 |
| Sergio | Tropa | 0.07 |
| Foy | Give Me the Night | 0.04 |
Week of January 03, 2021:
# Top songs, week of 2021-01-03
func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2021-01-04")

| artistName | trackName | hours_streamed |
|---|---|---|
| Quique | Candela | 0.40 |
| Kelly Kiara | Set Me Up | 0.27 |
| Dan + Shay | Tequila | 0.20 |
| Haley Smalls | Lie Pon Mi | 0.11 |
| Ginette Claudette | Love Me Back | 0.10 |
# Hourly Streaming Heatmap
stream_hist_tbl_2 %>%
  mutate(year = year(date),
         month = month(date, label = TRUE),
         day = day(date),
         hour = hour(endTime)) %>%
  select(-endTime, -artistName, -trackName) %>%
  group_by(date, hour) %>%
  summarise(minutes_streamed = sum(minutes)) %>%
  ungroup() %>%
  # Aesthetics
  ggplot(aes(hour, date, fill = minutes_streamed))+
  geom_tile()+
  # Theme
  theme_classic()+
  # Formatting
  scale_x_continuous(breaks = seq(0, 23, by = 1))+
  scale_fill_viridis_c(option = "magma", direction = -1)+
  labs(title = "Hourly Streaming Trend by Months",
       x = "Hour of Day", y = "Month\n", fill = "Minutes Streamed")

I have streaming activity throughout the day from 7am till about 11pm, with heavier activity from about 5pm.
stream_hist_tbl_2 %>%
  mutate(year = year(date),
         month = month(date, label = TRUE),
         day = wday(date, label = TRUE),
         hour = hour(endTime)) %>%
  select(-endTime, -artistName, -trackName) %>%
  group_by(day, hour) %>%
  summarise(minutes_streamed = sum(minutes)) %>%
  ungroup() %>%
  mutate(day = day %>% as_factor()) %>%
  # Aesthetics
  ggplot(aes(hour, day, fill = minutes_streamed))+
  geom_tile()+
  # Themes
  theme_classic()+
  # Formatting
  scale_x_continuous(breaks = seq(0, 23, by = 1))+
  scale_fill_viridis_c(option = "magma", direction = -1)+
  labs(title = "Hourly Streaming Trend by Day of Week",
       x = "Hour of Day", y = "Day of Week\n", fill = "Minutes Streamed")

There is more streaming activity on Saturdays and Sundays from about 10am, and also on Mondays from 5pm.
Thanks for reading Part 1. In Part 2, I’ll be exploring the spotifyr package. This package lets you connect to the Spotify API and extract some more interesting data about tracks.
Feel free to follow me on Spotify here. You can also check out some of my playlists below: