Analyzing Spotify Stream History of a Brazilian Zouk DJ Using R

After recently completing DS4B 101: Business Analysis With R from Business Science, I felt like a certified Tidyverse ninja and decided to take on a mini project of my own: analyzing my Spotify streaming history. As a Brazilian Zouk DJ and avid music junkie, I spend a lot of time discovering and listening to new music. Although Spotify is not the only music streaming service I use, it is still my favorite platform for discovering new music and building personalized playlists.

While this is not an article about Brazilian Zouk, if you would like to learn about it, you can start here. This video also offers a short glimpse of what Brazilian Zouk partner dancing looks like. Notice the variety of music danced to in the short video.

While COVID-19 may have put a dent in Zouk activities recently, the scene here in the US and around the world is strong and growing.

This article was also inspired by a similar article I came across on Towards Data Science. You can check it out here.

Getting My Spotify Data

To get my Spotify data, I had to request it via the privacy settings of my personal Spotify account. Check out this link for detailed instructions. When I requested my data, Spotify said to wait a few days; luckily, I received it much earlier than expected. Spotify sends you a zipped file with 9 JSON files containing data on everything from people you follow to details about your music libraries. For this analysis, I’m only interested in my streaming history.

Setting Up

I’ll start by loading the packages I need to wrangle and visualize the data.

# Libraries

library(jsonlite)
library(tidyverse)
library(lubridate)
library(printr)

Prior to this, I had never worked with JSON files. Luckily, the jsonlite library makes it really easy.

Loading Data

My streaming history came in 2 separate files, so I’ll import both files separately and combine them.

# Load first file
stream_hist_file_0 <- fromJSON(txt = "Raw_Data/StreamingHistory0.json", flatten = TRUE)

# Load second file
stream_hist_file_1 <- fromJSON(txt = "Raw_Data/StreamingHistory1.json", flatten = TRUE)

# Combine both files
stream_hist_tbl <- 
    rbind(stream_hist_file_1, stream_hist_file_0) %>% 
    as_tibble()

head(stream_hist_tbl)

endTime	artistName	trackName	msPlayed
2020-10-29 13:11	J Balvin	LA CANCIÓN	5610
2020-10-29 13:11	J Balvin	LA CANCIÓN	6421
2020-10-29 13:11	Architrackz	Gitaren Body	34901
2020-10-29 13:11	Afoba Boyz	Gangsta	2432
2020-10-29 13:11	Noah Lunsi	Mamacita	960
2020-10-29 13:11	Noah Lunsi	Mamacita	1514
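As a side note, the number of StreamingHistory files depends on how much you stream, so here is a minimal sketch of a loader that picks up however many files exist. It is not the code used for this article: the folder `demo_dir`, the two tiny demo files, and the name `stream_hist_demo` are all made up for illustration, and it assumes every file has the same four columns shown above.

```r
library(jsonlite)
library(dplyr)
library(purrr)

# Hypothetical demo folder holding two tiny files shaped like Spotify's export
demo_dir <- file.path(tempdir(), "spotify_demo")
dir.create(demo_dir, showWarnings = FALSE)

file_0 <- data.frame(endTime = "2020-10-29 13:11", artistName = "Afoba Boyz",
                     trackName = "Gangsta", msPlayed = 2432)
file_1 <- data.frame(endTime = "2020-10-29 13:11", artistName = "Noah Lunsi",
                     trackName = "Mamacita", msPlayed = 960)

write_json(file_0, file.path(demo_dir, "StreamingHistory0.json"))
write_json(file_1, file.path(demo_dir, "StreamingHistory1.json"))

# Find every StreamingHistory<N>.json file in the folder
history_files <- list.files(demo_dir,
                            pattern = "^StreamingHistory[0-9]+\\.json$",
                            full.names = TRUE)

# Read each file into a data frame, then stack the rows into one tibble
stream_hist_demo <- history_files %>%
    map(fromJSON, flatten = TRUE) %>%
    bind_rows() %>%
    as_tibble()

stream_hist_demo
```

With only two files, loading them one by one as I did above is just as easy. The pattern-based version only pays off when the export is split into more pieces.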

The data contains 4 features:

endTime: date and time of when the stream ended.

artistName: name of the artist…duh.

trackName: title of music track or name of video.

msPlayed: how many milliseconds the track was listened to.

To work with endTime, I’ll need to convert the data type from chr to a date-time. I’ll also create new date, minutes and seconds columns.

stream_hist_tbl_2 <- 
    stream_hist_tbl %>% 
    
    # Parse endTime as a UTC date-time, then convert it to US/Eastern
    mutate(endTime = ymd_hm(endTime) %>% with_tz("US/Eastern")) %>% 
    
    # Create minutes and seconds columns from the msPlayed feature
    mutate(date = floor_date(endTime, "day") %>% as_date, 
           seconds = round(msPlayed / 1000, 2), 
           minutes = round(seconds / 60, 2)) %>% 
    # Censor the expletive in one track title
    mutate(trackName = case_when(
      trackName == "Don't Fuck With Me (feat. Era Wadi & CANCUN?)" ~ 
        "Don't F*$k With Me (feat. Era Wadi & CANCUN?)",
      TRUE ~ trackName
    ))

head(stream_hist_tbl_2)

endTime	artistName	trackName	msPlayed	date	seconds	minutes
2020-10-29 09:11:00	J Balvin	LA CANCIÓN	5610	2020-10-29	5.61	0.09
2020-10-29 09:11:00	J Balvin	LA CANCIÓN	6421	2020-10-29	6.42	0.11
2020-10-29 09:11:00	Architrackz	Gitaren Body	34901	2020-10-29	34.90	0.58
2020-10-29 09:11:00	Afoba Boyz	Gangsta	2432	2020-10-29	2.43	0.04
2020-10-29 09:11:00	Noah Lunsi	Mamacita	960	2020-10-29	0.96	0.02
2020-10-29 09:11:00	Noah Lunsi	Mamacita	1514	2020-10-29	1.51	0.03
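To make the conversion step less magical, here is a small sketch on a single value taken from the tables above. It assumes, as the code above does, that Spotify records endTime in UTC; the variable names are just for illustration.

```r
library(lubridate)

# ymd_hm() parses the string as UTC unless told otherwise
end_utc <- ymd_hm("2020-10-29 13:11")

# with_tz() keeps the same instant but shows it on the US/Eastern clock
end_eastern <- with_tz(end_utc, "US/Eastern")
format(end_eastern, "%Y-%m-%d %H:%M")
# "2020-10-29 09:11"

# The same msPlayed arithmetic used above, applied to one row (34901 ms)
ms_played <- 34901
seconds   <- round(ms_played / 1000, 2)
minutes   <- round(seconds / 60, 2)
c(seconds = seconds, minutes = minutes)
# seconds 34.90, minutes 0.58
```

The four-hour shift is why the endTime values above move from 13:11 to 09:11 after the conversion.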

Now I have the data in the format I want. I can now begin to explore.

What’s the date range of the data?

# First day
first_day <- min(stream_hist_tbl_2$endTime)

# Last day
last_day <- max(stream_hist_tbl_2$endTime)

last_day - first_day

## Time difference of 366.3014 days

I have streaming history for an entire year from February 27, 2020 to February 27, 2021.

How many unique artists did I listen to in the past year?

# Count the number of unique artists

stream_hist_tbl_2 %>% 
    select(artistName) %>% 
    distinct() %>% 
    count()

n
2740

This may seem like a huge number. However, due to COVID-19, there have been no in-person Zouk activities since March 2020. As a result, I have not spent as much time as I usually do discovering new music and creating new playlists.
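For what it's worth, dplyr has shortcuts for this kind of counting. The sketch below uses a made-up `toy_streams` table (not my real data) to show `n_distinct()` for the unique-artist count and `count()` as a compact stand-in for the `group_by()` + `summarise(count = n())` + `arrange(desc(count))` pattern.

```r
library(dplyr)

# Made-up streams, for illustration only
toy_streams <- tibble(
    artistName = c("Dalex", "Darell", "Dalex", "Justin Quiles", "Dalex", "Darell")
)

# Number of distinct artists in one call
unique_artists <- n_distinct(toy_streams$artistName)
unique_artists
# 3

# Streams per artist, most-streamed first
artist_counts <- toy_streams %>%
    count(artistName, sort = TRUE, name = "count")
artist_counts
```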

What artists did I listen to the most in the past year?

stream_hist_tbl_2 %>% 
    group_by(artistName) %>% 
    summarise(count = n()) %>% 
    ungroup() %>% 
    arrange(desc(count)) %>% 
    mutate(artistName = artistName %>% as_factor() %>% fct_rev()) %>% 
    slice(1:20) %>% 
    
    # Aesthetics
    ggplot(aes(count, artistName))+
    geom_point(size = 2, color = "black")+
    geom_segment(aes(x = 0, xend = count, 
                     y = artistName, yend = artistName))+
    
    # Labels
    geom_label(aes(label = count), hjust = "inward", size = 3)+
    
    # Themes (theme_classic() goes first so it does not override the axis text size)
    theme_classic()+
    theme(axis.text.y = element_text(size = 8))+
    
    # Formatting
    scale_x_continuous(expand = c(0.02, 0.02), limits = c(0, 450))+
    labs(title = "Top 20 Artists Streamed",
         subtitle = "Number of Times Streamed",
         y = "",
         x = "\nNumber of Times Streamed")

If you didn’t already know, I do love Dalex. This was interesting to see. Before looking at this data, if you had asked me who my most streamed artist was, I probably would have said Sickick.

Now I want to take a different approach. Recall that msPlayed stands for how many milliseconds the track was listened to, meaning that in some cases I may not have listened to the entire track. Sometimes I start playing a track, then click “Go to radio” so that Spotify uses the current track to create a playlist with similar tracks, and then I skip through the songs in the new playlist to find what sounds good to me.

Now I’m going to filter for songs I streamed for more than 3 minutes, meaning I listened to most of the song, if not all of it.

stream_hist_tbl_2 %>% 
    filter(minutes > 3.0) %>% 
    group_by(artistName) %>% 
    summarise(count = n()) %>% 
    ungroup() %>% 
    arrange(desc(count)) %>% 
    mutate(artistName = artistName %>% as_factor() %>% fct_rev()) %>% 
    slice(1:20) %>% 
    
    # Aesthetics
    ggplot(aes(count, artistName))+
    geom_point(size = 2, color = "black")+
    geom_segment(aes(x = 0, xend = count, 
                     y = artistName, yend = artistName))+
    
    # Labels
    geom_label(aes(label = count), hjust = "inward", size = 3)+
    
    # Themes (theme_classic() goes first so it does not override the axis text size)
    theme_classic()+
    theme(axis.text.y = element_text(size = 8))+
    
    # Formatting
    scale_x_continuous(expand = c(0.02, 0.02), limits = c(0, 115))+
    labs(title = "Top 20 Artists Streamed",
         subtitle = "Number of Times Streamed Longer Than 3 Minutes",
         y = "",
         x = "\nNumber of Times Streamed")

Not surprisingly, Dalex, Darell and Justin Quiles are still the top 3.

What are my most streamed songs from these 3 artists?

# Dataframe of streams that lasted at least 3 minutes

stream_hist_3_mins <- 
    stream_hist_tbl_2 %>% 
    filter(minutes >= 3.0)

# Function to filter for artist
func_artist_filter <- function(dataframe, artist_name){
    
     # artist_name is a plain string, so it can be compared directly
     dataframe %>% 
        filter(artistName == artist_name) %>% 
        group_by(artistName, trackName) %>% 
        summarise(count = n()) %>% 
        ungroup() %>% 
        arrange(desc(count)) %>% 
        slice(1:5) %>% 
        mutate(trackName = trackName %>% as_factor() %>% fct_rev())
}

# Filter for each of the top 3 artists
dalex_top_songs <- func_artist_filter(dataframe = stream_hist_3_mins,
                                      artist_name = "Dalex")

darell_top_songs <- func_artist_filter(dataframe = stream_hist_3_mins,
                                      artist_name = "Darell")

justin_top_songs <- func_artist_filter(dataframe = stream_hist_3_mins,
                                      artist_name = "Justin Quiles")

# Combine top songs
rbind(justin_top_songs, darell_top_songs, dalex_top_songs) %>% 
    
    # Aesthetics
    ggplot(aes(count, trackName, color = artistName))+
    geom_point(size = 2)+
    geom_segment(aes(x = 0, xend = count, 
                     y = trackName, yend = trackName))+
    
    # Labels
    geom_label(aes(label = count), hjust = "inward", size = 4)+
    
    # Theme & formatting
    theme_classic()+
    labs(title = "Top Songs by Top 3 Artists",
         subtitle = "Number of Times Streamed",
         x = "\nNumber of Times Streamed", 
         y = "", 
         color = "Artist Name")

Clearly I’ve been on a roll with Velitas, listening to the song 58 times since I first discovered it on September 20, 2020. In reality, you can say I’ve listened to the entire song about 115 times: my total minutes streamed for the song is 396 and the song is 3.45 minutes long, so 396/3.45 ≈ 115. Additionally, it’s been 166 days since I first heard the song, and 115/166 ≈ 0.69, so you can also say I’ve listened to Velitas roughly two days out of every three since September 20, 2020. If you’re reading and you haven’t already, yeah, you should probably listen to Velitas.
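A quick way to check this kind of back-of-the-envelope math is to run it in R. The sketch below uses only the figures quoted in the paragraph above (396 total minutes, a 3.45 minute track, 166 days); the variable names are made up.

```r
total_minutes <- 396    # total minutes streamed, as quoted above
track_minutes <- 3.45   # track length in minutes, as quoted above
days_elapsed  <- 166    # days since the first listen, as quoted above

# Equivalent number of full plays
full_plays <- round(total_minutes / track_minutes)
full_plays
# 115

# Full plays per day over the period
plays_per_day <- round(full_plays / days_elapsed, 2)
plays_per_day
# 0.69
```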

Weekly streaming pattern since February 27, 2020

# Get weekly streaming data
hours_streamed_tbl <- 
    stream_hist_tbl_2 %>% 
    group_by(date = floor_date(date, "week")) %>% 
    summarise(hours_streamed = sum(minutes)/60) %>% 
    arrange(date) 

#  Plot
hours_streamed_tbl %>% 
    ggplot(aes(date, hours_streamed))+
    geom_col(aes(fill = hours_streamed))+
    geom_hline(yintercept = 6.0, size = 0.2)+
    theme_classic()+
    scale_fill_viridis_c(option = "magma", direction = -1)+
    labs(title = "Hours Streamed - Weekly",
         x = "\nDate (Weekly)",
         y = "Hours Streamed",
         fill = "Hours Streamed")

There are about 7 weeks that stand out with more than 6 hours streamed. Usually when I discover a new song, I’ll probably be listening to it for the entire week. For the weeks of September 20, 2020 and September 27, 2020, we already know what I was listening to: Velitas! Let’s see what other songs I was streaming in other weeks.

Week of April 26, 2020:

# Function to return the top 5 songs (ties included) by hours streamed for a given date

func_top_songs_top_weeks <- function(dataframe, week_of){
    
    dataframe %>% 
        # Note: this keeps only streams dated exactly `week_of`, so the totals
        # below cover that single day rather than the full seven days
        filter(date == week_of) %>% 
        group_by(date = floor_date(date, "week"), artistName, trackName) %>% 
        summarise(hours_streamed = round(sum(minutes)/60, 2)) %>% 
        ungroup() %>% 
        # slice_max() keeps ties, so more than 5 rows can come back
        slice_max(hours_streamed, n = 5) %>% 
        select(-date)
}

# Top songs, week of 2020-04-26
func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-04-26")

artistName	trackName	hours_streamed
Justin Quiles	DJ No Pare	0.40
Dalex	Bellaquita	0.30
Dalex	Vuelva A Ver - Remix	0.23
Justin Quiles	Comerte A Besos	0.15
J Balvin	COMO UN BEBÉ	0.12
Piso 21	Mami	0.12
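If you want totals for the full seven days rather than for a single date, one option is to compare the week each stream falls in against the week you are asking for. The sketch below is a hypothetical variant, not the function that produced the tables in this article: `top_songs_whole_week` and `toy_streams` are made-up names, the minutes are invented, and it assumes weeks start on Sunday.

```r
library(dplyr)
library(lubridate)

# Made-up streams, for illustration only
toy_streams <- tibble(
    date      = as_date(c("2021-01-03", "2021-01-04", "2021-01-04",
                          "2021-01-09", "2021-01-10")),
    trackName = c("Candela", "Candela", "Set Me Up", "Tequila", "Candela"),
    minutes   = c(3.0, 3.2, 2.8, 3.1, 3.0)
)

# floor_date(..., "week") maps a date to the Sunday that starts its week
week_start <- floor_date(as_date("2021-01-04"), "week", week_start = 7)
week_start
# "2021-01-03"

# Hypothetical variant: keep every stream whose week starts on `week_of`
top_songs_whole_week <- function(dataframe, week_of, n = 5) {
    dataframe %>%
        filter(floor_date(date, "week", week_start = 7) == as_date(week_of)) %>%
        group_by(trackName) %>%
        summarise(hours_streamed = round(sum(minutes) / 60, 2), .groups = "drop") %>%
        slice_max(hours_streamed, n = n)
}

whole_week <- top_songs_whole_week(toy_streams, week_of = "2021-01-03")
whole_week
```

In the toy data, the stream on January 10 belongs to the following week, so it is left out of the January 3 totals.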

Week of June 14, 2020:

# Top songs, week of 2020-06-14

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-06-14")

artistName	trackName	hours_streamed
LAYNE	Midnight	0.21
Michael Kiwanuka	Love & Hate	0.12
Jessie Ware	First Time	0.11
Ya Levis	Ça ne me touche pas	0.10
Ne-Yo	U 2 Luv	0.08
RHODES	Let It All Go	0.08

Week of July 12, 2020:

# Top songs, week of 2020-07-12

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-07-12")

artistName	trackName	hours_streamed
Busy Signal	Jamaica Jamaica	0.52
Ne-Yo	U 2 Luv	0.27
Rationale	Hurts the Most	0.26
HEDEGAARD	JUMANJI (feat. CANCUN?)	0.20
Foy	Give Me the Night	0.17
HEDEGAARD	Don’t F*$k With Me (feat. Era Wadi & CANCUN?)	0.17

Week of September 20, 2020:

# Top songs, week of 2020-09-20

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-09-20")

artistName	trackName	hours_streamed
Darell	Velitas	0.49
Sech	Boomerang	0.42
Justin Quiles	DJ No Pare	0.17
INNA	Read My Lips	0.12
Mika Mendes	Tell Me Baby (feat. Chachi Carvalho)	0.10

Week of September 27, 2020:

# Top songs, week of 2020-09-27

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-09-27")

artistName	trackName	hours_streamed
Justin Quiles	DJ No Pare	0.49
J Balvin	ODIO	0.35
Sofco	KNFKKRFK	0.28
Darell	Velitas	0.26
Mau y Ricky	BOTA FUEGO	0.20

Week of October 18, 2020:

# Top songs, week of 2020-10-18

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-10-18")

artistName	trackName	hours_streamed
Llane	Será	0.36
Darell	Velitas	0.22
Maluma	Parce (feat. Justin Quiles)	0.13
Sergio	Tropa	0.07
Foy	Give Me the Night	0.04

Week of January 03, 2021:

# Top songs, week of 2021-01-03 (the call below passes 2021-01-04, a date within that week)

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2021-01-04")

artistName	trackName	hours_streamed
Quique	Candela	0.40
Kelly Kiara	Set Me Up	0.27
Dan + Shay	Tequila	0.20
Haley Smalls	Lie Pon Mi	0.11
Ginette Claudette	Love Me Back	0.10

Hourly Streaming Pattern - Monthly

# Hourly Streaming Heatmap

stream_hist_tbl_2 %>% 
    mutate(year = year(date),
           month = month(date, label = TRUE),
           day = day(date), 
           hour = hour(endTime)) %>% 
    select(-endTime, -artistName, -trackName) %>% 
    group_by(date, hour) %>% 
    summarise(minutes_streamed = sum(minutes)) %>% 
    ungroup() %>% 
  
    # Aesthetics
    ggplot(aes(hour, date, fill = minutes_streamed))+
    geom_tile()+
    
    # Theme
    theme_classic()+
  
    # Formatting
    scale_x_continuous(breaks = seq(0, 23, by = 1))+
    scale_fill_viridis_c(option = "magma", direction = -1)+
    labs(title = "Hourly Streaming Trend by Months",
         x = "Hour of Day", y = "Month\n", fill = "Minutes Streamed")

I have streaming activity throughout the day from 7am till about 11pm, with heavier activity from about 5pm.

Hourly Streaming Pattern - Day of Week

stream_hist_tbl_2 %>% 
    mutate(year = year(date),
           month = month(date, label = TRUE),
           day = wday(date, label = TRUE), 
           hour = hour(endTime)) %>% 
    select(-endTime, -artistName, -trackName) %>% 
    group_by(day, hour) %>% 
    summarise(minutes_streamed = sum(minutes)) %>% 
    ungroup() %>% 
    mutate(day = day %>% as_factor()) %>% 
  
    # Aesthetics
    ggplot(aes(hour, day, fill = minutes_streamed))+
    geom_tile()+
  
    # Themes
    theme_classic()+
  
    # Formatting
    scale_x_continuous(breaks = seq(0, 23, by = 1))+
    scale_fill_viridis_c(option = "magma", direction = -1)+
    labs(title = "Hourly Streaming Trend by Day of Week",
         x = "Hour of Day", y = "Day of Week\n", fill = "Minutes Streamed")

There is more streaming activity on Saturdays and Sundays from about 10am, and also on Mondays from 5pm.
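Both heatmaps depend on bucketing each stream by hour and by day of the week. Here is a tiny sketch of what `hour()` and `wday()` return for a single timestamp from my data, again assuming the raw endTime is in UTC; the variable names are made up.

```r
library(lubridate)

# One timestamp from the data, converted the same way as above
end_time <- with_tz(ymd_hm("2020-10-29 13:11"), "US/Eastern")

# Hour of day (0-23) on the US/Eastern clock
stream_hour <- hour(end_time)
stream_hour
# 9

# Day of week as an ordered factor; Sunday is day 1 by default
stream_day <- wday(end_time, label = TRUE)
stream_day
```

Because the hour is taken after the time zone conversion, the heatmaps reflect my local listening hours rather than UTC.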

Thanks for reading Part 1. In Part 2, I’ll be exploring the spotifyr package. This package lets you connect to the Spotify API and extract some more interesting data about tracks.

Feel free to follow me on Spotify here. You can also check out some of my playlists below:

Zoukables Jan 2021

Zouk Nov 2019

Latin Mood