After recently completing DS4B 101: Business Analysis With R from Business Science, I’m feeling like a certified Tidyverse ninja, so I decided to take on a mini project of my own: analyzing my Spotify streaming history. As a Brazilian Zouk DJ and avid music junkie, I spend a lot of time discovering and listening to new music. Although Spotify is not the only music streaming service I use, it is still my favorite platform for finding new tracks and building personalized playlists.

While this is not an article about Brazilian Zouk, if you would like to learn about it, you can start here. This video also offers a short glimpse of what Brazilian Zouk partner dancing looks like. Notice the variety of music danced to in the short video.

While COVID-19 may have put a dent in Zouk activities recently, the scene here in the US and around the world is strong and growing.

This article was also inspired by a similar article I came across on Towards Data Science. You can check it out here.

Getting My Spotify Data

To get my Spotify data, I had to request it via the privacy settings of my personal Spotify account. Check out this link for detailed instructions. When I requested my data, Spotify said to wait a few days, but luckily I received it much earlier than expected. Spotify sends you a zipped file with 9 JSON files containing data on everything from people you follow to details about your music libraries. For this analysis, I’m only interested in my streaming history.

Setting Up

I’ll start by loading the packages I’ll need to wrangle and visualize the data.

# Libraries

library(jsonlite)
library(tidyverse)
library(lubridate)
library(printr)

Prior to this, I had never worked with JSON files. Luckily, the jsonlite library makes it really easy.

Loading Data

My streaming history came in 2 separate files, so I’ll import both files separately and combine them.

# Load first file
stream_hist_file_0 <- fromJSON(txt = "Raw_Data/StreamingHistory0.json", flatten = TRUE)

# Load second File
stream_hist_file_1 <- fromJSON(txt = "Raw_Data/StreamingHistory1.json", flatten = TRUE)

# Combine both files
stream_hist_tbl <- 
    rbind(stream_hist_file_1, stream_hist_file_0) %>% 
    as_tibble()
head(stream_hist_tbl)
endTime artistName trackName msPlayed
2020-10-29 13:11 J Balvin LA CANCIÓN 5610
2020-10-29 13:11 J Balvin LA CANCIÓN 6421
2020-10-29 13:11 Architrackz Gitaren Body 34901
2020-10-29 13:11 Afoba Boyz Gangsta 2432
2020-10-29 13:11 Noah Lunsi Mamacita 960
2020-10-29 13:11 Noah Lunsi Mamacita 1514
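If a Spotify export contains more than two streaming history files, loading them one by one gets tedious. Here is a minimal sketch of a more general approach, assuming the files follow the same StreamingHistoryN.json naming pattern. The folder, the two tiny demo files, and the object names (demo_dir, history_files, demo_hist_tbl) are made up for illustration so that the example runs on its own.

```r
library(jsonlite)
library(dplyr)

# Hypothetical demo data: two tiny files written to a temporary folder
demo_dir <- file.path(tempdir(), "spotify_demo")
dir.create(demo_dir, showWarnings = FALSE)

write_json(
  data.frame(endTime = "2020-10-29 13:11", artistName = "J Balvin",
             trackName = "ODIO", msPlayed = 5610),
  file.path(demo_dir, "StreamingHistory0.json")
)
write_json(
  data.frame(endTime = "2020-10-29 13:11", artistName = "Noah Lunsi",
             trackName = "Mamacita", msPlayed = 960),
  file.path(demo_dir, "StreamingHistory1.json")
)

# Find every streaming history file in the folder
history_files <- list.files(demo_dir,
                            pattern = "^StreamingHistory[0-9]+\\.json$",
                            full.names = TRUE)

# Read each file, then stack the resulting data frames into one tibble
demo_hist_tbl <- history_files %>%
  lapply(fromJSON, flatten = TRUE) %>%
  bind_rows() %>%
  as_tibble()

demo_hist_tbl
```

With real data, demo_dir would simply point at the folder holding the unzipped export, for example "Raw_Data".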

The data contains 4 features:

endTime: date and time of when the stream ended.

artistName: name of the artist…duh.

trackName: title of music track or name of video.

msPlayed: how many milliseconds the track was listened to.

To work with endTime, I’ll need to convert the data type from chr to date-time. I’ll also create new date, minutes and seconds columns.

stream_hist_tbl_2 <- 
    stream_hist_tbl %>% 
    
    # Convert endTime from character to date-time (parsed as UTC), then shift to Eastern time
    mutate(endTime = ymd_hm(endTime) %>% with_tz("US/Eastern")) %>% 
    
    # Create minutes and seconds columns from the msPlayed feature
    mutate(date = floor_date(endTime, "day") %>% as_date, 
           seconds = round(msPlayed / 1000, 2), 
           minutes = round(seconds / 60, 2)) %>% 
    mutate(trackName = case_when(
      trackName == "Don't Fuck With Me (feat. Era Wadi & CANCUN?)" ~ 
        "Don't F*$k With Me (feat. Era Wadi & CANCUN?)",
      TRUE ~ trackName
    ))

head(stream_hist_tbl_2)
endTime artistName trackName msPlayed date seconds minutes
2020-10-29 09:11:00 J Balvin LA CANCIÓN 5610 2020-10-29 5.61 0.09
2020-10-29 09:11:00 J Balvin LA CANCIÓN 6421 2020-10-29 6.42 0.11
2020-10-29 09:11:00 Architrackz Gitaren Body 34901 2020-10-29 34.90 0.58
2020-10-29 09:11:00 Afoba Boyz Gangsta 2432 2020-10-29 2.43 0.04
2020-10-29 09:11:00 Noah Lunsi Mamacita 960 2020-10-29 0.96 0.02
2020-10-29 09:11:00 Noah Lunsi Mamacita 1514 2020-10-29 1.51 0.03
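To make the conversions concrete, here is a small worked example using the values from the first row of the table. It assumes the export's timestamps are in UTC, which is what ymd_hm() assumes by default; the variable names are only for illustration.

```r
library(lubridate)

# Values taken from the first row of the streaming history
end_time_utc     <- ymd_hm("2020-10-29 13:11")            # parsed as UTC by default
end_time_eastern <- with_tz(end_time_utc, "US/Eastern")   # same instant, Eastern clock time

ms_played <- 5610
seconds   <- round(ms_played / 1000, 2)   # milliseconds -> seconds
minutes   <- round(seconds / 60, 2)       # seconds -> minutes

format(end_time_eastern, "%Y-%m-%d %H:%M")  # "2020-10-29 09:11"
seconds                                     # 5.61
minutes                                     # 0.09
```

Note that with_tz() changes only how the moment is displayed, not the moment itself, which is why 13:11 becomes 09:11 on the same day.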

Now I have the data in the format I want. I can now begin to explore.

What’s the date range of the data?

# First day
first_day <- min(stream_hist_tbl_2$endTime)

# Last day
last_day <- max(stream_hist_tbl_2$endTime)

last_day - first_day
## Time difference of 366.3014 days

I have streaming history for an entire year from February 27, 2020 to February 27, 2021.

How many unique artists did I listen to in the past year?

# Count the number of unique artists

stream_hist_tbl_2 %>% 
    select(artistName) %>% 
    distinct() %>% 
    count()
n
2740

This may seem like a huge number. However, due to COVID-19, there have been no in-person Zouk activities since March 2020, so I have not spent as much time as I usually do discovering new music and creating new playlists.

What artists did I listen to the most in the past year?

stream_hist_tbl_2 %>% 
    group_by(artistName) %>% 
    summarise(count = n()) %>% 
    ungroup() %>% 
    arrange(desc(count)) %>% 
    mutate(artistName = artistName %>% as_factor() %>% fct_rev()) %>% 
    slice(1:20) %>% 
    
    # Aesthetics
    ggplot(aes(count, artistName))+
    geom_point(size = 2, color = "black")+
    geom_segment(aes(x = 0, xend = count, 
                     y = artistName, yend = artistName))+
    
    # Labels
    geom_label(aes(label = count), hjust = "inward", size = 3)+
    
    # Themes (theme_classic() goes first so it does not reset the axis text size)
    theme_classic()+
    theme(axis.text.y = element_text(size = 8))+
    
    # Formatting
    scale_x_continuous(expand = c(0.02, 0.02), limits = c(0, 450))+
    labs(title = "Top 20 Artists Streamed",
         subtitle = "Number of Times Streamed",
         y = "",
         x = "\nNumber of Times Streamed")


If you didn’t already know, I do love Dalex. This was interesting to see. Before looking at this data, if you had asked me who my most streamed artist was, I probably would have said Sickick.
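As a side note, the group_by() / summarise() / arrange() steps used for the tally can be written more compactly with dplyr's count(). This is only a sketch on made-up toy data (toy_streams is hypothetical), not the code behind the chart.

```r
library(dplyr)

# Hypothetical toy data: one row per stream
toy_streams <- tibble(
  artistName = c("Dalex", "Darell", "Dalex", "Sech", "Dalex", "Darell")
)

# count() groups, tallies, and (with sort = TRUE) orders by frequency in one step
top_artists <- toy_streams %>%
  count(artistName, sort = TRUE, name = "count") %>%
  slice_head(n = 2)

top_artists
# Dalex has 3 streams and Darell has 2, so those are the two rows kept
```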

Now I want to take a different approach. Recall that msPlayed is how many milliseconds the track was listened to, which means that in some cases I may not have listened to the entire track. Sometimes I start playing a track, then click “Go to radio” so that Spotify uses the current track to create a playlist of similar tracks, and then I skip through the songs in the new playlist to find what sounds good to me.

Now I’m going to filter for songs I streamed for more than 3 minutes, meaning I listened to most of the song, if not the whole thing.

stream_hist_tbl_2 %>% 
    filter(minutes > 3.0) %>% 
    group_by(artistName) %>% 
    summarise(count = n()) %>% 
    ungroup() %>% 
    arrange(desc(count)) %>% 
    mutate(artistName = artistName %>% as_factor() %>% fct_rev()) %>% 
    slice(1:20) %>% 
    
    # Aesthetics
    ggplot(aes(count, artistName))+
    geom_point(size = 2, color = "black")+
    geom_segment(aes(x = 0, xend = count, 
                     y = artistName, yend = artistName))+
    
    # Labels
    geom_label(aes(label = count), hjust = "inward", size = 3)+
    
    # Themes (theme_classic() goes first so it does not reset the axis text size)
    theme_classic()+
    theme(axis.text.y = element_text(size = 8))+
    
    # Formatting
    scale_x_continuous(expand = c(0.02, 0.02), limits = c(0, 115))+
    labs(title = "Top 20 Artists Streamed",
         subtitle = "Number of Times Streamed Longer Than 3 Minutes",
         y = "",
         x = "\nNumber of Times Streamed")


Not surprisingly, Dalex, Darell and Justin Quiles are still the top 3.

What are my most streamed songs from these 3 artists?

# Dataframe of streams that lasted at least 3 minutes

stream_hist_3_mins <- 
    stream_hist_tbl_2 %>% 
    filter(minutes >= 3.0)

# Function to get the 5 most-streamed tracks for one artist
func_artist_filter <- function(dataframe, artist_name){
    
     dataframe %>% 
        # artist_name is a plain string, so !! simply injects its value
        filter(artistName == !! artist_name) %>% 
        group_by(artistName, trackName) %>% 
        summarise(count = n()) %>% 
        ungroup() %>% 
        arrange(desc(count)) %>% 
        slice(1:5) %>% 
        mutate(trackName = trackName %>% as_factor() %>% fct_rev())
}

# Filter for each of the top 3 artists
dalex_top_songs <- func_artist_filter(dataframe = stream_hist_3_mins,
                                      artist_name = "Dalex")

darell_top_songs <- func_artist_filter(dataframe = stream_hist_3_mins,
                                      artist_name = "Darell")

justin_top_songs <- func_artist_filter(dataframe = stream_hist_3_mins,
                                      artist_name = "Justin Quiles")

# Combine top songs
rbind(justin_top_songs, darell_top_songs, dalex_top_songs) %>% 
    
    # Aesthetics
    ggplot(aes(count, trackName, color = artistName))+
    geom_point(size = 2)+
    geom_segment(aes(x = 0, xend = count, 
                     y = trackName, yend = trackName))+
    
    # Labels
    geom_label(aes(label = count), hjust = "inward", size = 4)+
    
    # Theme & formatting
    theme_classic()+
    labs(title = "Top Songs by Top 3 Artists",
         subtitle = "Number of Times Streamed",
         x = "\nNumber of Times Streamed", 
         y = "", 
         color = "Artist Name")


Clearly I’ve been on a roll with Velitas, listening to the song 58 times (counting only streams of 3 minutes or more) since I first discovered it on September 20, 2020. In terms of total listening time, it works out to roughly 115 full plays: my total minutes streamed for the song is 396 and the song is 3.45 minutes long, so 396/3.45 ≈ 115. Additionally, it’s been 166 days since I first heard the song, and 115/166 ≈ 0.69, so I’ve averaged about two plays of Velitas every three days since September 20, 2020. If you’re reading this and you haven’t already, yeah, you should probably listen to Velitas.
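The back-of-the-envelope numbers for Velitas can be checked in a few lines. The inputs are the figures quoted in the text (396 total minutes, a 3.45-minute track, 166 days); the variable names are just for illustration.

```r
# Figures quoted in the text
total_minutes <- 396    # total minutes streamed for Velitas
track_minutes <- 3.45   # track length in minutes
days_since    <- 166    # days since first hearing the song

# Equivalent number of full plays, and the average per day
full_plays    <- total_minutes / track_minutes
plays_per_day <- full_plays / days_since

round(full_plays)        # 115
round(plays_per_day, 2)  # 0.69
```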

Weekly streaming pattern since February 27, 2020

# Get weekly streaming data
hours_streamed_tbl <- 
    stream_hist_tbl_2 %>% 
    # Roll each date back to the start of its week, then total the hours
    group_by(date = floor_date(date, "week")) %>% 
    summarise(hours_streamed = sum(minutes)/60) %>% 
    arrange(date) 

#  Plot
hours_streamed_tbl %>% 
    ggplot(aes(date, hours_streamed))+
    geom_col(aes(fill = hours_streamed))+
    geom_hline(yintercept = 6.0, size = 0.2)+
    theme_classic()+
    scale_fill_viridis_c(option = "magma", direction = -1)+
    labs(title = "Hours Streamed - Weekly",
         x = "\nDate (Weekly)",
         y = "Hours Streamed",
         fill = "Hours Streamed")


There are about 7 weeks that stand out, with more than 6 hours streamed. Usually when I discover a new song, I’ll listen to it for the entire week. For the weeks of September 20, 2020 and September 27, 2020, we already know what I was listening to: Velitas! Let’s see what other songs I was streaming in the other weeks.
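To see how the weekly roll-up works, here is a small sketch on made-up data (toy_streams and its values are hypothetical). floor_date() maps every date to the first day of its week, which by default is a Sunday; week_start = 7 is spelled out here to make that explicit.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: minutes streamed on four different days
toy_streams <- tibble(
  date    = as_date(c("2020-04-26", "2020-04-29", "2020-05-02", "2020-05-03")),
  minutes = c(30, 60, 90, 120)
)

# Group by the Sunday that starts each week, then convert minutes to hours
toy_weekly <- toy_streams %>%
  group_by(week = floor_date(date, "week", week_start = 7)) %>%
  summarise(hours_streamed = sum(minutes) / 60)

toy_weekly
# The first three dates fall in the week of 2020-04-26 (180 minutes = 3 hours);
# 2020-05-03 starts a new week (120 minutes = 2 hours)
```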


Week of April 26, 2020:

# Create a function to filter stream history for a given date and return the top 5 songs streamed

func_top_songs_top_weeks <- function(dataframe, week_of){
    
    dataframe %>% 
        # Note: this keeps only streams whose date equals week_of exactly,
        # i.e. that single calendar day rather than the full week
        filter(date == !! week_of) %>% 
        group_by(date = floor_date(date, "week"), artistName, trackName) %>% 
        summarise(hours_streamed = round(sum(minutes)/60, 2)) %>% 
        ungroup() %>% 
        # slice_max() keeps ties, so more than 5 rows can be returned
        slice_max(hours_streamed, n = 5) %>% 
        select(-date)
}

# Top songs, week of 2020-04-26
func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-04-26")
artistName trackName hours_streamed
Justin Quiles DJ No Pare 0.40
Dalex Bellaquita 0.30
Dalex Vuelva A Ver - Remix 0.23
Justin Quiles Comerte A Besos 0.15
J Balvin COMO UN BEBÉ 0.12
Piso 21 Mami 0.12

Week of June 14, 2020:

# Top songs, week of 2020-06-14

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-06-14")
artistName trackName hours_streamed
LAYNE Midnight 0.21
Michael Kiwanuka Love & Hate 0.12
Jessie Ware First Time 0.11
Ya Levis Ça ne me touche pas 0.10
Ne-Yo U 2 Luv 0.08
RHODES Let It All Go 0.08

Week of July 12, 2020:

# Top songs, week of 2020-07-12

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-07-12")
artistName trackName hours_streamed
Busy Signal Jamaica Jamaica 0.52
Ne-Yo U 2 Luv 0.27
Rationale Hurts the Most 0.26
HEDEGAARD JUMANJI (feat. CANCUN?) 0.20
Foy Give Me the Night 0.17
HEDEGAARD Don’t F*$k With Me (feat. Era Wadi & CANCUN?) 0.17

Week of September 20, 2020:

# Top songs, week of 2020-09-20

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-09-20")
artistName trackName hours_streamed
Darell Velitas 0.49
Sech Boomerang 0.42
Justin Quiles DJ No Pare 0.17
INNA Read My Lips 0.12
Mika Mendes Tell Me Baby (feat. Chachi Carvalho) 0.10

Week of September 27, 2020:

# Top songs, week of 2020-09-27

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-09-27")
artistName trackName hours_streamed
Justin Quiles DJ No Pare 0.49
J Balvin ODIO 0.35
Sofco KNFKKRFK 0.28
Darell Velitas 0.26
Mau y Ricky BOTA FUEGO 0.20

Week of October 18, 2020:

# Top songs, week of 2020-10-18

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2020-10-18")
artistName trackName hours_streamed
Llane Será 0.36
Darell Velitas 0.22
Maluma Parce (feat. Justin Quiles) 0.13
Sergio Tropa 0.07
Foy Give Me the Night 0.04

Week of January 03, 2021:

# Top songs, week of 2021-01-03 (the call below passes 2021-01-04, the Monday of that week)

func_top_songs_top_weeks(dataframe = stream_hist_tbl_2, week_of = "2021-01-04")
artistName trackName hours_streamed
Quique Candela 0.40
Kelly Kiara Set Me Up 0.27
Dan + Shay Tequila 0.20
Haley Smalls Lie Pon Mi 0.11
Ginette Claudette Love Me Back 0.10

Hourly Streaming Pattern - Monthly

# Hourly Streaming Heatmap

stream_hist_tbl_2 %>% 
    mutate(year = year(date),
           month = month(date, label = TRUE),
           day = day(date), 
           hour = hour(endTime)) %>% 
    select(-endTime, -artistName, -trackName) %>% 
    group_by(date, hour) %>% 
    summarise(minutes_streamed = sum(minutes)) %>% 
    ungroup() %>% 
  
    # Aesthetics
    ggplot(aes(hour, date, fill = minutes_streamed))+
    geom_tile()+
    
    # Theme
    theme_classic()+
  
    # Formatting
    scale_x_continuous(breaks = seq(0, 23, by = 1))+
    scale_fill_viridis_c(option = "magma", direction = -1)+
    labs(title = "Hourly Streaming Trend by Months",
         x = "Hour of Day", y = "Month\n", fill = "Minutes Streamed")


I have streaming activity throughout the day from 7am till about 11pm, with heavier activity from about 5pm.

Hourly Streaming Pattern - Day of Week

stream_hist_tbl_2 %>% 
    mutate(year = year(date),
           month = month(date, label = TRUE),
           day = wday(date, label = TRUE), 
           hour = hour(endTime)) %>% 
    select(-endTime, -artistName, -trackName) %>% 
    group_by(day, hour) %>% 
    summarise(minutes_streamed = sum(minutes)) %>% 
    ungroup() %>% 
    mutate(day = day %>% as_factor()) %>% 
  
    # Aesthetics
    ggplot(aes(hour, day, fill = minutes_streamed))+
    geom_tile()+
  
    # Themes
    theme_classic()+
  
    # Formatting
    scale_x_continuous(breaks = seq(0, 23, by = 1))+
    scale_fill_viridis_c(option = "magma", direction = -1)+
    labs(title = "Hourly Streaming Trend by Day of Week",
         x = "Hour of Day", y = "Day of Week\n", fill = "Minutes Streamed")


There is more streaming activity on Saturdays and Sundays from about 10am, and also on Mondays from 5pm.
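Both heatmaps rest on pulling the hour and the weekday out of endTime with lubridate. Here is a tiny sketch with a made-up timestamp (the value is hypothetical) showing what those helpers return.

```r
library(lubridate)

# Hypothetical stream that ended on a Saturday evening, Eastern time
end_time <- ymd_hm("2020-10-31 18:30", tz = "US/Eastern")

stream_hour <- hour(end_time)                   # hour of day on a 0-23 scale
stream_wday <- wday(end_time, week_start = 7)   # 1 = Sunday ... 7 = Saturday

stream_hour                    # 18
stream_wday                    # 7
wday(end_time, label = TRUE)   # labelled version; prints Sat in an English locale
```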

Thanks for reading Part 1. In Part 2, I’ll be exploring the spotifyr package. This package lets you connect to the Spotify API and extract some more interesting data about tracks.

Feel free to follow me on Spotify here. You can also check out some of my playlists below:

Zoukables Jan 2021

Zouk Nov 2019

Latin Mood