Analyzing Spotify Listening Patterns

DATA 110, Final Project

Author

Max Reed

Published

July 7, 2024

Introduction & About the Data

Our Research Topic

In this project, I explore a dataset of my top 50 songs from each month of the first half of 2024. The dataset is a .csv pulled directly from my Spotify listening history and broken down into variables, which we will use to compare tracks and note any trends within the data.

Variables

The dataset was built from my own listening patterns. Its variables are as follows:

“Track Name” is the song name.
“Artist Name” is the name of the artist(s) who performed the song.
“Month” is the monthly playlist the track was featured on (January through June 2024).
“Month Rank” is the rank the track had in a given month's playlist. A lower number is a higher rank.
“Track Length (ms)” is the length of the song in milliseconds.
“Year of Song Release” is the year the song was released to the public or originally recorded.
“Pop Score” is a popularity score Spotify assigns to songs based on all users’ listens. A lower number is a higher rank.
“Language” is the language a song is performed in, or “Instrumental” if the song has no vocals or lyrics.

Background and About the Data

As a Spotify listener, I started using the music streaming service when I heard about their “Spotify Wrapped.” It’s a personalized summary and playlist of each user’s most listened-to tracks, albums, artists, and genres, given at the end of the year. However, I was disappointed to learn that Spotify Wrapped only uses music data from January through October (Bowenbank), meaning my November and December listens wouldn’t be included every year.

As a result, I began creating my own playlists at the end of each month. Using Spotify’s desktop app features, I craft playlists of my top 50 listened-to tracks each month and have been archiving them for purposes like this. You can get your own song data or playlist data converted to .csv using this Spotify API tool: https://exportify.net/. I then removed most unneeded columns for my data and compiled my individual month data into one master sheet for us to work with. In this project, we will be working with data from my January 2024 through June 2024 playlists. Ultimately, I chose this topic and dataset to get a deeper understanding of these playlists I’ve been creating and to get a better understanding of how parts of the official Spotify Wrapped really work.
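
The compiling step is not shown in this report. A minimal sketch of how it could be done in R, assuming one Exportify export per month whose rows are in playlist order: the file names, the tiny stand-in files, and the way `Month` and `Month Rank` are derived are all hypothetical, chosen only so the example runs on its own.

```r
library(readr)
library(dplyr)

# Hypothetical stand-ins for two monthly Exportify exports
export_dir <- tempfile("exports")
dir.create(export_dir)
write_csv(tibble(`Track name` = c("Song A", "Song B")), file.path(export_dir, "January.csv"))
write_csv(tibble(`Track name` = c("Song C", "Song D")), file.path(export_dir, "February.csv"))

# Read each monthly file, tagging rows with the month (taken from the file name)
# and the rank within that month (assumed to be the row order of the export)
files <- list.files(export_dir, pattern = "\\.csv$", full.names = TRUE)
master <- bind_rows(lapply(files, function(path) {
  read_csv(path, show_col_types = FALSE) %>%
    mutate(Month = tools::file_path_sans_ext(basename(path)),
           `Month Rank` = row_number())
}))

print(master)
```

With six real exports in the folder, the same loop would yield one master sheet of roughly 300 rows.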

Preparing the dataset for manipulation

# loading all necessary libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# loading in dataset

setwd("~/24X Course Work/DATA110")

music <- read_csv("REED Spotify Library v1.3.csv")

Rows: 299 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Track name, Artist name, Album, Month, Genre, Language
dbl (6): Order, Month Rank, Track Length (ms), Year of Song Release, Time Si...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(music)

# A tibble: 6 × 12
  Order `Track name`  `Artist name` Album Month `Month Rank` `Track Length (ms)`
  <dbl> <chr>         <chr>         <chr> <chr>        <dbl>               <dbl>
1     1 Meet The Pla… Grey Henson   Mean… Janu…            1              257920
2     2 Tokyo Calling ATARASHII GA… Toky… Janu…            2              191180
3     3 Daughter of … TV Girl       Fren… Janu…            3              153440
4     4 Taking What'… TV Girl       Who … Janu…            4              205650
5     5 What You Want Annaleigh As… Lega… Janu…            5              487893
6     6 Backstabber   Kesha         Anim… Janu…            6              186840
# ℹ 5 more variables: Genre <chr>, `Year of Song Release` <dbl>,
#   `Time Signature` <dbl>, `Pop Score` <dbl>, Language <chr>

# cleaning the data: lowercase the column names, then strip the spaces
names(music) <- tolower(names(music))
names(music) <- gsub(" ", "", names(music))

# creating a new variable "tracklengthSEC" that converts "tracklength(ms)" into seconds,
# then removing unneeded columns from the dataset

music2 <- music %>%
  mutate(tracklengthSEC = `tracklength(ms)` / 1000) %>%
  select(-genre, -album, -timesignature)

head(music2)

# A tibble: 6 × 10
  order trackname artistname month monthrank `tracklength(ms)` yearofsongrelease
  <dbl> <chr>     <chr>      <chr>     <dbl>             <dbl>             <dbl>
1     1 Meet The… Grey Hens… Janu…         1            257920              2018
2     2 Tokyo Ca… ATARASHII… Janu…         2            191180              2023
3     3 Daughter… TV Girl    Janu…         3            153440              2014
4     4 Taking W… TV Girl    Janu…         4            205650              2016
5     5 What You… Annaleigh… Janu…         5            487893              2007
6     6 Backstab… Kesha      Janu…         6            186840              2010
# ℹ 3 more variables: popscore <dbl>, language <chr>, tracklengthSEC <dbl>
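
As a small illustration of what the cleaning steps do, the sketch below applies the same lowercase-and-strip-spaces logic to a few of the raw column names and converts the first track's length to seconds. The vector `raw_names` is just a hand-typed sample of the headers shown earlier, not the full set.

```r
# A sample of column names as they appear in the raw file
raw_names <- c("Track name", "Month Rank", "Track Length (ms)", "Year of Song Release")

# Same two steps as the cleaning code: lowercase, then remove spaces
clean_names <- gsub(" ", "", tolower(raw_names))
print(clean_names)
# "trackname" "monthrank" "tracklength(ms)" "yearofsongrelease"

# Milliseconds to seconds, using the first track's length
first_track_sec <- 257920 / 1000
print(first_track_sec)  # 257.92 seconds, a little over four minutes
```

Note that the parentheses survive the cleaning, which is why `tracklength(ms)` still needs backticks when it is used inside `mutate()`.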

head(music)

# A tibble: 6 × 12
  order trackname       artistname album month monthrank `tracklength(ms)` genre
  <dbl> <chr>           <chr>      <chr> <chr>     <dbl>             <dbl> <chr>
1     1 Meet The Plast… Grey Hens… Mean… Janu…         1            257920 broa…
2     2 Tokyo Calling   ATARASHII… Toky… Janu…         2            191180 j-po…
3     3 Daughter of a … TV Girl    Fren… Janu…         3            153440 pov:…
4     4 Taking What's … TV Girl    Who … Janu…         4            205650 pov:…
5     5 What You Want   Annaleigh… Lega… Janu…         5            487893 broa…
6     6 Backstabber     Kesha      Anim… Janu…         6            186840 danc…
# ℹ 4 more variables: yearofsongrelease <dbl>, timesignature <dbl>,
#   popscore <dbl>, language <chr>

# taking our edited spreadsheet into Tableau by creating a new .csv with the write.csv command

write.csv(music2, file = "spotify_data_for_tableau.csv")
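
One detail worth knowing about `write.csv()`: by default it also writes the row numbers as an extra first column, which then shows up as an unnamed field in whatever tool opens the file. The sketch below shows the difference on a hypothetical two-row data frame (`toy` and the temp files are illustrative only); passing `row.names = FALSE` is an optional tweak, not something the export above depends on.

```r
# Hypothetical two-row data frame standing in for music2
toy <- data.frame(trackname = c("Song A", "Song B"), monthrank = c(1, 2))

with_rows <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(toy, file = with_rows)                        # default: row numbers become a first column
write.csv(toy, file = without_rows, row.names = FALSE)  # only the data columns are written

ncol(read.csv(with_rows))     # 3
ncol(read.csv(without_rows))  # 2
```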

Exploring the dataset with plotting

The following visualizations explore the data and the relationships between its variables.

One thing that is immediately clear is how prevalent English is in my top 50. Given that it is my first language, most lyrical music I listen to is in English. However, another thing that stands out is the increase in instrumental music, particularly in April and May. I would attribute this to my study playlists: finals and large projects pile up as the semester progresses, and I tend to listen exclusively to instrumental music when I need to focus.
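
The same month-by-language breakdown can be checked numerically in R. This is a minimal sketch on made-up rows: `toy` is a hypothetical stand-in with the same `month` and `language` columns as `music2`, and the counts are not taken from the real data.

```r
library(dplyr)

# Hypothetical stand-in rows; the real music2 has the same two columns
toy <- tibble(
  month    = c("January", "January", "January", "April", "April", "April"),
  language = c("English", "English", "Japanese", "English", "Instrumental", "Instrumental")
)

# Tracks per language within each month, plus each language's share of that month
language_share <- toy %>%
  count(month, language) %>%
  group_by(month) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(language_share)
```

Swapping `toy` for `music2` would give the share of instrumental tracks in each month's top 50, which is one way to put a number on the April and May increase.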

In this Year of Song Release visualization, we can see a sharp increase in the frequency of songs entering my top 50 starting around 2008 to 2010. I attribute this to that being when I got regular internet access and began to take agency in the media and art I consumed, so it makes sense that most of the music in my top 50 would be from around this time and later.

This visualization is intentionally set to display only artists with 2 or more tracks in a given month at first. Showcasing all single-track artists can be overwhelming on first viewing, so giving the reader the option to tab into single listens is important to me. There are no immediately apparent takeaways from this visualization, aside from the prevalence, once again, of English over other languages and types of song. The outliers are “ATARASHII GAKKO!”, with 8 tracks across their multi-track months (10 counting additional single-track months), and “DV-i”, whose primarily instrumental tracklist earns them 7 track places across the months recorded.

Exploring Track Length

Track length’s relationship to other variables is something I’m particularly eager to explore. I’ve often theorized that short tracks have an easier time getting onto my top 50 because of their higher ratio of “listens” (full track completions) to listening time (total time spent listening to a track, rather than individual loop counts). Going into this, and even going into making the playlists in the first place, I expected an overwhelming number of short songs.

It’s worth noting that, because I listened to some musicals earlier this year, longer tracks are more prevalent than usual in this listening data. For example, the two upper outliers for January are tracks from musicals that each run over 8 minutes. The same applies to May’s longest tracks, which are also musical songs of irregular length. On the other end of the spectrum is May and June’s shortest track, a catchy joke song from TikTok at just barely over 30 seconds in length. Both the boxplot comparison and the histogram give a fair and even showcase of the data. Contrary to more common trends, my cluster of listens around 150 seconds gives my histogram a second, earlier bump not seen in larger listening datasets (IntelligentMusic.org). Given this mix and variety of tracks, and that tracks as long as those mentioned above were able to rank in my top 50 at all, I do not believe that short songs have any particular advantage over their lengthier counterparts.
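
A numeric companion to the boxplots might look like the sketch below, which summarises track length per month. The rows in `toy` are illustrative stand-ins (the May values in particular are invented), and `under_3_min` is a hypothetical helper column measuring the share of tracks shorter than 180 seconds.

```r
library(dplyr)

# Illustrative stand-in rows with the same columns as music2
toy <- tibble(
  month          = c("January", "January", "January", "May", "May", "May"),
  tracklengthSEC = c(153.44, 205.65, 487.893, 31, 190, 510)
)

# Shortest, median, and longest track per month, plus the share under three minutes
length_summary <- toy %>%
  group_by(month) %>%
  summarise(
    shortest    = min(tracklengthSEC),
    median      = median(tracklengthSEC),
    longest     = max(tracklengthSEC),
    under_3_min = mean(tracklengthSEC < 180)
  )

print(length_summary)
```

Run on `music2`, a table like this would show directly whether short tracks dominate any month, which is the question the short-song theory turns on.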

Statistical Analysis

Because the following visualization places the Pop Score and Monthly Rank graphs side by side, I felt it only fair to measure how the two relate across all months before breaking them down by song length.

ggplot(music2, aes(x = popscore, y = monthrank)) +
  geom_point() +
  geom_smooth(method = "lm") +  # linear trend line with its confidence band
  scale_y_reverse() +           # reversed so that rank 1 sits at the top
  labs(title = "Pop Score v. Month Rank")

`geom_smooth()` using formula = 'y ~ x'

lm(monthrank ~ popscore, data = music2)


Call:
lm(formula = monthrank ~ popscore, data = music2)

Coefficients:
(Intercept)     popscore  
     27.602       -0.046

Based on this output, the model predicts that a track with a pop score of 0 would have a monthly rank of about 27.6, and that each additional pop score point lowers the predicted rank number by 0.046. In other terms, a less popular song in a month’s top 50 is more likely to have a higher rank (smaller number) in the monthly playlist. I generally consider myself to be into more niche music than most, so it makes sense that music more popular with others would be unlikely to gain traction. However, a slope of -0.046 is so slight that it barely makes a difference even if the trend is real: across a 50-point gap in pop score, the predicted rank shifts by only about 2.3 places.
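
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below plugs the two reported coefficients into the regression equation; `predict_rank` is a hypothetical helper name, not part of the original analysis.

```r
# Coefficients copied from the lm() output above
intercept <- 27.602
slope     <- -0.046

# Predicted month rank for a given pop score: intercept + slope * popscore
predict_rank <- function(popscore) intercept + slope * popscore

predicted <- predict_rank(c(0, 50, 100))
print(predicted)
# 27.602 25.302 23.002
```

Even across the full 0 to 100 range, the predicted rank moves by fewer than five places. The report prints only the coefficients; calling `summary()` on the fitted model would additionally show the slope's standard error and p-value, which would indicate whether so small a slope is distinguishable from zero.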

Overall, I’m surprised to see how varied the trends are month to month. Only March and June of the Monthly Rank graphs have a trend line inverted relative to the others, implying that longer tracks in those months did not get as many streams/listens as they would in other months. Similarly, it’s interesting to note how alike the trend lines are in April and May across both the rank and score graphs. I would think this implies that instrumental tracks, which are more prevalent in these months, place similarly under both systems.
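
The direction of each month's trend line can also be read off as a number by fitting one regression per month. The sketch below does this on made-up data: the month labels and values in `toy` are hypothetical and chosen only so that one month trends each way, and they say nothing about the real March or April results.

```r
# Made-up data: two months with opposite length-versus-rank trends
toy <- data.frame(
  month          = rep(c("March", "April"), each = 4),
  tracklengthSEC = c(120, 180, 240, 300, 120, 180, 240, 300),
  monthrank      = c(1, 2, 3, 4, 4, 3, 2, 1)
)

# Fit monthrank ~ tracklengthSEC separately for each month and keep the slope
slopes <- sapply(split(toy, toy$month), function(d) {
  coef(lm(monthrank ~ tracklengthSEC, data = d))[["tracklengthSEC"]]
})

print(slopes)
```

A positive slope means longer tracks tend to land at larger rank numbers (lower in the playlist), and a negative slope means the opposite, so months whose slopes differ in sign are the ones whose trend lines look inverted.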

Bibliography

Bowenbank, Starr. “Spotify Wrapped: Here’s How to See Your Top Music For 2021.” Billboard, Nov 10, 2021. https://www.billboard.com/business/streaming/spotify-wrapped-insights-how-to-find-2021-9657709/

“Duration of songs: How did the trend change over time and what does it mean today?” IntelligentMusic.org, Aug 9, 2021. https://www.intelligentmusic.org/post/duration-of-songs-how-did-the-trend-change-over-time-and-what-does-it-mean-today