Part 1 Area Of Interest

Background information

This semester I have spent a lot of time analyzing song data from Spotify, a topic I find extremely interesting. It grew out of my love of analytics and my love of music. Every year I look forward to my Spotify Wrapped, seeing my listening statistics for the year and how they have changed from the year before. I have really enjoyed taking a deep dive into my own Spotify data and into some of the world’s most popular songs and artists.

After I learned how to use the Spotify API, I was even able to analyze my own listening data. I am curious how my listening habits and favorite songs compare to the most popular songs from 2020 through the end of 2021: how much overlap there is, and what stands out when I compare the two data sets. Beyond that overall comparison, I am interested in what makes a song get streamed more than others. Is it the chord it’s written in? Is it the danceability of the song? I will explore these questions in this document.

Data overview

My structured data set is from Kaggle and covers the most popular songs from 2020 through 2021. It includes the artist name, the album the song is on, how many streams the song has, the genre of music, the danceability of the song, the chord the song was written in, and more. These variables will help me analyze what makes a song popular and how my own listening history compares to the most popular songs. Below is a complete data dictionary of the data set. Link to the data set:

Data Dictionary

Variable                    Description
Index                       Unique ID for each song in the data set
Highest Charting Position   Highest position the song has hit on the charts
Number of Times Charted     How many times the song has charted
Week of Highest Charting    The week the song had its highest charting position
Song Name                   Name of the song
Streams                     How many times the song has been played
Artist                      Artist name
Artist Followers            How many followers the artist has
Song ID                     Song’s unique ID
Genre                       Genre of the song
Release Date                Release date of the song
Weeks Charted               How many weeks the song charted
Popularity                  Popularity score of the song
Danceability                How danceable the song is
Energy                      Measure of the energy of the song
Loudness                    Measure of how loud the song is
Speechiness                 Measure of how much of the song is spoken
Acousticness                Measure of how acoustic the song is
Liveness                    Measures the presence of an audience in a song
Tempo                       Tempo of the song
Duration (ms)               How long the song is, in milliseconds
Valence                     Describes the musical positiveness in a song
Chord                       Chord the song was written in

Data Display

Interesting Summary Stats

The table below summarizes some important characteristics of the data, grouped by artist or band. For each artist it shows the artist name, the average follower count, the maximum and minimum streams for a song, the average song energy, the average song danceability, the maximum and minimum song tempo, and the chord the artist’s songs are most often written in. This provides a nice overview of the data set of the most popular songs on Spotify from 2020 to 2021.
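A grouped summary like the one described above can be sketched in base R. This is only an illustration: the rows are made up, the column names are assumed to match the data dictionary, and `most_common` and `summarise_artist` are hypothetical helper names rather than the code used for this report.

```r
# Hypothetical rows standing in for the Kaggle data; column names are assumed.
songs <- data.frame(
  Artist  = c("A", "A", "A", "B", "B"),
  Streams = c(100, 300, 200, 50, 70),
  Energy  = c(0.5, 0.7, 0.6, 0.2, 0.4),
  Tempo   = c(120, 90, 100, 80, 140),
  Chord   = c("C", "G", "C", "D", "D"),
  stringsAsFactors = FALSE
)

# Most frequent value in a vector (ties go to the alphabetically first value).
most_common <- function(x) names(which.max(table(x)))

# One summary row for the songs of a single artist.
summarise_artist <- function(d) {
  data.frame(
    Artist      = d$Artist[1],
    max_streams = max(d$Streams),
    min_streams = min(d$Streams),
    avg_energy  = mean(d$Energy),
    max_tempo   = max(d$Tempo),
    min_tempo   = min(d$Tempo),
    top_chord   = most_common(d$Chord),
    stringsAsFactors = FALSE
  )
}

# Split the songs by artist, summarise each group, and stack the results.
by_artist <- do.call(rbind, lapply(split(songs, songs$Artist), summarise_artist))
print(by_artist)
```

The split/apply/combine pattern keeps the example free of extra packages; a grouped summary from a data-manipulation package would give the same table.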

Part 2 Descriptive Analysis

Does the popularity of a song depend on its danceability score?

Summary: Looking at the graph above, there really isn’t a strong correlation between the danceability and the energy of a song; the points are scattered with no clear pattern. I was a little surprised by this, because my initial assumption was that the higher a song’s energy, the higher its danceability score would be, since most dance music is upbeat and high energy.
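The visual impression of "no strong correlation" can be backed up with a number. The sketch below uses made-up values in place of the Danceability and Energy columns, so the result says nothing about the real data; it only shows the calls.

```r
# Hypothetical values; in the report these would be the Danceability and
# Energy columns of the chart data.
danceability <- c(0.60, 0.72, 0.55, 0.81, 0.47, 0.69)
energy       <- c(0.80, 0.41, 0.66, 0.58, 0.73, 0.50)

# Pearson correlation coefficient, always between -1 and 1.
r <- cor(danceability, energy)

# The same estimate with a p-value and a confidence interval attached.
ct <- cor.test(danceability, energy)

print(r)
print(ct$p.value)
```

A coefficient near 0 would match a scatterplot with no visible trend, while values near 1 or -1 would indicate a strong linear relationship.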

Part 3, Secondary Data Source

Background on secondary data

My secondary data set is pulled from my own Spotify account and listening history using the Spotify API. This lets me see my listening history in real time and shows me some pretty interesting statistics on my favorite songs, albums, artists, and podcast shows too. The API also distinguishes between short term and long term listening habits. For example, maybe in the past 2 months I have been listening to a lot of Olivia Rodrigo songs, but over the past 8 months to a year I have consistently listened to Chelsea Cutler a great deal. With the Spotify API, the Olivia Rodrigo case would be considered short term and the Chelsea Cutler case long term. In the next few tables and visuals I will analyze my own listening history and compare the top songs from 2020-2021 to my own favorite songs in my Spotify library.
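The short term versus long term distinction maps onto the `time_range` argument of `get_my_top_artists_or_tracks()` in the spotifyr package. The wrapper below is a minimal sketch, not the code behind this report: `fetch_top` is a hypothetical helper name, and running it against the real API assumes spotifyr is installed and Spotify credentials are already configured for authorization.

```r
# Sketch only. `fetch_top` is a hypothetical helper, not part of spotifyr.
# The default `fetch` is only looked up when it is actually called, so the
# function can be defined (and tested with a stand-in) without spotifyr.
fetch_top <- function(type = c("artists", "tracks"),
                      time_range = c("short_term", "medium_term", "long_term"),
                      limit = 20,
                      fetch = spotifyr::get_my_top_artists_or_tracks) {
  type <- match.arg(type)              # rejects anything but the two valid types
  time_range <- match.arg(time_range)  # rejects an unknown time range
  fetch(type = type, limit = limit, time_range = time_range)
}

# Assumed usage once credentials are set up (not run here):
# top_tracks_short <- fetch_top("tracks", "short_term")
# top_artists_long <- fetch_top("artists", "long_term")
```

Validating the arguments up front means a typo such as "longterm" fails immediately in R instead of producing an error from the API.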

Exploratory / Descriptive Analysis

What are my short term favorite / most played tracks and artists?

Summary: The table above shows my recent, short term top 20 songs, their respective artists, and the album each song is on. It is pretty interesting to look at this and realize how much I listen to some songs compared to others. It is definitely one of those situations where you think you listen to one song all the time, but in reality you really aren’t playing it all that much. I can already see some overlap between some of my recent favorite artists and artists who had songs in the top tracks of the 2020-2021 data set.

Who are my long term favorite artists and what genre of music do they sing?

Summary: The table above shows my top 20 long term favorite artists and the genres their music falls under. I am not surprised to see Chelsea Cutler, Quinn XCII, and LANY as my top three long term favorite artists, because they have all been in my top 5 artists in my Spotify Wrapped for the past 3 or so years. It is also interesting to see some of the genres my favorite artists fall under, like “electropop”, “indie pop rap”, “pop rock”, and “indie poptimism”.

Comparison Table of My top Artists & their Genres compared to Spotify’s Top Artists & their Genres

Summary: The table above has my top genres, based on my top 10 artists and the genres their music falls under. Looking at the list, there is no overlap between the artists; however, there is some overlap in genres, mostly various types of pop music. The biggest difference is that most of the genres I listen to are some combination of indie pop, pop rock, and electronic pop, whereas the top 10 artists in the Spotify data mostly make Latin music or pop music tied to a specific country. I thought this was interesting, because artists typically gain popularity in the country they are from, so it makes sense that many artists in the Spotify data set have specific countries listed in their genres: that is where they are from or where they were discovered.
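The overlap described above can also be checked with set operations instead of by eye. In this sketch the first vector reuses genre labels mentioned in this report plus a plain "pop" entry, and the second vector is entirely hypothetical, standing in for the genres of the chart’s top artists.

```r
# Genres for my top artists (labels taken from this report, plus "pop").
my_genres <- c("electropop", "indie pop rap", "pop rock", "indie poptimism", "pop")

# Hypothetical genres standing in for the top artists in the chart data.
chart_genres <- c("latin", "reggaeton", "pop", "dance pop", "trap latino")

shared     <- intersect(my_genres, chart_genres)  # genres on both lists
only_mine  <- setdiff(my_genres, chart_genres)    # genres only I listen to
only_chart <- setdiff(chart_genres, my_genres)    # genres only in the chart data

# Share of all distinct genres that appear on both lists (Jaccard index).
jaccard <- length(shared) / length(union(my_genres, chart_genres))

print(shared)
print(jaccard)
```

Because genre labels are compared as exact strings, "indie pop" and "pop" count as different genres; grouping the labels into broader families first would be a separate design choice.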

Predictive Analysis

Hypothesis: After diving deep into both the Spotify data set and my own listening statistics, I am curious which variables go into making a song have a high number of streams. I excluded the artist and the release date from the model because they are not numeric measurements of the song, although in a non-statistical sense both could have a large impact on how many streams a song gets. Instead I focused on specific attributes of the song: chord, liveness, danceability, energy, tempo, and valence. I want to see whether these variables are statistically significant in making a song popular, that is, in giving it a high number of streams.

The generalized regression equation is:

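\[Streams_i = \alpha + \beta_1 chord_i + \beta_2 liveness_i + \beta_3 danceability_i + \beta_4 energy_i + \beta_5 tempo_i + \beta_6 valence_i + \varepsilon_i\]

Here i indexes songs and \(\varepsilon_i\) is the error term. Chord is categorical, so \(\beta_1 chord_i\) stands for a set of indicator terms, one for each chord other than the baseline.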

## Analysis of Variance Table
## 
## Response: Streams
##                Df     Sum Sq    Mean Sq F value   Pr(>F)   
## Chord          11 2.0918e+14 1.9016e+13  1.6979 0.068297 . 
## Liveness        1 3.4313e+13 3.4313e+13  3.0637 0.080258 . 
## Danceability    1 1.1026e+14 1.1026e+14  9.8447 0.001736 **
## Energy          1 2.8888e+12 2.8888e+12  0.2579 0.611613   
## Tempo           1 4.6785e+13 4.6785e+13  4.1774 0.041138 * 
## Valence         1 7.4808e+13 7.4808e+13  6.6795 0.009845 **
## Residuals    1528 1.7113e+16 1.1200e+13                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Streams ~ Chord + Liveness + Danceability + Energy + 
##     Tempo + Valence, data = public_clean_spotify)
## 
## Standardized Coefficients::
##  (Intercept)   ChordA#/Bb       ChordB       ChordC   ChordC#/Db       ChordD 
##  0.000000000 -0.046873498  0.016144512 -0.028841717 -0.002134118  0.001635285 
##   ChordD#/Eb       ChordE       ChordF   ChordF#/Gb       ChordG   ChordG#/Ab 
##  0.070531286 -0.033546718 -0.049022629 -0.014224794 -0.020143291  0.015320295 
##     Liveness Danceability       Energy        Tempo      Valence 
##  0.034913409 -0.102904765 -0.016064914  0.049584911  0.074660733
## 
## Call:
## lm(formula = Streams ~ Chord + Liveness + Danceability + Energy + 
##     Tempo + Valence, data = public_clean_spotify)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3662075 -1442994  -866778   102938 41805029 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6944265     692791  10.024  < 2e-16 ***
## ChordA#/Bb    -586487     432746  -1.355 0.175532    
## ChordB         189167     419066   0.451 0.651764    
## ChordC        -323938     409348  -0.791 0.428863    
## ChordC#/Db     -20847     383922  -0.054 0.956704    
## ChordD          20235     429748   0.047 0.962451    
## ChordD#/Eb    1498639     613044   2.445 0.014614 *  
## ChordE        -436547     442510  -0.987 0.324032    
## ChordF        -593813     424955  -1.397 0.162510    
## ChordF#/Gb    -178654     434136  -0.412 0.680753    
## ChordG        -239894     421205  -0.570 0.569073    
## ChordG#/Ab     186222     426383   0.437 0.662356    
## Liveness       817977     603277   1.356 0.175335    
## Danceability -2438466     653855  -3.729 0.000199 ***
## Energy        -335602     575606  -0.583 0.559952    
## Tempo            5656       2914   1.941 0.052454 .  
## Valence       1108586     428942   2.584 0.009845 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3347000 on 1528 degrees of freedom
## Multiple R-squared:  0.02719,    Adjusted R-squared:  0.017 
## F-statistic: 2.669 on 16 and 1528 DF,  p-value: 0.0003551
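One detail in the output above is that Tempo has a p-value of 0.041 in the analysis of variance table but 0.052 in the coefficient table, while Valence shows 0.009845 in both. A likely explanation is that `anova()` on an `lm` fit tests terms sequentially, in the order they enter the formula, whereas `summary()` tests each coefficient given all the others, so the two agree only for the last term. The sketch below illustrates this on simulated data; the data frame `sim` and its values are made up and are not the `public_clean_spotify` data.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the real data; every value here is made up.
sim <- data.frame(
  Chord        = factor(sample(c("A", "C", "G"), n, replace = TRUE)),
  Danceability = runif(n),
  Tempo        = runif(n, 60, 180),
  Valence      = runif(n)
)
sim$Streams <- 5e6 - 2e6 * sim$Danceability + 1e6 * sim$Valence +
  rnorm(n, sd = 3e6)

# Chord is a factor, so lm() expands it into indicator columns automatically.
fit <- lm(Streams ~ Chord + Danceability + Tempo + Valence, data = sim)

# Sequential F-tests: one p-value per term, then NA for the Residuals row.
seq_p <- anova(fit)[["Pr(>F)"]]

# Marginal t-tests: one p-value per coefficient, each adjusted for the rest.
marg_p <- summary(fit)$coefficients[, "Pr(>|t|)"]

# Valence is the last term entered, so both tests give the same p-value.
print(all.equal(seq_p[4], unname(marg_p["Valence"])))
```

For terms entered earlier, such as Tempo here, the two p-values generally differ, which is why conclusions about individual predictors are usually read from the coefficient table.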

Summary: Danceability, Valence, and Chord D#/Eb have p-values below .05 in the coefficient table, which makes them statistically significant predictors of how many streams a song gets. Note that the Danceability coefficient is negative, so in this model a higher danceability score is associated with fewer streams. Surprisingly, Energy is not statistically significant, and Tempo falls just short (p = 0.052 in the coefficient table, although it is 0.041 in the analysis of variance table). I was a little surprised by this because I would have thought that both energy and tempo would matter for a song getting played more than other songs. The R squared of 0.02719 indicates that the independent variables (the song attributes in the model) explain very little of the dependent variable (Streams): about 2.7% of the variance. While that is quite small, I still find it notable that attributes of the song alone explain any of the variance in streams, since a large reason a song gets streamed is the artist or band who performs it, and that is not captured in this model.
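As a check on the reported fit statistics, the adjusted R squared and the F-statistic can be recomputed from the R squared and the degrees of freedom in the output. This assumes the usual definitions for a least squares model with an intercept, with p = 16 estimated slopes and 1528 residual degrees of freedom, so that n = 1545 songs.

```latex
\[
n = 1528 + 16 + 1 = 1545
\]
\[
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.02719)\,\frac{1544}{1528} \approx 0.017
\]
\[
F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}
  = \frac{0.02719 / 16}{0.97281 / 1528} \approx 2.669
\]
```

Both values match the output (Adjusted R-squared 0.017, F-statistic 2.669 on 16 and 1528 DF), which shows that the overall significance of the model comes from the large sample rather than from a large share of variance explained.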

In conclusion

This project has allowed me to explore my interest in Spotify, song data, my own listening history, and what makes a song popular and why. I hope you enjoyed my deep dive into popular Spotify song data from 2020-2021 and learning a little more about my own listening history and statistics. If this interests you, you can learn more about how to use the Spotify API at the link below. Link: https://www.rdocumentation.org/packages/spotifyr/versions/2.1.1