This semester I have done a lot of work analyzing song data from Spotify, which is extremely interesting to me. This interest comes from my love of analytics and my love of music. Every year I look forward to my Spotify Wrapped and to seeing my listening statistics for the year and how they have changed from the year before. I have really enjoyed taking a deep dive into my own Spotify data and some of the world's most popular songs and artists.
After I learned how to use the Spotify API, I was even able to analyze my own listening data. I am extremely curious about how my listening habits and favorite songs compare to the most popular songs from 2020 through the end of 2021. I want to see how much overlap there is and what stands out when I compare the two sets of data. Along with an overall comparison between data sets, I am interested in what makes a song get streamed more than others. Is it the chord it's written in? Is it the danceability of the song? I will explore all of these topics in this document.
My structured data set is from Kaggle and offers a lot of interesting data on the most popular songs from 2020 through 2021. This includes the artist name, the album the song is on, how many streams the song has, the genre of music, the danceability of the song, the chord the song was written in, and more. All of these variables will help me complete my analysis of what makes a song popular and how my own listening history compares to the most popular songs. Below is a complete data dictionary of the data set. Link to the data set:
| Variable | Description |
|---|---|
| Index | Unique id for each song in the data set |
| Highest Charting Position | Highest position the song has hit on the charts |
| Number of Times Charted | How many times the song has charted |
| Week of Highest Charting | The week the song had its highest charting position |
| Song Name | Name of the song |
| Streams | How many times the song has been streamed |
| Artist | Artist Name |
| Artist Followers | How many followers the artist has |
| Song ID | Song’s unique ID |
| Genre | Genre of the song |
| Release Date | Release date of the song |
| Weeks Charted | How many weeks the song charted |
| Popularity | Popularity score of the song |
| Danceability | How danceable the song is |
| Energy | Measure of the energy of the song |
| Loudness | Measure of how loud the song is |
| Speechiness | Measure of how much of the song is spoken |
| Acousticness | Measure of how acoustic the song is |
| Liveness | Measures the presence of an audience in a song |
| Tempo | Tempo of the song |
| Duration (ms) | How long the song is, in milliseconds |
| Valence | Describes the musical positiveness in a song |
| Chord | Chord the song was written in |
The table below summarizes some of the important characteristics of the data, grouped by artist or band. For each artist, it lists the average follower count, the maximum and minimum streams for a song, the average song energy, the average song danceability, the maximum and minimum song tempo, and the chord the artist's songs are most often written in. This provides a nice overview of the most popular songs on Spotify from the 2020-2021 data set.
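A grouped summary like this can be built in base R. The sketch below is only an illustration: the `songs` data frame and all of its values are made up, the column names are assumed to roughly follow the data dictionary, and `most_common` is a hypothetical helper, not something from the original analysis.

```r
# Toy stand-in for the Kaggle data; every value here is invented.
songs <- data.frame(
  Artist          = c("A", "A", "A", "B", "B"),
  ArtistFollowers = c(100, 100, 100, 50, 50),
  Streams         = c(5e6, 7e6, 6e6, 4e6, 9e6),
  Energy          = c(0.6, 0.8, 0.7, 0.4, 0.5),
  Danceability    = c(0.5, 0.7, 0.6, 0.3, 0.9),
  Tempo           = c(120, 95, 140, 100, 130),
  Chord           = c("C", "C", "G", "D", "D")
)

# Hypothetical helper: most frequent value (alphabetically first on ties).
most_common <- function(x) names(which.max(table(x)))

# Split the rows by artist and build one summary row per artist.
artist_summary <- do.call(rbind, lapply(split(songs, songs$Artist), function(d) {
  data.frame(
    Artist          = d$Artist[1],
    AvgFollowers    = mean(d$ArtistFollowers),
    MaxStreams      = max(d$Streams),
    MinStreams      = min(d$Streams),
    AvgEnergy       = mean(d$Energy),
    AvgDanceability = mean(d$Danceability),
    MaxTempo        = max(d$Tempo),
    MinTempo        = min(d$Tempo),
    TopChord        = most_common(d$Chord)
  )
}))

print(artist_summary)
```

The same result could be produced with a grouping package; base R is used here only to keep the sketch dependency-free.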
Summary: Looking at the graph above, you can see that there really isn't a strong correlation between the danceability and the energy of a song. The points are scattered across the graph with no clear pattern. I was a little surprised to see this because my initial assumption was that the higher the energy of a song, the higher its danceability score would be, since the majority of dance music is upbeat and high energy.
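A visual impression like "no strong correlation" can be backed up with a number. The sketch below uses simulated values, not the real data set, and simply shows how a Pearson correlation and its p-value could be computed for two song features.

```r
# Simulated audio features; NOT the real Spotify data.
set.seed(1)
danceability <- runif(200)
energy       <- runif(200)   # drawn independently of danceability

# Pearson correlation: always between -1 and 1, near 0 means little linear relationship.
r <- cor(danceability, energy)

# cor.test() adds a significance test for the same correlation.
test <- cor.test(danceability, energy)

cat("r =", round(r, 3), " p-value =", round(test$p.value, 3), "\n")
```

With the real data, the two vectors would be the Danceability and Energy columns of the data set.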
Summary: The boxplot above shows the distribution of popularity for each chord. It is interesting to look at because some chords are definitely used more than others, but there isn't a standout chord that scores much higher than the rest. All of the chords average right around 70 to 75 in terms of popularity, which is somewhat to be expected since artists write songs in all different chords. Some artists favor certain chords over others, but with a large number of artists in the data, no one chord stands out more than the next.
Summary: The visual above shows each song's tempo against how many streams the song has. I removed any song with over 10 million streams because the majority of songs in the data set have fewer than 10 million, so this makes the bulk of the data easier to see. There doesn't seem to be a strong correlation between a song's tempo and how many streams it has. Looking at the graph, though, we can see a large number of songs with a tempo between 100 and 150 and fewer than 6 million streams.
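The filtering step described here is a simple row subset. The sketch below runs on a tiny made-up data frame (the values are invented) and shows one way to drop songs above 10 million streams and then measure how many of the remaining songs fall in the 100 to 150 tempo, under 6 million streams cluster.

```r
# Invented example rows; the real data has many more songs.
songs <- data.frame(
  Tempo   = c(90, 120, 135, 150, 160, 110),
  Streams = c(5.2e6, 4.8e6, 12e6, 5.9e6, 7e6, 30e6)
)

# Keep only songs with at most 10 million streams.
under_10m <- songs[songs$Streams <= 10e6, ]

# Flag the cluster: tempo between 100 and 150, fewer than 6 million streams.
in_cluster <- under_10m$Tempo >= 100 & under_10m$Tempo <= 150 &
  under_10m$Streams < 6e6

cat("Songs kept:", nrow(under_10m), "\n")
cat("Share in cluster:", mean(in_cluster), "\n")
```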
Summary: The visual above shows the distribution of songs based on the chord each was written in and the liveness of the song. Chords A#/Bb and D have the highest average Liveness. However, all of the chords have many outliers that sit well above the typical Liveness score. It would be interesting to take this one step further and see if Energy or Danceability play a part in the song distribution, because live songs often have a lot of energy behind them and can be very danceable.
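The per-chord averages and outliers can also be pulled out as numbers. In the sketch below the data is invented, and the outlier rule is the standard boxplot one (points more than 1.5 times the box height beyond the box). Note that the center line of a boxplot is the median, so the mean is computed separately.

```r
# Invented liveness scores for two chords.
songs <- data.frame(
  Chord    = c("D", "D", "D", "D", "D", "G", "G", "G", "G", "G"),
  Liveness = c(0.10, 0.12, 0.11, 0.13, 0.90, 0.08, 0.09, 0.10, 0.11, 0.12)
)

# Mean liveness for each chord.
avg_liveness <- tapply(songs$Liveness, songs$Chord, mean)

# Points a boxplot would draw as outliers, per chord.
outliers <- lapply(split(songs$Liveness, songs$Chord),
                   function(x) boxplot.stats(x)$out)

print(avg_liveness)
print(outliers)
```

In this toy example the single live-sounding track pulls the mean for chord D far above its median, which is the same effect a handful of live recordings could have in the real data.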
Summary: The plot above depicts the valence of a song against how many streams the song got. There is no visible relationship between the valence of a song and how many streams it gets. A large cluster of songs has between 4 million and 6 million streams and a valence score between 0.25 and 0.75. It is interesting to see that songs that are less positive and happy get about as many streams as songs with a much higher valence score.
Summary: The table above is drawn from the same data set of popular songs, but it shows the 10 songs with the lowest number of streams. I was curious about which songs and artists would make this list. Interestingly enough, there were still some pretty well-known songs on it, like Lizzo's “Good as Hell”, Bazzi's “I.F.L.Y.” and Billie Eilish's “No Time To Die”. People are usually most curious about the songs with the most streams, but it was very interesting to look at the data the opposite way and see the least streamed songs among the most popular ones. Keep reading, as the next table explores the top 10 most streamed songs.
Summary: The table above shows the top 10 songs by streams in descending order, how many streams each song has, and the artist's name. As an avid music listener, I was surprised to see two Lil Nas X songs within the top 10. However, songs like “good 4 u” and “Bad Habits” took the world by storm, so they are to be expected among the most streamed and most popular songs between the start of 2020 and the end of 2021. I limited the table to just the top 10 songs because I thought it would offer a nice comparison with the table of the 10 songs that are still extremely popular but have the lowest number of streams.
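Both tables come from the same sort. A minimal sketch, assuming a data frame with `SongName`, `Artist`, and `Streams` columns (the rows below are placeholders, not real chart data):

```r
# Placeholder rows; song names and stream counts are invented.
songs <- data.frame(
  SongName = c("song_a", "song_b", "song_c", "song_d", "song_e"),
  Artist   = c("artist_1", "artist_2", "artist_1", "artist_3", "artist_2"),
  Streams  = c(4.2e6, 48e6, 4.5e6, 37e6, 9e6)
)

n <- 2  # the report uses 10; 2 keeps the toy output short

# Sort once, most streamed first.
by_streams <- songs[order(songs$Streams, decreasing = TRUE), ]

top_n    <- head(by_streams, n)   # most streamed songs
bottom_n <- tail(by_streams, n)   # least streamed songs (still descending)

print(top_n)
print(bottom_n)
```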
My secondary data set is pulled from my own Spotify account and listening history using the Spotify API. This allows me to see my listening history in real time and shows me some pretty interesting statistics on my favorite songs, albums, artists, and podcast shows too. You are also able to get short term and long term listening habits. For example, maybe in the past 2 months I have been listening to a lot of Olivia Rodrigo songs, but over the past 8 months to a year I have consistently been listening to Chelsea Cutler a great deal. With the Spotify API, the Olivia Rodrigo case would be considered short term and the Chelsea Cutler case long term. In the next few tables and visuals I will analyze my own listening history and compare the top songs from 2020-2021 to my own favorite songs in my Spotify library.
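For readers who want to try this, the sketch below shows roughly how short term and long term favorites could be requested through the `spotifyr` package. `fetch_my_top` is a hypothetical wrapper written for this example, and the call inside it reflects my understanding of `spotifyr::get_my_top_artists_or_tracks()`; check the package documentation before relying on it. Actually running it requires the package to be installed and your own Spotify developer credentials.

```r
# Hypothetical wrapper: validates the arguments, then asks spotifyr for the
# logged-in user's top artists or tracks. Assumes SPOTIFY_CLIENT_ID and
# SPOTIFY_CLIENT_SECRET are set as environment variables.
fetch_my_top <- function(type = c("artists", "tracks"),
                         time_range = c("short_term", "medium_term", "long_term"),
                         limit = 20) {
  type       <- match.arg(type)        # stops early on an unknown type
  time_range <- match.arg(time_range)  # stops early on an unknown time range
  spotifyr::get_my_top_artists_or_tracks(
    type       = type,
    time_range = time_range,
    limit      = limit
  )
}

# Example calls, commented out because they trigger a browser login:
# recent_tracks    <- fetch_my_top("tracks",  "short_term")
# longterm_artists <- fetch_my_top("artists", "long_term")
```

The exact length of each time range is decided by Spotify, so the 2 month and 8 month figures in my example above are only a rough way to think about the difference.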
Summary: The table above shows my recent, short term top 20 songs, their respective artists, and the album each song is on. This is pretty interesting for me to look at and realize how much I listen to some songs compared to others. It is definitely one of those situations where you think you listen to one song all the time, but in reality you aren't playing it all that much. I can already see some overlap between some of my recent favorite artists and artists who had songs in the top tracks from the 2020-2021 data set.
Summary: The table above shows my top 20 long term favorite artists and the different genres their music falls under. I am not surprised to see Chelsea Cutler, Quinn XCII, and LANY as my top three long term favorite artists because they have all been in my top 5 artists in my Spotify Wrapped for the past 3 or so years. It is also interesting to see some of the genres that my favorite artists fall under, like “electropop”, “indie pop rap”, “pop rock”, and “indie poptimism”.
Summary: The table above has my top genres, based on my top 10 artists and which genres their music falls under. Looking at the list, there is no overlap between any of the artists, but there is some overlap in certain genres, mostly in various types of pop music. The biggest difference is that most of the genres I listen to are some combination of indie pop, pop rock, and electronic pop, whereas the top 10 artists in the Spotify data make a lot of Latin music or pop music tied to a specific country. I thought this was interesting because artists typically gain popularity in the country they are from, so it makes sense that a lot of the artists in the Spotify data set have specific countries listed in their genres: that's where they are from or where they were discovered.
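The overlap check itself is a set comparison. In the sketch below, the first four entries of `my_genres` are genres mentioned above, while "pop" and everything in `chart_genres` are placeholders I made up for illustration.

```r
# Genres from my top artists (first four appear in my data; "pop" is a placeholder).
my_genres <- c("electropop", "indie pop rap", "pop rock", "indie poptimism", "pop")

# Placeholder genres standing in for the top chart artists.
chart_genres <- c("latin", "reggaeton", "pop", "dance pop", "k-pop")

shared     <- intersect(my_genres, chart_genres)  # genres on both lists
only_mine  <- setdiff(my_genres, chart_genres)    # genres only on my list
only_chart <- setdiff(chart_genres, my_genres)    # genres only on the chart list

# Looser match: any chart genre that contains the word "pop".
pop_like <- chart_genres[grepl("pop", chart_genres, fixed = TRUE)]

print(shared)
print(pop_like)
```

Exact matching finds very little overlap because Spotify genre labels are so specific, which is why the looser "contains pop" match gives a better sense of the similarity.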
Hypothesis: After diving deep into both the Spotify data set and my own personal listening statistics, I am curious about which variables go into making a song have a high number of streams. I excluded things like the artist and the release date of the song because they are not measurable attributes of the song itself, even though in a non-statistical sense they could have a large impact on how many streams a song gets. I wanted to focus on the specific attributes of the song: its chord, liveness, danceability, energy, tempo, and valence. I think it will be interesting to see whether these variables are statistically significant in making a song popular, meaning it has a high number of streams.
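The generalized regression equation is:
\[Streams_i = \alpha + \sum_{k=1}^{11} \gamma_k \, Chord_{k,i} + \beta_1 Liveness_i + \beta_2 Danceability_i + \beta_3 Energy_i + \beta_4 Tempo_i + \beta_5 Valence_i + \varepsilon_i\]
Here \(i\) indexes songs, \(Chord_{k,i}\) is an indicator that equals 1 when song \(i\) is written in the \(k\)-th non-baseline chord, and \(\varepsilon_i\) is the error term.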
## Analysis of Variance Table
##
## Response: Streams
## Df Sum Sq Mean Sq F value Pr(>F)
## Chord 11 2.0918e+14 1.9016e+13 1.6979 0.068297 .
## Liveness 1 3.4313e+13 3.4313e+13 3.0637 0.080258 .
## Danceability 1 1.1026e+14 1.1026e+14 9.8447 0.001736 **
## Energy 1 2.8888e+12 2.8888e+12 0.2579 0.611613
## Tempo 1 4.6785e+13 4.6785e+13 4.1774 0.041138 *
## Valence 1 7.4808e+13 7.4808e+13 6.6795 0.009845 **
## Residuals 1528 1.7113e+16 1.1200e+13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Streams ~ Chord + Liveness + Danceability + Energy +
## Tempo + Valence, data = public_clean_spotify)
##
## Standardized Coefficients::
## (Intercept) ChordA#/Bb ChordB ChordC ChordC#/Db ChordD
## 0.000000000 -0.046873498 0.016144512 -0.028841717 -0.002134118 0.001635285
## ChordD#/Eb ChordE ChordF ChordF#/Gb ChordG ChordG#/Ab
## 0.070531286 -0.033546718 -0.049022629 -0.014224794 -0.020143291 0.015320295
## Liveness Danceability Energy Tempo Valence
## 0.034913409 -0.102904765 -0.016064914 0.049584911 0.074660733
##
## Call:
## lm(formula = Streams ~ Chord + Liveness + Danceability + Energy +
## Tempo + Valence, data = public_clean_spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3662075 -1442994 -866778 102938 41805029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6944265 692791 10.024 < 2e-16 ***
## ChordA#/Bb -586487 432746 -1.355 0.175532
## ChordB 189167 419066 0.451 0.651764
## ChordC -323938 409348 -0.791 0.428863
## ChordC#/Db -20847 383922 -0.054 0.956704
## ChordD 20235 429748 0.047 0.962451
## ChordD#/Eb 1498639 613044 2.445 0.014614 *
## ChordE -436547 442510 -0.987 0.324032
## ChordF -593813 424955 -1.397 0.162510
## ChordF#/Gb -178654 434136 -0.412 0.680753
## ChordG -239894 421205 -0.570 0.569073
## ChordG#/Ab 186222 426383 0.437 0.662356
## Liveness 817977 603277 1.356 0.175335
## Danceability -2438466 653855 -3.729 0.000199 ***
## Energy -335602 575606 -0.583 0.559952
## Tempo 5656 2914 1.941 0.052454 .
## Valence 1108586 428942 2.584 0.009845 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3347000 on 1528 degrees of freedom
## Multiple R-squared: 0.02719, Adjusted R-squared: 0.017
## F-statistic: 2.669 on 16 and 1528 DF, p-value: 0.0003551
Summary: In the coefficient table, Danceability (p = 0.0002), Valence (p = 0.0098), and Chord D#/Eb (p = 0.0146) all have a p-value below .05, which makes these variables statistically significant in this model. The direction matters too: Danceability has a negative coefficient, so songs with a higher danceability score tend to have fewer streams when the other variables are held constant, while Valence and Chord D#/Eb (compared with the baseline chord) have positive coefficients. Surprisingly, the Energy of the song is not statistically significant (p = 0.56), and Tempo just misses the cutoff in the coefficient table (p = 0.052), although it comes in at p = 0.041 in the ANOVA table, which tests the variables in the order they were entered. I was a little surprised by this because I would have thought that both energy and tempo would matter for a song getting played more than other songs. The R squared of 0.02719 (adjusted R squared 0.017) indicates that the independent variables (everything in the model that could affect streams) explain very little of the dependent variable (Streams): less than 3% of its variance. The overall F-test is significant (p = 0.0004), so the model is picking up a real signal, but it is a small one. That is not too surprising, since a large reason a song gets streamed is the artist or band who sings it, and the artist is not included in this model.
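For anyone who wants to reproduce this kind of output, the sketch below fits a similar model on simulated data. The `sim` data frame and every number in it are invented, so the results will not match the tables above. It shows where the p-values and R squared come from, why the last variable in the ANOVA table has the same p-value as in the coefficient table, and how a standardized coefficient can be computed by hand.

```r
# Simulated stand-in for public_clean_spotify; all numbers are invented.
set.seed(42)
n <- 300
sim <- data.frame(
  Chord        = factor(sample(c("A", "C", "D#/Eb", "G"), n, replace = TRUE)),
  Danceability = runif(n),
  Tempo        = runif(n, 60, 200),
  Valence      = runif(n)
)
sim$Streams <- 6e6 - 2e6 * sim$Danceability + 1e6 * sim$Valence +
  rnorm(n, sd = 3e6)

fit <- lm(Streams ~ Chord + Danceability + Tempo + Valence, data = sim)

coefs   <- summary(fit)$coefficients  # estimate, std. error, t value, p-value
r2      <- summary(fit)$r.squared     # share of variance explained
aov_tab <- anova(fit)                 # sequential F tests, in the order entered

# For the last single-variable term, the sequential F equals the squared t.
f_last <- aov_tab["Valence", "F value"]
t_last <- coefs["Valence", "t value"]
print(all.equal(f_last, t_last^2))

# Standardized slope for a numeric predictor: b * sd(x) / sd(y).
beta_dance <- coefs["Danceability", "Estimate"] *
  sd(sim$Danceability) / sd(sim$Streams)

cat("R squared:", round(r2, 4), "\n")
cat("Standardized Danceability slope:", round(beta_dance, 4), "\n")
```

The same relationship is visible in my real output: Valence is the last variable entered, and its p-value is 0.009845 in both tables, while Tempo, which is entered earlier, differs between them.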
This project has allowed me to explore my interest in Spotify, song data, my own listening history, and what makes a song popular and why. I hope you enjoyed my deep dive into popular Spotify song data from 2020-2021 and learning a little bit more about my own listening history and statistics. If this is of interest to you, you can learn more about how to use the Spotify API at the link below. Link: https://www.rdocumentation.org/packages/spotifyr/versions/2.1.1