Semester Project Write Up
Background
I decided to do my semester project on the most streamed songs on Spotify, with release years ranging from about the 1930s to the 2020s. Spotify is a music streaming app that was created in 2006 and contains millions of songs, podcasts, and audiobooks. The platform offers free services as well as paid subscriptions for extra features.
The data set I chose for this project came from the website Kaggle and is in the form of a CSV file. The file contains columns such as release year, release date, number of streams, and placement in the Apple, Spotify, Deezer, and Shazam charts. It also includes audio features such as danceability, liveness, acousticness, and valence, which Spotify calculates algorithmically. The file contains 954 rows and 25 columns.
Data Sheet and EDA
I began by cleaning the data. I filtered out what I didn’t need and made sure that all of the values were in good order. Once I did this, I began the EDA. I looked first at the frequency of streams for each of the songs. The graph is extremely hard to read because each song’s stream count is in the millions, but it is still somewhat interesting. I also calculated the mean, median, and mode of the number of streams.
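The cleaning and summary steps described above can be sketched roughly as follows. This is only an illustration in plain Python, not the code used for the project: the column name `streams`, the rule of skipping non-numeric values, and the toy rows are all assumptions.

```python
import csv
import io
import statistics

# Toy CSV text standing in for the Kaggle file. The column names and
# every value here are made up for illustration.
SAMPLE_CSV = """track_name,streams
Song A,1000000
Song B,2500000
Song C,not available
Song D,2500000
Song E,4000000
"""


def load_streams(csv_text):
    """Return stream counts as integers, skipping rows that are not numeric."""
    streams = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row["streams"].strip()
        if value.isdigit():  # assumed cleaning rule: keep whole numbers only
            streams.append(int(value))
    return streams


def summarize(values):
    """Compute the mean, median, and mode of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


streams = load_streams(SAMPLE_CSV)
summary = summarize(streams)
print(summary)
```

Because stream counts are so large, dividing them by one million before plotting a histogram may make the axis easier to read.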
Next, I compared the danceability percentage of each song with its number of streams. I used a scatterplot so that more of the data would be visible and a bit more spread out. It was interesting to see the distribution of points. Some songs with higher danceability ratings had higher numbers of streams.
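One way to put a number on the pattern in the scatterplot is a correlation coefficient. The sketch below computes Pearson's r by hand. It is an addition for illustration, not a test that was run in the project, and the values are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only, not taken from the data set.
danceability = [40, 55, 60, 72, 85]
streams_millions = [300, 280, 510, 450, 620]

r = pearson_r(danceability, streams_millions)
print(round(r, 3))
```

A value near 0 would mean danceability says little about streams, while a value near 1 or -1 would indicate a strong linear relationship.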
I also wanted to look specifically at certain artists from the data file and compare them to each other. I looked at Taylor Swift, Arctic Monkeys, Coldplay, Ariana Grande, Olivia Rodrigo, and Harry Styles. Each of these artists is extremely popular today, which is why I chose to focus on them. I also wanted to focus on them because I listen to some of their music. The first code I ran for the artists compared the danceability and liveness ratings of Taylor Swift, Arctic Monkeys, and Coldplay songs. I used a scatterplot, which gave me a nice visual of the data. Taylor Swift had the most songs of these three artists in the data sheet, but only a few of her songs were high in danceability or liveness. The rest of her songs had about the same ratings as the Arctic Monkeys and Coldplay songs.
Next, I compared the rank of Taylor Swift, Coldplay, and Arctic Monkeys in Apple charts versus Spotify charts. The results for both were pretty close, with some outliers. The distributions on the boxplots were fairly different for each of the artists in both Spotify and Apple charts. The t-test showed a significant difference between the distributions in Apple and Spotify charts.
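For reference, a two-sample t statistic like the one mentioned above can be computed as in the sketch below. The write-up does not say which variant of the t-test was used, so Welch's version (which does not assume equal variances) is an assumption here, and the rank values are made up.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Made-up chart ranks for illustration only.
apple_ranks = [1, 2, 3, 4, 5]
spotify_ranks = [2, 4, 6, 8, 10]

t_stat = welch_t(apple_ranks, spotify_ranks)
print(round(t_stat, 4))
```

This gives only the t statistic. Turning it into a p-value requires the t distribution, which a statistics library such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) handles directly.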
The last tests I ran for these three artists compared the number of their songs on the charts with the songs’ release years. I chose to make three different graphs in order to look at the data in different visual ways. From these graphs, we can see how Arctic Monkeys increasingly dominated the charts from 2006 to about 2013, while Taylor Swift had fluctuating chart numbers across her release years. Coldplay had a slow beginning, but eventually began to rise in the charts.
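The counts behind graphs like these can be built by tallying songs per release year for each artist. The sketch below shows one way to do that. The artist names and years are placeholders, not values from the data set.

```python
from collections import Counter

# Hypothetical (artist, release_year) pairs; placeholders only.
songs = [
    ("Artist A", 2010),
    ("Artist A", 2010),
    ("Artist A", 2013),
    ("Artist B", 2008),
    ("Artist B", 2013),
    ("Artist A", 2008),
]


def songs_per_year(rows, artist):
    """Count how many of an artist's songs fall in each release year."""
    counts = Counter(year for name, year in rows if name == artist)
    return dict(sorted(counts.items()))  # sort by year for plotting


artist_a_counts = songs_per_year(songs, "Artist A")
print(artist_a_counts)
```

The sorted dictionary can be passed straight to a line or bar chart, with years on the x-axis and counts on the y-axis.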
I also looked at three other artists: Ariana Grande, Harry Styles, and Olivia Rodrigo. I completed an artist summary, which showed all of their high, medium, or low hits in the charts. None of the artists had medium hits. I formatted this as a table, as well as a stacked bar graph. The graph showed that Harry Styles had the most songs with high hits, and Ariana Grande had the lowest number of hits. This may be because of the number of songs in the charts for each person; Harry Styles had more songs than both Ariana Grande and Olivia Rodrigo.
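An artist summary of this kind could be built along the lines of the sketch below. The write-up does not state how high, medium, and low hits were defined, so the cutoffs (`high_cutoff`, `low_cutoff`), the use of a chart count as the measure, and all of the rows are hypothetical.

```python
def tier(chart_count, high_cutoff=50, low_cutoff=10):
    """Label a song as a high, medium, or low hit (cutoffs are assumptions)."""
    if chart_count >= high_cutoff:
        return "high"
    if chart_count < low_cutoff:
        return "low"
    return "medium"


def summarize_artists(rows):
    """Count high, medium, and low hits for each artist."""
    summary = {}
    for artist, chart_count in rows:
        counts = summary.setdefault(artist, {"high": 0, "medium": 0, "low": 0})
        counts[tier(chart_count)] += 1
    return summary


# Hypothetical (artist, chart_count) rows; placeholders only.
rows = [
    ("Artist A", 120), ("Artist A", 85), ("Artist A", 4),
    ("Artist B", 60), ("Artist B", 2), ("Artist B", 7),
    ("Artist C", 3),
]

artist_summary = summarize_artists(rows)
for artist, counts in artist_summary.items():
    print(f"{artist:10} high={counts['high']} medium={counts['medium']} low={counts['low']}")
```

Each row of the printed table corresponds to one bar in a stacked bar graph, with the three counts as the stacked segments.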
Finally, I did some overall digging into the data. I decided to make a cluster graph comparing the danceability rating of every song in the data set with its release year. This graph was especially interesting because it shows that more recent songs have higher danceability ratings than songs released from the 1930s to the 1990s. The ratings begin to pick up in the early 2000s, and the densest clustering appears from about 2015 to 2025. I also added a line of best fit on top of the clustering to show how the rating has grown or stayed the same over the years. I was also interested in the energy category in the data set. I decided to look at the energy levels for each decade and found that there were extremely high energy levels in songs from the 1930s, no energy levels in songs from the 1940s (songs from that decade may not have been included in this data set), and a steady energy level from the 1950s to the 2020s.
Overall, it was interesting to see all of the different ratings and whether they were significant or not. It was also interesting to see how these ratings differed for songs over the years. I liked seeing how people thought certain songs were easy to dance to, had a lot of energy, or were lively or not.
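The line of best fit and the per-decade energy averages described above can be sketched as follows. This is a minimal illustration using ordinary least squares and a simple grouping by decade. The function names and all values are made up, not taken from the project.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_by_decade(years, values):
    """Average a value (for example, energy) within each release decade."""
    groups = defaultdict(list)
    for year, value in zip(years, values):
        groups[year // 10 * 10].append(value)  # 1958 -> 1950
    return {decade: sum(v) / len(v) for decade, v in sorted(groups.items())}


# Made-up release years and ratings for illustration only.
years = [1931, 1955, 1958, 2021]
energy = [80, 50, 60, 70]

slope, intercept = fit_line([0, 1, 2], [1, 3, 5])
decade_means = mean_by_decade(years, energy)
print(slope, intercept)
print(decade_means)
```

Note that a decade with no songs simply does not appear in the result. A gap like the 1940s in the energy graph would therefore more likely mean that no songs from that decade are in the data set than that their energy was zero.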