INTRODUCTION

Nearly everyone listens to music, whether through an old Sony Walkman, an MP3 player, the radio, or one of the many now-popular streaming platforms such as Spotify or Apple Music. Our datasets contain Spotify track data, which we used to explore trends and potential relationships in listening habits. We divided our analysis into two primary avenues to better explore Spotify listening trends.

The first avenue explores trends among the top Spotify songs from 2010 to 2023. It doesn’t take much listening to realize that popular music has changed significantly over time; just listen to any ABBA from the 70s, Eminem from the early 2000s, or The Weeknd today. It is fairly evident that top songs and genres change over time, but how they change is not as clear. Thus our first question is how have the characteristics of top Spotify songs changed throughout the years 2010 to 2023? Are there any noticeable trends in these changes? The answers to this question could provide valuable insights for record labels, marketing companies, artist managers, producers, songwriters, and vocalists. Understanding trends in popular music is crucial to upcoming artists trying to gain popularity or already popular artists trying to remain at the top of the charts. Adapting to these lucrative characteristics found in top Spotify songs could help drive streams, recognition, and revenue.

Our second avenue of analysis aims to classify genres of Spotify tracks. One of our analysts is a music producer who spent a summer in Chicago studying electronic dance music (EDM) and producing EDM tracks. In doing so, he came to understand the nuances of major EDM subgenres. Even with hundreds of subgenres, most EDM listeners will say the genre overall has a distinct, beat-heavy sound composed of undulating synths and massive drops. Because of EDM’s distinct style, we wanted to investigate whether we could predict if a song is EDM. Given the “Spotify Tracks Dataset,” which contains data on approximately 114,000 random songs from Spotify, can we develop a classification model to determine whether a specific song is EDM?

Platforms such as Spotify or Apple Music use algorithms to recommend music to customers, and creating a model to classify EDM songs would allow us to begin to understand how Spotify’s recommendation system works. In doing so, we get a first look at the process by which Spotify has built a much-applauded recommendation system for its users. Our model could ultimately provide EDM fans with a more personalized experience if it is able to accurately decide whether a song is EDM or not. Additionally, uncovering the characteristics that distinguish EDM could help EDM producers gain a deeper understanding of the genre and produce more meaningful music.

DATA

The Spotify 2020-2021 dataset contains variables that describe each song. This dataset was created by Sashank Pillai, a contributor on Kaggle. It contains 1556 songs from different genres and eras, along with 23 variables. Basic information variables include year, genre, release date, etc., and song characteristics including energy, danceability, valence, acousticness, tempo, loudness, and speechiness are all scaled between 0 and 1. The dataset contains all songs that appeared in the Spotify Top 200 Weekly Global Charts in 2020 and 2021. However, some of these songs were made years, sometimes even decades, before they reached the Top 200 list. Therefore, we used each song’s release date to determine its year, and later used this year to investigate trends among songs. Below is a correlation plot of all the attributes by which the songs are measured. It gives an idea of which variables are in the dataset and how they are related to one another.

In creating a classification model for our second question, we did not want to use the dataset containing only top Spotify tracks because it would be biased, neglecting the countless songs that never appear in the Spotify Top 200 charts. Instead, we used the “Spotify Tracks Dataset,” also obtained from Kaggle. This dataset contains 114,000 songs over a range of 125 different genres, with data pulled using the Spotify API. Because of this, the variables of interest are similar to those in our earlier dataset: popularity, duration, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo. This dataset also has additional variables useful for classification, such as whether a song is explicit, the key of the song, mode (major or minor), and time signature. We created a new indicator variable ‘is_edm’, which is 1 if a song is part of the EDM genre group and 0 if it is not. Later, we explain the reason and criteria for creating an EDM genre group. The boxplots below show the distributions of popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, and valence for EDM vs non-EDM songs. While there seem to be some differences, such as higher median popularity, danceability, energy, and loudness, and lower acousticness for EDM songs, the boxplots all overlap, so we cannot draw any conclusions from them alone.

RESULTS

Question 1

We begin with our first question, which we address through visualization. It is fairly obvious that top songs and genres change over time, but it is less clear how they change. How do the characteristics of top Spotify songs change between 2010 and 2023? Are there apparent patterns in which attributes are more prevalent in certain time periods than others? Is there a decline in certain genres or features?

To visualize listening history on Spotify in 2020-2021, we divided the dataset into five time periods by release year, which lets us examine long-term trends in musical preferences. The range from 1942 to 2021 encompasses a vast array of musical evolutions, and segmenting this timeline into five parts (1942-2009, 2010-2015, 2016-2019, 2020, and 2021) lets each interval represent a key phase in music history. These could correspond to notable developments such as the birth and maturation of rock and roll, the advent of digital music, the streaming revolution, and the global pandemic’s impact on music consumption.

We chose an animated bar chart because it lets us observe the dynamic progression of genre popularity across these defined eras. This form of visualization is well suited to illustrating temporal changes because it conveys not only genre popularity at given points, but also the transition between these points, providing a narrative of how listener preferences change across genres and years.

For rock, pop, Latin, and hip-hop, apart from the effect of the pandemic in 2020, listeners tend to prefer songs released closer to the present. Rap music is different: rap released from 1942 to 2015 is more popular than rap released later. In overall popularity, Latin music ranks highest in every period except 2010-2015, for which data are missing.

As pop music has consistently been one of the most popular genres over time, we decided to investigate how its defining attributes have changed throughout the decades. The radar charts give a visual exploration of this evolution, looking at the standardized attributes of pop music from 1963 to 2021. These attributes include Energy, Danceability, Tempo, Valence, and Loudness, which are critical in determining the overall “liveliness” of the music. By calculating a confidence interval (a lower and an upper bound) for each attribute, we can observe both the general liveliness and the variability of songs in different eras. From a broad perspective, pop music from the expansive era of 1963 to 2009 shows a wide diversity in emotional content and energy, with a significant contraction in variability in the following years. By 2020 and 2021, the data suggest a convergence towards a more standardized pop sound, with songs closely adhering to a specific set of energetic and emotional characteristics.
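To make the interval calculation concrete, here is a minimal Python sketch of one way to compute a mean and an approximate 95% confidence interval for a standardized attribute within each era. The valence values, the `mean_ci` helper, and the normal-approximation formula are all assumptions for illustration; the exact interval method behind the radar charts is not specified in this paper, and the analysis itself appears to have been done in R.

```python
import math
from statistics import mean, stdev

def mean_ci(values, z=1.96):
    """Return (lower, mean, upper) for an approximate 95% CI of the mean."""
    m = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m - half_width, m, m + half_width

# Hypothetical standardized valence scores for pop songs in two eras.
valence_by_era = {
    "1963-2009": [-1.2, 1.5, 0.1, 0.9, -0.8, 1.1],
    "2021": [0.10, 0.25, 0.05, 0.20, 0.15, 0.12],
}

intervals = {era: mean_ci(vals) for era, vals in valence_by_era.items()}
for era, (lower, m, upper) in intervals.items():
    width = upper - lower
    print(f"{era}: mean={m:.2f}, CI=({lower:.2f}, {upper:.2f}), width={width:.2f}")
```

A narrower interval for an era corresponds to the "contraction in variability" described above, although interval width also shrinks as the number of songs in an era grows.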

After cleaning the 2020-2021 Spotify dataset, we created several pie charts showing the changes in genre distribution over time. We used a for loop combined with a string detection function to aggregate detailed genres into the more general genres of “hip hop”, “latin”, “pop”, “rock”, and “other.” These general genres were selected because they each had at least 30 occurrences in both 2020 and 2021. Since some songs were released years if not decades before they charted on the Top 200 list, we decided to use the “Release Year” variable instead of the “Weeks Charted” variable to better represent the music of each era. To better visualize trends, we organized the years into the following bins: “1942-2009”, “2010-2015”, “2016-2019”, “2020”, and “2021.” The pie charts based on these bins show how the distribution of genres in the top songs dataset changes over time. We can draw a few meaningful conclusions from this visual. For one, rock and “other” songs released in 2009 or earlier tend to make a resurgence in the Spotify 2020-2021 Top 200 charts. The same cannot be said for Latin, hip-hop, and pop songs. The proportion of pop has declined between 2010 and 2021, from just over half of all top songs to about one third. Meanwhile, the proportion of Latin songs has slightly increased from 2016 to 2021. The share of hip-hop songs has steadily increased since 2010, from less than 5% in the 2010-2015 bin to about 25% of top songs in the 2021 bin.
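The aggregation step described above can be sketched as follows. This is a minimal Python illustration, not the code used for the paper (which appears to have been written in R): the function names, the order in which broad genres are checked, and the sample rows are all assumptions.

```python
from collections import Counter

# Broad genres in the order they are checked; the order is an assumption
# (e.g. "latin pop" is counted as "latin" because "latin" is checked first).
BROAD_GENRES = ["hip hop", "latin", "pop", "rock"]

def broad_genre(raw_genre):
    """Map a detailed genre label to a broad genre, or 'other' if none match."""
    label = raw_genre.lower()
    for broad in BROAD_GENRES:
        if broad in label:  # plain substring check, like a string-detect call
            return broad
    return "other"

def release_bin(year):
    """Assign a release year to one of the five bins used in the pie charts."""
    if year <= 2009:
        return "1942-2009"
    if year <= 2015:
        return "2010-2015"
    if year <= 2019:
        return "2016-2019"
    return str(year)  # 2020 and 2021 are kept as single-year bins

# Hypothetical rows: (detailed genre, release year).
songs = [("dance pop", 2021), ("atl hip hop", 2021), ("classic rock", 1975),
         ("latin pop", 2018), ("edm", 2020)]

# Count songs per (bin, broad genre); these counts would feed the pie charts.
counts = Counter((release_bin(year), broad_genre(genre)) for genre, year in songs)
print(counts)
```

Because a label such as "latin pop" matches more than one broad genre, the check order determines where it is counted, which is worth keeping in mind when reading the genre shares.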

Question 2

Given the Spotify tracks dataset containing song data of approximately 114,000 random songs from Spotify, can we develop a classification model to determine whether a song is EDM or not? In our initial approach to this question, we considered a binary classification model using logistic regression. We started with a model using the predictors “popularity + duration_ms + explicit + danceability + energy + loudness + as.factor(key) + as.factor(mode) + speechiness + acousticness + instrumentalness + liveness + valence + tempo + as.factor(time_signature),” without any interactions. We then ran 10-fold cross-validation to calculate the model’s sensitivity (its ability to identify true positives) and specificity (its ability to identify true negatives). This basic model had a sensitivity of 0.000 and a specificity of 1.000: it labeled every song as non-EDM, so it never identified an EDM song and was trivially correct on every non-EDM song. We then considered first-order interactions such as popularity * danceability, which could capture the possibility that EDM songs tend to be more popular and danceable, or speechiness * acousticness, which could capture the possibility that EDM songs tend to have fewer words and acoustic instruments. Unfortunately, this first-order model still had a sensitivity of 0.000 and a specificity of 1.000. Considering second and third-order interactions still did not improve the sensitivity, which stayed at 0.000. Because of the complexity of the third-order model, we then ran backwards selection to obtain the “best” model based on AIC, an estimator of prediction error and thus of the relative quality of statistical models. This yet again yielded a sensitivity of 0.000 and a specificity of 1.000. After considering five models, none of which produced any true positive predictions, we considered the possibility that the models were unable to predict EDM because of the relatively small number of EDM tracks compared to over 100,000 other tracks.
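To illustrate why a sensitivity of 0.000 and a specificity of 1.000 go together under heavy class imbalance, here is a small Python sketch. The label counts and the `sensitivity_specificity` helper are hypothetical and chosen only for illustration; they are not the paper's data or code.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Compute (sensitivity, specificity) treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # share of actual positives that were found
    specificity = tn / (tn + fp)  # share of actual negatives that were found
    return sensitivity, specificity

# Hypothetical imbalanced labels: 10 EDM songs (1) among 1,000 tracks.
y_true = [1] * 10 + [0] * 990

# A model whose predicted probabilities never exceed the 0.5 cutoff
# ends up labeling every track as non-EDM (0).
y_pred = [0] * 1000

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.0 1.0
```

Such a model still reaches 99% accuracy on these toy labels, which is why accuracy alone can hide a complete failure to detect the rare class.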
As an alternative approach to improve prediction, we next considered clumping together similar subgenres of EDM. We ran a nearest-neighbors calculation to find the distance from EDM to every other genre, based on the standardized means of each attribute, and obtained a table of the nearest genres, shown below.

track_genre	d
house	0.8686239
electro	1.3889993
alternative	1.8797667
groove	1.9128310
alt-rock	1.9226026
dance	1.9534957
progressive-house	1.9810850
deep-house	2.1204474
electronic	2.2826907
reggae	2.3171973
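The distance calculation behind a table like this can be sketched in Python as follows, assuming Euclidean distance between genre centroids of z-scored attributes. The `genre_distances` function and the toy rows are hypothetical; the real analysis used many more attributes and genres and appears to have been done in R.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def genre_distances(songs, features, target="edm"):
    """Rank genres by Euclidean distance between standardized genre means."""
    # Mean and standard deviation of each feature across all songs.
    stats = {f: (mean([s[f] for s in songs]), pstdev([s[f] for s in songs]))
             for f in features}
    by_genre = defaultdict(list)
    for s in songs:
        # z-score each feature; fall back to 1.0 if a feature is constant.
        z = [(s[f] - stats[f][0]) / (stats[f][1] or 1.0) for f in features]
        by_genre[s["track_genre"]].append(z)
    # Centroid = mean of the standardized features within each genre.
    centroids = {g: [mean(col) for col in zip(*rows)] for g, rows in by_genre.items()}
    ref = centroids[target]
    dists = {g: math.dist(ref, c) for g, c in centroids.items() if g != target}
    return sorted(dists.items(), key=lambda kv: kv[1])

# Hypothetical rows with two attributes only.
songs = [
    {"track_genre": "edm", "energy": 0.90, "acousticness": 0.05},
    {"track_genre": "edm", "energy": 0.80, "acousticness": 0.10},
    {"track_genre": "house", "energy": 0.85, "acousticness": 0.10},
    {"track_genre": "house", "energy": 0.75, "acousticness": 0.05},
    {"track_genre": "classical", "energy": 0.10, "acousticness": 0.95},
    {"track_genre": "classical", "energy": 0.20, "acousticness": 0.90},
]

ranking = genre_distances(songs, ["energy", "acousticness"])
for genre, d in ranking:
    print(f"{genre}\t{d:.4f}")
```

Standardizing first keeps attributes measured on larger scales from dominating the distance.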

Many of the closest genres, such as house, electro, groove, dance, progressive house, deep house, and electronic, make perfect sense as the nearest neighbors of EDM. Some surprising genres to see were alternative, alternative rock, and reggae, but it makes sense that they would be similar in attributes to EDM because of their beat- and instrument-heavy styles. We then aggregated the 19 closest subgenres, which were all subgenres closer than 6 in distance, into an EDM group and reran each of the five models we created earlier. After aggregating subgenres, we were finally able to achieve a sensitivity greater than 0.000, although specificity dropped slightly. The table and bar chart below show how each model performs before and after clumping.

Model	Sensitivity	Specificity
No Interactions	0.0000	1.0000
First Order	0.0000	1.0000
Second Order	0.0000	1.0000
Third Order	0.0000	1.0000
Backwards Selection	0.0000	1.0000
No Interactions (Clumped)	0.0449	0.9943
First Order (Clumped)	0.0700	0.9917
Second Order (Clumped)	0.0979	0.9889
Third Order (Clumped)	0.1159	0.9870
Backwards Selection (Clumped)	0.1145	0.9875

From the table, we can see that clumping genres improves the model’s sensitivity, and that higher-order models had higher sensitivity, with the highest being approximately 11.6%. The “best” model from backwards selection actually had a slightly lower sensitivity than the third-order model, which is not surprising, since backwards selection optimizes AIC rather than sensitivity and removing predictors can reduce predictive performance. Overall, the binary logistic regression model is not particularly effective at classifying songs as EDM or not.

Since the logistic regression model is not adequate to predict whether a song is EDM, we approached the problem with a different model: a k-Nearest-Neighbor (kNN) algorithm. The basic methodology is as follows. We keep the EDM group, using the variable ‘is_edm’ to hold the true classification of each song. We then split the data into a training set and a testing set with an 80/20 split. We train the kNN model on the training set and then attempt to classify each song in the testing set as EDM or not EDM. For each song in the test set, the model finds the Euclidean distance between that song and every point in the training set, finds the 5 training songs that are closest in distance, and assigns whichever class occurs most often among those 5. We chose k = 5 by using 10-fold cross-validation to optimize k.

The kNN method yielded the most accurate results of any model we tried. We tried changing how many genres defined our EDM group and many different values of k, but ultimately, keeping the 19 closest genres in the EDM group and using the 5 nearest neighbors gave us an accuracy of 0.85, a sensitivity of 0.94, and a specificity of 0.46. Note that, unlike the logistic regression results above, these last two figures are reported with non-EDM (class 0) as the positive class: the model correctly labeled about 94% of non-EDM songs and about 46% of EDM songs. Overall, our model correctly predicted whether a song was EDM or not 85% of the time, compared with a no-information rate of about 82%.
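The kNN procedure described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than the model used in the paper (which appears to have been fit in R): the two-feature data are simulated, the features are assumed to be already standardized, and `knn_predict` and `split_80_20` are hypothetical helper names.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def split_80_20(X, y, seed=0):
    """Shuffle the row indices and hold out 20% of rows for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = len(X) // 5
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

# Simulated, already-standardized features (think energy and acousticness):
# class 1 ("EDM") clusters around (2, -2), class 0 around (-2, 2).
rng = random.Random(1)
X = [(rng.gauss(2, 0.5), rng.gauss(-2, 0.5)) for _ in range(50)]
X += [(rng.gauss(-2, 0.5), rng.gauss(2, 0.5)) for _ in range(50)]
y = [1] * 50 + [0] * 50

train_X, train_y, test_X, test_y = split_80_20(X, y)
predictions = [knn_predict(train_X, train_y, x) for x in test_X]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(f"toy accuracy: {accuracy:.2f}")
```

The toy clusters are far better separated than real genres, so the accuracy here says nothing about the Spotify data; the sketch only shows the mechanics of the distance search and the majority vote.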

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17615  2166
##          1  1185  1834
##                                           
##                Accuracy : 0.853           
##                  95% CI : (0.8484, 0.8576)
##     No Information Rate : 0.8246          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4377          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9370          
##             Specificity : 0.4585          
##          Pos Pred Value : 0.8905          
##          Neg Pred Value : 0.6075          
##              Prevalence : 0.8246          
##          Detection Rate : 0.7726          
##    Detection Prevalence : 0.8676          
##       Balanced Accuracy : 0.6977          
##                                           
##        'Positive' Class : 0               
##

Model	Accuracy	Sensitivity	Specificity
kNN (k = 5)	0.8530263	0.9369681	0.4585
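As a check on how these figures relate to the confusion matrix, the short Python sketch below recomputes them from the four counts in the output above. The variable names are ours; the output marks class 0 (non-EDM) as the positive class, so the sketch also shows the same counts read with EDM (class 1) as the positive class.

```python
# Counts copied from the confusion matrix (rows = prediction, columns = reference).
pred0_ref0 = 17615  # predicted non-EDM, actually non-EDM
pred0_ref1 = 2166   # predicted non-EDM, actually EDM
pred1_ref0 = 1185   # predicted EDM, actually non-EDM
pred1_ref1 = 1834   # predicted EDM, actually EDM

total = pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1
accuracy = (pred0_ref0 + pred1_ref1) / total

# Share of the majority class: the accuracy of always guessing non-EDM.
no_information_rate = (pred0_ref0 + pred1_ref0) / total

# Positive class = 0 (non-EDM), as reported in the output.
sens_non_edm = pred0_ref0 / (pred0_ref0 + pred1_ref0)
spec_non_edm = pred1_ref1 / (pred1_ref1 + pred0_ref1)

# Positive class = 1 (EDM): the two rates simply swap roles.
sens_edm = pred1_ref1 / (pred1_ref1 + pred0_ref1)
spec_edm = pred0_ref0 / (pred0_ref0 + pred1_ref0)

print(f"accuracy={accuracy:.4f}, no-information rate={no_information_rate:.4f}")
print(f"positive=non-EDM: sensitivity={sens_non_edm:.4f}, specificity={spec_non_edm:.4f}")
print(f"positive=EDM:     sensitivity={sens_edm:.4f}, specificity={spec_edm:.4f}")
```

Read with EDM as the positive class, the model finds about 46% of EDM songs, which is the figure directly comparable to the logistic regression sensitivities in the earlier table.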

CONCLUSION

In this paper, we wanted to address two questions: how trends in top Spotify songs changed over time and whether we could classify songs as EDM or not.

For our first question, on how top songs’ genres and attributes have changed over time, our analysis of Spotify data reveals significant trends in genre popularity. Older rap tracks remain more popular than newer ones, suggesting the value of historical and cultural significance within the genre. Latin music consistently ranks high in popularity, indicating its broad appeal and successful marketing strategies. Pop music, on the other hand, has seen a shift towards a more homogeneous sound in recent years, which could reflect the music industry’s adaptation to streaming platforms’ algorithms. The radar charts reveal a shift in the musical landscape of pop from 1963 to 2021: pop music initially displayed a rich diversity in its characteristics, but songs from 2020 and 2021 show less variability and more standardization in these attributes. These trends highlight evolving listener preferences and the dynamic nature of music popularity over time.

For our second question, on whether a song can be categorized as EDM, a k-Nearest-Neighbor (kNN) algorithm optimized through cross-validation gave us a more accurate model than logistic regression. These results suggest that, despite the complexities and numerous subgenres within EDM, the genre does possess distinctive characteristics that machine learning models can capture and recognize.

The way pop songs’ attributes have changed over time could signify an industry-wide calibration towards what is deemed popular or marketable in pop music, potentially influenced by factors such as technological advancements, changes in consumption habits, and the global reach of music streaming platforms. Analyzing trends over time provides valuable insights into changing listener preferences and cultural shifts, which are essential for artists, record labels, and marketers in planning their content creation and marketing efforts. By recognizing patterns, such as the increasing popularity of certain genres or the resurgence of others, stakeholders can make informed decisions about what music to produce and promote. Future research may investigate the relationship between popularity and song characteristics, and whether each era has its own preference in songs and the reasons behind it.

The ability to classify songs as EDM or not, as demonstrated by our implementation of a k-Nearest-Neighbor algorithm, opens up opportunities for personalized music recommendation systems. Such systems can enhance user experience on streaming platforms by suggesting songs that align with listeners’ tastes, particularly in genres like EDM with a distinct and recognizable style. Furthermore, our analysis, which yielded an overall accuracy of about 85%, illustrates the potential of machine learning for understanding and categorizing complex musical landscapes. This not only aids music discovery for listeners but also provides valuable data for musicologists and cultural researchers studying the evolution and characteristics of different music genres. Our analysis is therefore not just about understanding music trends and classification; it is about using these insights to inform the music industry’s production, distribution, and consumption in a data-driven era.

Keeping these conclusions in mind, we should also address potential drawbacks in our analysis and, in particular, in our datasets. Additional data such as user demographic information (age, location, and cultural background), coupled with individual listening habits, could vastly enrich our understanding of genre popularity and the nuanced appeal of specific music styles like EDM. Such data would facilitate a more granular analysis, revealing trends influenced by diverse listener groups. Additionally, detailed metadata about subgenres, artists, and albums, as well as social media interactions and sentiment analysis derived from song lyrics, could provide richer context and a more nuanced perspective on what drives a song’s popularity.

Incorporating the release years of Spotify tracks as an additional predictor in our classification model could also enhance its predictive power and accuracy. Integrating this information into our existing dataset would provide a temporal dimension that might be crucial in understanding and predicting music trends. The release year of a track can be a vital predictor, offering insights into the evolution of musical styles and preferences over time. By adding release year as a variable in our model, we could explore how the popularity of genres like EDM or the characteristics of top Spotify tracks have changed through different eras. This temporal perspective could also reveal historical patterns, helping to predict future trends and better understand the cyclical nature of music popularity. Including these types of data in our analysis would not only refine our current understanding but also open new avenues for exploration, leading to more comprehensive and actionable insights into the evolving landscape of music preferences.

Final Paper

STOR 320.02 Group 4

December 25, 2023