Hello! My name is Alex Grotjan. I’m from Dayton, Ohio, and I am a student at Xavier University studying Business Analytics and Information Systems (Class of 2026). As part of my Programming in Analytics class, I was tasked with performing an analysis on a data set from a topic of my choosing. I chose to do my analysis on music to see what kinds of factors influence streaming numbers, engagement, and overall song popularity. I hope you enjoy!

Part 1: Intro, Question, and Primary Dataset
Ever since I was little, music has been a huge part of my life. I have had countless experiences where music brings me to places of high emotion, and I think that many others would say the same. Music is an important part of our culture, and that is why understanding how it gets distributed, and what makes certain songs popular is so valuable. Not only does analyzing music popularity help us make better business decisions, it helps us understand our current culture. Through this project, I set out to answer 2 main questions:
What are the strongest attributes that contribute to song popularity?
Beyond music popularity, what makes music more engaging?
If we know what the strongest factors for song popularity are, then we can better understand where our current culture is in terms of music, and we can better understand what makes a hit song too. Furthermore, if we can understand what makes music more engaging, whether it be a cool music video or a high tempo, we can get a better picture of why certain music has a tighter community of listeners, or why certain music has more of a fan base behind it rather than mostly casual listeners.
Primary Data Source
Note: You must open any links from this webpage in a new tab or they will not open.
For this project, my main data source was called “Spotify and Youtube,” and it came from Kaggle (Link:Spotify_and_Youtube_Kaggle_Page). The data set records information about individual songs and their attributes like song name, artist name, danceability, loudness, YouTube video views, Spotify streams, and much more! (See the data dictionary later in this section or the Kaggle page for a full breakdown of the fields.)
Below is the primary data set that I used for this project. The full data set has about 20,000 rows. Each row represents one song (duplicates were removed for songs with multiple artists). You can use the horizontal scroll bar to explore all of the columns in the table below:
You can view the full data set at the link below:
Music_Data_Spotify_and_Youtube.csv

Summary of the Data/Summary Statistics
Note: the n_missing stat in the table below records the number of rows in the data set without a value for a specific attribute.
As you can see in the summary table below, some attributes, such as Url_youtube and Channel, have missing values. This makes sense because not every song has a YouTube video. Other attributes like streams and likes have some missing values as well, but with about 20,000 songs, a few hundred missing values do not make the data set unusable. If a larger portion of the data were missing these values, it would be smart to use another data set, but this one is only missing popularity/engagement metrics for about 3% of the data. We just have to be mindful that some songs are excluded when looking at metrics like streams and likes.
The table also records the range, max, min, mean, standard deviation, and distribution of values for all of the numeric values. One thing that sticks out is how the views, likes, comments, streams, and duration attributes are all skewed positively, which makes sense since there are a few songs in the data set that will be the most popular and have way more streams/likes than most of the other songs. There are also some insanely long songs in the data set which skew the duration distribution to the right. Overall, the majority of the numeric variables have at least some skew, suggesting that there may be some outliers when it comes to our different variables. It is hard to justify filtering out potential outliers in terms of high streams or high likes though because those songs are important in determining what makes a song popular. Those top 10-20 songs in terms of streams or likes are very important to understand what makes a huge global hit. Keep this in mind as we look at the different songs and what influences their popularity and engagement.
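As a rough illustration of the two checks described above (counting missing values and spotting positive skew), here is a minimal Python sketch. It is not the R code used for this project, the rows are made up, and the field names streams and likes are simply borrowed from the data set.

```python
from statistics import mean, median

# Made-up rows for illustration only; None marks a missing value.
songs = [
    {"track": "A", "streams": 120_000_000, "likes": 900_000},
    {"track": "B", "streams": 95_000_000, "likes": None},
    {"track": "C", "streams": None, "likes": 400_000},
    {"track": "D", "streams": 2_500_000_000, "likes": 30_000_000},
    {"track": "E", "streams": 60_000_000, "likes": 250_000},
]

def n_missing(rows, field):
    """Count rows that have no value for the given field."""
    return sum(1 for row in rows if row[field] is None)

def mean_and_median(rows, field):
    """Return (mean, median) over the rows that do have a value."""
    values = [row[field] for row in rows if row[field] is not None]
    return mean(values), median(values)

missing_streams = n_missing(songs, "streams")
avg_streams, mid_streams = mean_and_median(songs, "streams")

# A mean far above the median is a quick sign of positive (right) skew:
# one huge hit (track D) pulls the average up while the median barely moves.
right_skewed = avg_streams > mid_streams
```

Comparing the mean to the median is only a quick check, not a formal skewness statistic, but it shows why a handful of global hits can dominate the averages.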
Part 2: Descriptive Analysis
Now that we have done some basic exploration of the primary data set, let’s begin to explore the questions of “what makes a song more popular,” and “what makes a song more engaging.” First let’s look at a table of correlation coefficients to get a general idea of what variables have a positive or negative relationship with Spotify streams.
Likes, views, and comments have the highest correlations with streams, which would make sense. More engagement and popularity on YouTube videos goes hand in hand with more people listening to the song in general. Besides the popularity attributes, loudness, acousticness, and instrumentalness have the highest correlations with streams. The correlation with loudness is positive and this would make sense since songs with a louder sound would be more attractive to the average listener and catch their attention more. Acousticness and instrumentalness both have a negative correlation with streams. This informs us that popular songs usually have more production than an acoustic track and usually steer clear of a ton of instrumental parts. High energy and danceable tracks appear to be a factor in popularity as well with their correlations being the 7th and 8th strongest on the chart. The correlations are positive which suggests that songs with more energy and danceability tend to be more popular.
Overall, these statistics help us answer part of our question of what makes a song popular in terms of Spotify streams. We can see that tracks with more loudness, energy, and danceability tend to be more popular. We also see that tracks with fewer instrumental parts and less acousticness tend to do better on Spotify. This is a good overview of the data, but there is definitely some deeper analysis we can do to see what makes a song more popular. Next I want to focus only on songs that have at least 1 billion streams and see if there are similar correlations between each attribute and the number of Spotify streams. (Below is the same correlation chart but only for songs with at least 1 billion streams on Spotify)
The results here are actually pretty surprising. Likes, views, and comments have much smaller correlation coefficients than before, suggesting that engagement and YouTube presence do not matter as much once you get to songs with at least 1 billion streams. Energy and acousticness now have the opposite correlations with streams from before (energy is negative and acousticness is positive), suggesting that for the most popular songs, acoustic songs with a little less energy (overall intensity and activity) have a better chance of succeeding. Speechiness and liveness are now among the more important factors in popularity, and they both have a negative correlation with streams. This suggests that, among songs with 1 billion streams or more on Spotify, songs with a lot of spoken words or songs recorded live tend to do worse. Loudness and instrumentalness are no longer among the highest correlation coefficients, which suggests that a song’s loudness or amount of instrumental content is less relevant than things like energy, acousticness, or speechiness for the highest streamed songs on Spotify.
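To make the comparison between the two correlation tables concrete, here is a minimal Python sketch (not the R code used for this project). It assumes the tables use Pearson correlation coefficients, and the rows are made up purely to show how a correlation can change sign once the data is filtered to the biggest hits. The function names and the min_streams parameter are hypothetical.

```python
from math import sqrt

# Made-up values for illustration only (streams are in millions).
songs = [
    {"streams": 100, "loudness": -12.0, "acousticness": 0.9},
    {"streams": 300, "loudness": -9.0, "acousticness": 0.7},
    {"streams": 600, "loudness": -7.0, "acousticness": 0.5},
    {"streams": 1000, "loudness": -5.0, "acousticness": 0.1},
    {"streams": 1500, "loudness": -6.0, "acousticness": 0.2},
    {"streams": 2000, "loudness": -5.5, "acousticness": 0.3},
    {"streams": 3000, "loudness": -5.0, "acousticness": 0.4},
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlations_with_streams(rows, fields, min_streams=0):
    """Correlate each field with streams, strongest (by absolute value) first."""
    subset = [row for row in rows if row["streams"] >= min_streams]
    streams = [row["streams"] for row in subset]
    results = [
        (field, pearson([row[field] for row in subset], streams))
        for field in fields
    ]
    return sorted(results, key=lambda pair: abs(pair[1]), reverse=True)

fields = ["loudness", "acousticness"]
all_songs = dict(correlations_with_streams(songs, fields))
# 1000 here means 1 billion, since the toy streams are in millions.
billion_plus = dict(correlations_with_streams(songs, fields, min_streams=1000))
```

In this toy data, acousticness is negatively correlated with streams across all songs but positively correlated within the billion-plus group, which is the same kind of reversal described above.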
So this answers our question a little more fully about what makes a song more popular. The issue is more complex than it looks at first glance, with certain variables relating more strongly to streams among songs with billions of streams. Before we move on to our second main question about what makes songs more engaging, I want to explore song popularity through the lens of song duration. Even though it did not have the highest correlation value in either table above, it was number 5 in the second correlation table, and it showed a negative correlation. This means that songs with longer run times tend to have fewer streams. It would be useful to know a general cutoff of where a song becomes too long to have a good chance of success in terms of popularity. Below I made a scatterplot to show the duration and streams of all songs under 20 minutes. I also labeled some of the most popular songs as well as some of the longer songs so that we can see exactly which songs are performing well at different song lengths.
The scatterplot shows that the most successful songs are usually rap and pop type songs that are between 2.5 and 4 minutes long. This makes sense since a song length of around 3 minutes is enough time to craft a story and have a good song structure without losing the attention of the average listener. This information is helpful, but I am more interested in what the song length cutoff is for popularity, and this chart shows that around 7-8 minutes is where the popularity really starts to plummet. It is also worth noting that some of the longer songs that are very popular on the chart are older songs (November Rain, American Pie, Purple Rain; all of these are over 30 years old), which suggests that the cutoff of song duration for popularity is a little bit shorter when making music today.

Stairway to Heaven is one of the most popular long form songs of all time. Not very many songs longer than this have had major mainstream success. (Free Bird is another example but again, it is not a recent song)

Ed Sheeran’s Shape of You is a good example of a song that is mega-successful, and it’s in the sweet spot for song duration at about 4 minutes.
Now that we have done a good amount of exploration into what makes a song popular, let’s shift gears and dive into what makes a song more engaging. Why is it that certain songs get more people to engage with the content associated with the music? We can use a lot of the YouTube attributes in our primary data set such as comments and likes to explore this question. As you can see below, the number of Spotify streams for a song and the number of likes it gets for the song’s YouTube video are strongly related, indicated by the upward trend line. Surprisingly though, the relationship is not nearly as strong when you look at streams and comments, indicated by the flat trend line.
These scatterplots definitely point to the fact that there is something else at play that affects engagement besides how popular the song is in terms of Spotify streams. Let’s look at correlations between our numeric variables and “comments,” and then do the same for “likes.” If there are any glaring differences between the two correlation tables or something that sticks out for both tables, we could look deeper into those attributes that appear to have an effect on stronger engagement.
There is a decent amount to unpack when it comes to these 2 correlation tables. Below I explain some of the most important takeaways and interpretations that I got from the information:
Streaming numbers are less of a significant factor when it comes to comments. The correlation coefficient for comments and streams is only 0.26 vs a correlation of 0.66 for likes and streams. This means that even when a song is popular, deeper engagement through comments does not always go hand in hand with its popularity. There are other factors at play. In fact, comments have their highest correlation with likes. This suggests that songs that already have that surface level engagement of clicking the like button tend to get more of that deep engagement with comments. So if a music video or song can elicit a positive response in the viewer and get them to hit the like button, that is the best way to increase the number of comments. It is not just about popularity, but about getting the viewer to feel something strong enough that they would want to click the like button and then make a comment. So when making a music video or promoting a song, it is important to keep in mind how you will get likes on the song’s music video if you want that deeper engagement in the comments.
Likes and views have a very strong positive correlation at 0.89, which is extremely high compared to most of the other correlation coefficients. This is most likely because the more people watch a YouTube video, the more people there are to hit the like button. Even though the correlation between views and comments is only 0.41, the fact that comments are strongly correlated with likes, and likes are most strongly correlated with views, means that there is a chain of events happening here in which every step is important. Views having that correlation of 0.89 feels even more important when thinking about that chain of events (views → likes → comments).
Besides the popularity and engagement metrics like views and likes, the thing that sticks out the most in the 2 tables is the importance of loudness for getting both likes and comments on a song. The 4th strongest correlation coefficient for both likes and comments is loudness. This makes sense as louder songs would be more attention grabbing.
Also, danceability is shown to be more important than other attributes for getting likes on a song with a correlation of about 0.1, which is the 5th highest correlation for likes. This could again be because danceability makes the song more attention grabbing and makes you more likely to have some emotional response to the song and hit the like button.
I think we have made a good exploration of how different variables affect engagement, but let’s look deeper into loudness and danceability to see a more detailed view of how those attributes affect song engagement. I chose to look at likes compared with these two attributes because their correlations are stronger with likes than with comments. First let’s look closer at loudness.
The scatterplot above shows that all of the songs that do really well with likes are the louder songs. The reference line at 10 million likes makes it clear that none of the songs quieter than -15 dB are really successful when it comes to likes. Let’s take a look at danceability in a scatterplot with likes as well to see if there is a similar pattern.
We do see a similar pattern here where the highest liked songs are mostly songs with high danceability, but it is not as extreme as it was for loudness. It looks like most of the songs with a lot of likes have a danceability of at least around 0.3. This means that songs that are easier to dance to and have some consistent tempo with a well produced atmosphere are more likely to be successful in terms of likes on YouTube.
So we have established that the songs that are louder with more danceability are usually more engaging, but let’s go one step further and split up the songs by release type. I am curious to see how the number of comments differs if the song was released on an album, as a single, or as a compilation album.
This bar chart shows that when it comes to those louder and danceable songs, there is a sizable difference in the average number of comments across the different release types, with album releases averaging the most. This suggests that albums are usually more engaging for listeners when it comes to these types of songs.
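The comparison behind a bar chart like this can be sketched as a filter followed by a grouped average. The Python below is only an illustration, not the R code used for this project. The rows are made up, the field name album_type is hypothetical, and the cutoffs (-15 dB and 0.3) are borrowed from the scatterplots above as an assumption about what counts as loud and danceable.

```python
from collections import defaultdict

# Assumed cutoffs, taken from the loudness and danceability scatterplots.
MIN_LOUDNESS_DB = -15.0
MIN_DANCEABILITY = 0.3

# Made-up rows for illustration only.
songs = [
    {"album_type": "album", "loudness": -5.0, "danceability": 0.7, "comments": 1000},
    {"album_type": "album", "loudness": -6.0, "danceability": 0.6, "comments": 3000},
    {"album_type": "single", "loudness": -4.0, "danceability": 0.8, "comments": 1500},
    {"album_type": "single", "loudness": -20.0, "danceability": 0.5, "comments": 9000},
    {"album_type": "compilation", "loudness": -7.0, "danceability": 0.4, "comments": 500},
    {"album_type": "compilation", "loudness": -8.0, "danceability": 0.2, "comments": 7000},
]

def average_comments_by_release_type(rows):
    """Average comments per release type, keeping only loud, danceable songs."""
    totals = defaultdict(lambda: [0, 0])  # release type -> [sum of comments, song count]
    for row in rows:
        if row["loudness"] > MIN_LOUDNESS_DB and row["danceability"] > MIN_DANCEABILITY:
            totals[row["album_type"]][0] += row["comments"]
            totals[row["album_type"]][1] += 1
    return {kind: total / count for kind, (total, count) in totals.items()}

avg_comments = average_comments_by_release_type(songs)
```

Note that the quiet single and the low-danceability compilation track are dropped before averaging, so their large comment counts do not affect the result.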


Above is artwork from Opeth’s compilation album from 2009 vs artwork from its studio album Blackwater Park from 2001.
So for overall engagement, loud danceable songs with lots of views on YouTube tend to be better in the likes department, while songs from album releases that also have a large amount of likes and views are usually going to get the most engagement for comments. An important takeaway is that it is not just about how popular the song is when it comes to engagement. It’s also about the atmosphere and the character that the song has. This makes people interested enough to hit the like button or make a comment on the YouTube video.
Now we have an understanding of what affects song engagement and popularity. We looked through a lot of different attributes in our data and compared across groups, but all of our analysis was using a single data source. Our primary data set looks at data from only Spotify and YouTube. Let’s use an additional data source to get a broader view of song popularity across different platforms.
Part 3: Secondary Data Source

To supplement our analysis of streaming data from Spotify and YouTube, I decided to get some data from a music tracking service called Last.fm. If you would like to check out the homepage for Last.fm, you can use the link provided here: Last.fm
It is basically a platform that tracks your streaming activity across different platforms like YouTube, Spotify, and many more (it also has a unique user base of music enthusiasts). I am interested in seeing how play counts compare between the primary data source and the data from Last.fm. Once we compare the two data sets, we can better understand the differences in listeners between the 2 services, which will inform us further about what makes a song more popular.
To obtain a secondary data source, I used the API for Last.fm to retrieve data for the 100 most popular songs by Spotify streams. This way we can compare the same 100 songs across the two data sets. The data I collected is below:
note: Each row is a song (no duplicates) and each column contains some interesting information about the song such as:
Artist_name: name of the artist
name: name of the song
duration: length of the song in milliseconds
listeners: How many unique accounts have listened to the track
playcount: how many times the track has been scrobbled/listened to. A track is scrobbled when a listener listens to at least half the duration of a track, or at least 4 minutes of a track, whichever comes first
album_title: title of the album
top_tags: most frequently used tags for the song
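The scrobble rule in the playcount description above can be written as a small check. This Python sketch is based only on the rule as stated here (half the track or 4 minutes, whichever comes first) and ignores any other conditions Last.fm may apply. The function name is hypothetical.

```python
def is_scrobbled(listened_ms, duration_ms):
    """True once listening time reaches half the track or 4 minutes, whichever is shorter."""
    four_minutes_ms = 4 * 60 * 1000
    threshold_ms = min(duration_ms / 2, four_minutes_ms)
    return listened_ms >= threshold_ms

# A 3-minute song counts after 90 seconds of listening.
short_song = is_scrobbled(listened_ms=90_000, duration_ms=180_000)
# A 10-minute song counts after 4 minutes, before its halfway point.
long_song = is_scrobbled(listened_ms=240_000, duration_ms=600_000)
# 60 seconds of a 3-minute song is not enough.
too_short = is_scrobbled(listened_ms=60_000, duration_ms=180_000)
```

One consequence of this rule is that very long songs do not need to be played halfway through to count toward playcount.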
Here is a csv of the data for your reference:
Lastfm_API_Data
If you would like to get some of your own music data from the Last.fm API, use the tutorial below (if opening a link in this tutorial be sure to right click and open link in a new tab or it will not work):
Last.fm API Tutorial
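For a rough idea of what a request for one song might look like, here is a minimal Python sketch (the project itself was done in R, and the tutorial above is the better guide). It assumes the Last.fm track.getInfo method and the JSON layout shown in the sample payload, both of which should be checked against the official API documentation. The API key, artist, song, and all values in the sample payload are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://ws.audioscrobbler.com/2.0/"

def build_track_info_url(api_key, artist, track):
    """Build a track.getInfo request URL that asks for a JSON response."""
    params = {
        "method": "track.getInfo",
        "api_key": api_key,
        "artist": artist,
        "track": track,
        "format": "json",
    }
    return API_ROOT + "?" + urlencode(params)

def fetch_track_info(api_key, artist, track):
    """Call the API. Needs a real key and network access, so it is not run here."""
    with urlopen(build_track_info_url(api_key, artist, track)) as response:
        return json.load(response)

def parse_track(payload):
    """Flatten one response into the columns listed above (assumed JSON layout)."""
    track = payload["track"]
    tags = track.get("toptags", {}).get("tag", [])
    return {
        "Artist_name": track["artist"]["name"],
        "name": track["name"],
        "duration": int(track["duration"]),
        "listeners": int(track["listeners"]),
        "playcount": int(track["playcount"]),
        "album_title": track.get("album", {}).get("title"),
        "top_tags": ", ".join(tag["name"] for tag in tags),
    }

# Placeholder payload with made-up values, shaped like the assumed response.
sample_payload = {
    "track": {
        "name": "Example Song",
        "duration": "210000",
        "listeners": "1000",
        "playcount": "5000",
        "artist": {"name": "Example Artist"},
        "album": {"title": "Example Album"},
        "toptags": {"tag": [{"name": "pop"}, {"name": "dance"}]},
    }
}

url = build_track_info_url("YOUR_API_KEY", "Example Artist", "Example Song")
row = parse_track(sample_payload)
```

To build a table like the one above, you would call fetch_track_info once per song with your own key and pass each response to parse_track.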
Now we can compare the data from the API with our primary dataset to see how the play counts compare between Spotify and Last.fm. Let’s look at the streaming numbers by genre.
Note: The bars below are based on each genre’s average number of plays/streams per person on each platform. I standardized the values to per person because far more people use Spotify than Last.fm, so the comparison is more worthwhile when you look at streams/playcounts per person instead of the total streams/playcounts (700 million people use Spotify vs only 40 million on Last.fm).
This dodged bar chart shows the average streams/playcounts per user of the platform for multiple different genres. For example: Top streamed Rap songs get just under 4 listens per user on Spotify and just under 1 listen per user on Last.fm.
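The per-user standardization described in the note above is just a division by each platform’s user count. Here is a minimal Python sketch (not the R code used for this project). The user counts are the ones quoted in the note, while the genre averages are made-up numbers chosen only to land near the values read off the chart.

```python
# Platform user counts as quoted in the note above.
PLATFORM_USERS = {"spotify": 700_000_000, "lastfm": 40_000_000}

# Made-up average plays per top song in each genre, for illustration only.
avg_plays = {
    "rap": {"spotify": 2_700_000_000, "lastfm": 38_000_000},
    "pop": {"spotify": 2_100_000_000, "lastfm": 80_000_000},
}

def plays_per_user(avg_plays_by_genre, users):
    """Divide each genre's average plays by the number of users on that platform."""
    return {
        genre: {
            platform: plays / users[platform]
            for platform, plays in by_platform.items()
        }
        for genre, by_platform in avg_plays_by_genre.items()
    }

per_user = plays_per_user(avg_plays, PLATFORM_USERS)
```

With these toy numbers, rap comes out at roughly 3.9 listens per user on Spotify versus 0.95 on Last.fm, while pop is 3.0 versus 2.0, so the two platforms can be compared on the same scale even though their raw totals are far apart.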
The listens per user of the platform for pop songs is pretty similar for both services with a difference of about 1 stream per listener, but Spotify gets quite a few more streams per listener for other genres, especially for electronic and rap. This suggests that listeners on Last.fm are not listening as much to non-pop genres. Another way to put it is if a person on Last.fm is listening to a top 100 streamed song, they are more likely to be listening to a song tagged as pop rather than another genre. This is interesting and it’s insightful if you want to cater music to an audience that uses Last.fm. If there is a popular release with lots of streams that is of the pop genre, it would not be a bad idea to suggest that song more to Last.fm users on the website, since they gravitate more towards those songs over songs in other genres. This can help Last.fm keep people on the service by creating more accurate song suggestions for users on the website. Overall, the secondary data set from Last.fm is a good set of information to compare with the Spotify data to see what makes songs popular across platforms.
We have now looked at both the primary and secondary data source, observing all kinds of different attributes to see what affects song popularity and engagement. Let’s conclude by summarizing some of our findings and explaining some real-world applications of our analysis.

Conclusion
Overall, some of the biggest insights from this project are as follows:
Songs with more loudness, energy, and danceability tend to be more popular and songs with high acousticness and instrumentalness tend to be less popular. Things like loudness and instrumentalness matter less when it comes to songs with at least 1 billion streams though. The songs with at least 1 billion streams actually tend to do better with more acoustic attributes and less energy. This is to say that there are a lot of incredibly successful songs on Spotify that don’t necessarily have to be the most energy filled and highly produced tracks.
8 minutes is about as long as a song can get and still have a good chance of being popular. That number is not a hard cutoff though, and it is likely that 6-7 minutes would be a better rule of thumb to follow in today’s music market. A lot of the successful 8-9 minute songs are from 30 or more years ago.
Song engagement metrics such as likes and comments on the song’s YouTube video are strongly related to popularity metrics such as streams and views, but things like loudness and danceability are also important factors. The most engaging songs (those with lots of likes on their YouTube videos) are songs with a loudness level above -15 dB and danceability levels above 0.3. This means that beyond having pure popularity, a song needs to have some atmosphere or character to it to increase its engagement numbers. It was also found that being released on an album is quite important for these loud and danceable songs to perform well in terms of getting a lot of comments.
When looking across platforms, it is clear that song popularity varies across Last.fm and Spotify. When looking at the top 100 streamed songs from Spotify, pop songs were listened to more evenly across the two services, but all of the other genres such as electronic, rock, and rap had fewer listens per user on Last.fm than on Spotify. This suggests that, when it comes to highly streamed songs, Last.fm users lean more toward pop relative to other genres than Spotify listeners do.
Industry professionals and promoters can use all of this information to better promote songs and sign bands/artists that they believe will perform well. For example, if we know a song has the attributes to be really engaging because of its loudness, danceability, and ability to get lots of views on YouTube, a promoter can better advertise and plan the music release. Besides people in the industry, everyday people can benefit from this analysis to see what kind of music is more popular and engaging. I hope that my analysis was able to feed your curiosity about music and give you some insight into what makes a song popular and engaging. Now go enjoy some awesome music!😁
This concludes my project. I would like to say thanks to my professor Joel Asay at Xavier University (Cincinnati, OH) for teaching me all about programming in R for analytics, and I would also like to thank the whole Business Analytics department at Xavier for preparing me for the professional world of data. This is my last Business Analytics class as an undergraduate and it was an absolute blast!
All the best,
Alex Grotjan