Introduction

Music has played a significant role for much of my life. I grew up playing the piano and later learned to DJ in college. Music has helped me find creativity and inspiration during stressful times, which is why I wanted to dedicate this project to exploring one vital question: which factors make a song popular, and how have these factors changed over time?

To analyze this question, I am using the Million Song Dataset Subset, a metadata dataset created through a collaboration between The Echo Nest and LabROSA at Columbia University. The dataset includes approximately 50,000 observations along with both qualitative and quantitative variables related to the songs and artists. For this project, I cleaned the dataset by removing missing values and songs with unknown release years so that the data could be used accurately for analysis. The main variables of interest are Song Title, Artist Name, Album, Release Year, Song Duration, Artist Familiarity, Artist Popularity, and a newly created variable called Decade of Release, which was added to analyze music over time. Song Title, Artist Name, Album, and Decade of Release are qualitative variables, as they represent categories rather than numerical units of measurement. The quantitative variables are Release Year, Song Duration, Artist Familiarity, and Artist Popularity. Artist Familiarity and Artist Popularity are scored on a scale of 0 to 1 and are generated from aggregated listening and metadata patterns. While these two variables do not have traditional units of measurement, they serve as useful indices for comparing relative levels of artist recognition and popularity in the dataset.
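The cleaning step and the Decade of Release variable described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's actual code. The field names (`title`, `artist`, `year`, `duration`) and the assumption that an unknown release year is stored as 0 are hypothetical.

```python
def decade_label(year):
    """Map a release year to a decade label, e.g. 1994 -> '1990s'."""
    return f"{(year // 10) * 10}s"


def clean_songs(rows):
    """Drop rows with missing values or an unknown release year, then add a decade."""
    cleaned = []
    for row in rows:
        # Assumption: missing values appear as None or empty strings.
        if any(value is None or value == "" for value in row.values()):
            continue
        # Assumption: an unknown release year is encoded as 0.
        if row["year"] == 0:
            continue
        cleaned.append({**row, "decade": decade_label(row["year"])})
    return cleaned


# Made-up rows, for illustration only.
songs = [
    {"title": "A", "artist": "X", "year": 1994, "duration": 215.0},
    {"title": "B", "artist": "Y", "year": 0, "duration": 180.0},    # unknown year
    {"title": "C", "artist": "Z", "year": 2003, "duration": None},  # missing value
]
cleaned_songs = clean_songs(songs)
print(cleaned_songs)
```

Deriving the decade with integer division keeps the new variable categorical (a label such as `1990s`) while still being reproducible from Release Year.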

This project explores how Artist Popularity, Artist Familiarity, and song characteristics have evolved over time, with the goal of understanding broader shifts in the music industry. By analyzing the relationships between these variables and examining how they change across decades, this project aims to uncover patterns in how music consumption and artist recognition have developed. Each visualization builds on the previous one to provide a more complete picture of how these factors interact and change over time. Color choices and visual design were selected with accessibility in mind, including color vision deficiencies and readability concerns, and descriptive alt text was written for some of the key visualizations. For comparisons, high-contrast palettes were chosen to ensure clarity. Lastly, a consistent visual theme was maintained across all visualizations through color palettes, labeling conventions, and layout structures to ensure a cohesive narrative.

Artist Familiarity vs Artist Popularity

This visualization shows the relationship between Artist Familiarity and Artist Popularity using a scatterplot. Due to the large number of observations in the dataset, a random sample of 5,000 songs was used for this scatterplot in order to reduce overplotting and clutter while still preserving the overall trends and relationships observed in the data. A scatterplot was chosen because it is an effective way to evaluate the relationship between two quantitative variables and identify any potential correlation between them.
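The sampling step can be sketched as below. This is an illustrative Python example rather than the project's own code. The helper name `sample_rows` and the seed value are assumptions, and the seed is fixed only so the same sample is drawn on every run.

```python
import random


def sample_rows(rows, n, seed=42):
    """Return up to n rows chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    if len(rows) <= n:
        return list(rows)
    return rng.sample(rows, n)


# Stand-in for the full dataset: one integer per song.
all_rows = list(range(50_000))
subset = sample_rows(all_rows, 5_000)
print(len(subset))
```

Sampling without replacement means no song is plotted twice, so the thinned scatterplot stays an unbiased picture of the full data.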

Scatterplot showing the relationship between Artist Familiarity on the x-axis and Artist Popularity on the y-axis, with a general positive association between the two variables.

According to the graph, there is a positive relationship between the two variables, showing that artists who are better known also tend to be more popular. The upward trend line further supports this finding. One disparity in this visualization is the large concentration of points where Artist Popularity is equal to 0, which suggests that many artists in the dataset have low measured popularity compared to a smaller group of highly popular artists. Additionally, it is important to note that these variables are algorithmically generated scores rather than direct measurements, so interpretations should focus on relative relationships instead of absolute values. This visualization begins to answer the main question of the project by showing that Artist Familiarity appears to be an important factor associated with Artist Popularity.

Popularity & Duration by Decade

While the earlier scatterplot highlights the relationship between Artist Familiarity and Artist Popularity, it does not show how these patterns have changed over time. The next pair of visualizations shows the distributions of Artist Popularity and Song Duration across decades using boxplots. Boxplots were used because they make it easier to compare time periods by their median, spread, and outliers.
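The quantities a boxplot draws (median, quartiles, and outliers) can be computed per decade as in the sketch below. It is an illustrative Python example with made-up values. It assumes the common 1.5 x IQR rule for flagging outliers and the "inclusive" quartile method, and plotting software may use slightly different quartile conventions.

```python
import statistics
from collections import defaultdict


def boxplot_summary(values):
    """Return quartiles, IQR, and outliers (1.5 x IQR rule) for one group.

    Requires at least two values.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


def summarize_by_decade(rows):
    """rows: iterable of (decade, value) pairs. Returns {decade: summary}."""
    groups = defaultdict(list)
    for decade, value in rows:
        groups[decade].append(value)
    return {decade: boxplot_summary(vals) for decade, vals in groups.items()}


# Made-up song durations in seconds, for illustration only.
durations = [150, 160, 170, 180, 190, 200, 210, 220, 600]
summary = summarize_by_decade([("2000s", d) for d in durations])
print(summary["2000s"])
```

Comparing `median`, `iqr`, and the number of `outliers` across decades gives the same information the boxplots convey visually.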

The Artist Popularity boxplot shows clear differences across decades in both median popularity and variability. Earlier decades show lower median popularity values and less variability, while more recent decades such as the 1990s, 2000s, and 2010s show greater variability and more outliers. This may suggest that Artist Popularity has become more varied over time, with the range of values widening in more recent decades. These changes could be influenced by shifts in the music industry, technology, and streaming platforms.

The Song Duration boxplot shows similar patterns across decades. Earlier decades show shorter song durations and less variability, while later decades show longer median song durations and wider ranges of values. There are also more outliers in later decades, indicating that song length has become more varied in modern music compared to earlier decades.

Together, these two visualizations show that both Artist Popularity and Song Duration have changed over time. The similar patterns across decades in both boxplots further support the idea that music trends have evolved over time, which helps address the main question of this project. To better understand how these relationships evolve, we now examine how Artist Popularity changes across different time periods.

Average Artist Popularity Over Time

This line chart shows the average Artist Popularity over time. This chart is effective because it clearly illustrates trends and changes over time, allowing for easy comparison across different years.

From 1920 to the 1960s, average Artist Popularity fluctuates considerably. From the 1960s onward, however, there is less variation from year to year. This could suggest that Artist Popularity was less stable during earlier periods. The smoothed trend line indicates that Artist Popularity generally increases from the early years into the mid-1900s, followed by a slight decline and more stability later. One disparity is the difference in stability between earlier and more recent years: the earlier years show more volatility in average Artist Popularity, while the recent years are more consistent.
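The yearly averages and a smoothed trend can be sketched as follows. This is an illustrative Python example, not the project's own code. A centered moving average stands in for whichever smoother the chart actually uses, and the window size is an assumption.

```python
from collections import defaultdict


def yearly_mean(rows):
    """rows: iterable of (year, popularity). Returns {year: mean}, sorted by year."""
    totals = defaultdict(lambda: [0.0, 0])
    for year, popularity in rows:
        totals[year][0] += popularity
        totals[year][1] += 1
    return {year: total / count for year, (total, count) in sorted(totals.items())}


def moving_average(values, window=3):
    """Centered moving average; the window shrinks at both ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up (year, popularity) pairs, for illustration only.
means = yearly_mean([(1990, 0.2), (1990, 0.4), (1991, 0.6)])
print(means)
print(moving_average([1, 2, 3, 4, 5], window=3))
```

Years with only a handful of songs produce noisier averages, which is worth keeping in mind when reading volatility in the early part of a series like this one.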

The overall trend in this line chart supports the finding that Artist Popularity has become more stable over time, possibly due to better technology and more musical resources.

Correlation Heatmap

This correlation heatmap displays the correlations between key quantitative variables in the dataset, allowing for quick comparisons of the relationships between the chosen variables. The color scale on the right represents the strength of the correlation, where darker shades indicate stronger positive relationships and lighter shades indicate weaker relationships.

Correlation heatmap displaying relationships between variables such as Artist Popularity, Artist Familiarity, and Song Duration, with color intensity indicating the strength and direction of correlations.

The strongest relationship is between Artist Familiarity and Artist Popularity, which have a correlation of 0.73. This is a strong positive relationship, which means that more familiar artists also tend to be more popular. In contrast, Song Duration and Release Year each show a correlation of zero with Artist Popularity, which indicates little to no linear relationship with that variable.
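Each cell of a correlation heatmap is a pairwise correlation coefficient. The sketch below computes the Pearson coefficient in plain Python, on the assumption that Pearson correlation is what the heatmap reports. The input values are made up and are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_matrix(columns):
    """columns: {name: values}. Returns {(name_a, name_b): correlation}."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Made-up values, for illustration only.
perfect = pearson([1, 2, 3], [2, 4, 6])
inverse = pearson([1, 2, 3], [3, 2, 1])
none = pearson([1, 2, 3, 4], [1, -1, -1, 1])
print(perfect, inverse, none)
```

A coefficient near zero only rules out a linear relationship. The last example has a clear U-shaped pattern yet no linear correlation.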

A key takeaway from this visualization is the disparity in correlation strength across variables. Although there is a strong positive association between Artist Familiarity and Artist Popularity, other variables such as Song Duration and Release Year exhibit much weaker relationships. This emphasizes that not all factors contribute equally to Artist Popularity. Overall, the correlation heatmap contributes to the main question of the project by showing that Artist Familiarity is the variable most strongly associated with Artist Popularity.

Distribution of Artist Popularity Over Time

This animation extends the previous analysis by showing how the distribution of Artist Popularity changes over time. Instead of focusing on a single snapshot, it allows us to observe shifts across decades, revealing patterns that are not immediately visible in static plots. Overall, this animation supports the main goal of the project by illustrating how Artist Popularity has evolved, providing a new perspective on trends in the music industry.

Animated histogram showing how the distribution of Artist Popularity changes across decades, with shifts in the shape and concentration of values over time.

From the animation, it is clear that Artist Popularity is heavily concentrated at lower values up to the 1990s, indicating that a large proportion of artists have relatively low popularity. From the 2000s onward, however, there is slightly more spread and a greater presence of higher popularity values. This suggests that while most artists remain less popular, an increasing number of artists have reached higher popularity levels in recent years.

One key disparity in this animation is the consistent imbalance between the low-popularity and high-popularity artists. Throughout the entire animated histogram, the majority of artists remain at lower popularity values, while only a smaller portion achieves higher popularity. This suggests that popularity is not evenly distributed and remains concentrated among a limited number of artists.
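Each frame of an animated histogram is one decade's bin counts. The sketch below shows that binning in Python on made-up values. The number of bins and the function name `histogram_by_decade` are assumptions for illustration.

```python
from collections import defaultdict


def histogram_by_decade(rows, bins=10):
    """rows: iterable of (decade, popularity in [0, 1]). Returns {decade: bin counts}."""
    counts = defaultdict(lambda: [0] * bins)
    for decade, popularity in rows:
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(popularity * bins), bins - 1)
        counts[decade][index] += 1
    return dict(counts)


# Made-up scores, for illustration only.
frames = histogram_by_decade([
    ("1990s", 0.05), ("1990s", 0.12),
    ("2000s", 0.55), ("2000s", 1.0),
])
for decade, counts in frames.items():
    print(decade, counts)
```

Comparing the counts in the low bins with those in the high bins, decade by decade, is one way to quantify the imbalance described above.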

Overall, this animation supports the main question of the project by illustrating how the distribution of Artist Popularity has evolved over time, while also highlighting how unevenly popularity is distributed among artists.

Decade Comparison Across Variables

This Shiny dashboard lets the user compare musical characteristics (Artist Popularity, Artist Familiarity, and Song Duration) across two decades of their choice. The dashboard updates in real time, showing the distribution (via density plot or boxplot) of the chosen variable as well as the top-performing artists within the two decades. The bar chart can also be customized by setting the number of top artists shown and a minimum number of songs per artist. At the bottom, a summary statistics table provides the mean, median, standard deviation, minimum value, and maximum value of the chosen variable, as well as the number of songs included in each decade after filtering.
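The summary table and the top-artist ranking can be sketched as below. The dashboard itself is a Shiny app, so this Python version is only an illustration of the calculations. Ranking artists by the mean of the chosen variable is an assumption, and the parameter names `top_n` and `min_songs` are hypothetical.

```python
import statistics
from collections import defaultdict


def summarize(values):
    """Summary statistics for one decade (needs at least two values for the SD)."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def top_artists(rows, top_n=5, min_songs=1):
    """rows: iterable of (artist, value). Rank artists by their mean value."""
    by_artist = defaultdict(list)
    for artist, value in rows:
        by_artist[artist].append(value)
    ranked = [
        (artist, statistics.mean(values))
        for artist, values in by_artist.items()
        if len(values) >= min_songs  # drop artists with too few songs
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Made-up (artist, value) pairs, for illustration only.
rows = [("A", 1.0), ("A", 3.0), ("B", 5.0), ("C", 2.0), ("C", 2.0)]
print(summarize([value for _, value in rows]))
print(top_artists(rows, top_n=2, min_songs=2))
```

Raising `min_songs` removes artists whose average rests on a single track, which can otherwise dominate the ranking.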

Explore the Shiny app here: Decade Comparison Across Variables

With the default decades selected, the number of top artists set to 5, and the minimum songs per artist set to 1, the density plot shows the difference in Artist Popularity between the 1920s and the 2010s, revealing that modern artists tend to have higher and more varied popularity scores. Using the same filtering, the comparison of Artist Familiarity between the 1920s and 2010s shows that modern artists tend to be more widely recognized by listeners: the distribution for the 2010s is shifted further to the right and appears more spread out. For Song Duration, the distribution plot reveals that songs in the 2010s tend to be longer than those in the 1920s. The top artist bar chart also supports this, showing that the leading artist in the 2010s has a substantially higher average Song Duration than the top artist in the 1920s.

This application differs from the other visualizations by being multi-layered and allowing different time periods to be compared for each variable. Additionally, it combines several statistical views (distributional analysis, ranking, and summary statistics) in one interface, which gives a more comprehensive understanding of the data. Ultimately, this application enables deeper exploration by letting users control what they want to see or investigate in the data.

Distribution of Songs Across Artist Familiarity and Popularity

This Tableau dashboard presents a density-based view of the relationship between Artist Familiarity and Artist Popularity, highlighting where songs are most concentrated within the dataset. Unlike earlier visualizations that place an emphasis on individual relationships or trends over time, this plot shows the overall structure of the data by highlighting areas with the highest density of observations.

Explore the Tableau Dashboard here: Distribution of Songs Across Artist Familiarity and Popularity

This dashboard reveals that the densest region of the plot lies at moderate levels of both familiarity and popularity, indicating that most songs fall within an average range. In contrast, relatively few songs appear at the higher levels of popularity, demonstrating a disparity in the distribution of successful artists. This suggests that while many artists achieve moderate recognition, only a small percentage of them reach the highest levels of popularity. This aligns with earlier findings from the scatterplot and correlation analysis, which showed a positive relationship between familiarity and popularity but also indicated that this relationship does not necessarily extend to the highest values, or outliers. The elliptical shape of the density pattern also supports the positive association between familiarity and popularity. However, the spread of the density plot indicates variability, which suggests that other influential factors are not captured in the dataset. Another key insight from this dashboard is that, regardless of the time period, the overall distribution of popularity remains uneven, and the rarity of extreme popularity persists as a consistent pattern.

This visualization differs from the other visuals in the project by focusing on data density rather than individual relationships or trends. While previous graphs explored the interaction of variables over time, this plot provides a broader perspective on the overall structure of the dataset. Lastly, it reinforces the main findings of the project by illustrating that popularity is not evenly distributed and is concentrated among a limited number of artists, a pattern related to the factors explored here and possibly to other variables not in the dataset.

Music Matchmaker

While the previous visualization provided a high-level view of how songs are distributed across Artist Familiarity and Popularity, the Music Matchmaker application shifts the focus from overall structure to individual-level analysis. Instead of examining where songs cluster in the dataset, this interactive application allows users to select a specific song and identify others that are most similar based on chosen characteristics and inputs.

Explore the Shiny app here: Music Matchmaker

This application allows users to define similarity by selecting features such as Artist Familiarity, Artist Popularity, Song Duration, and Release Year. Additionally, users can filter songs by decade, allowing for a more in-depth comparison within specific time periods. By computing similarity using Euclidean distance on standardized features, the application identifies songs that are closest to the selected track in terms of the chosen inputs. The results are then presented through a table of the most similar songs, along with a plot comparing the chosen song to its matches based on Artist Popularity and Familiarity. Also, a summary of the most similar song and most different song is displayed at the bottom.
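The similarity calculation described above, Euclidean distance on standardized features, can be sketched as follows. The application is a Shiny app, so this Python version is an illustration rather than its actual code. The field names and sample songs are made up.

```python
import math
import statistics


def standardize(column):
    """Convert a column to z-scores; a constant column becomes all zeros."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [0.0 if sd == 0 else (value - mean) / sd for value in column]


def most_similar(songs, features, target_index, k=3):
    """Return the k songs closest to the target, as (title, distance) pairs."""
    z = {f: standardize([song[f] for song in songs]) for f in features}

    def distance(i):
        # Euclidean distance between song i and the target in z-score space.
        return math.sqrt(sum((z[f][i] - z[f][target_index]) ** 2 for f in features))

    others = [i for i in range(len(songs)) if i != target_index]
    others.sort(key=distance)
    return [(songs[i]["title"], distance(i)) for i in others[:k]]


# Made-up songs, for illustration only.
songs = [
    {"title": "A", "familiarity": 0.80, "popularity": 0.60, "duration": 200},
    {"title": "B", "familiarity": 0.79, "popularity": 0.61, "duration": 205},
    {"title": "C", "familiarity": 0.30, "popularity": 0.10, "duration": 400},
    {"title": "D", "familiarity": 0.50, "popularity": 0.40, "duration": 250},
]
matches = most_similar(songs, ["familiarity", "popularity"], target_index=0, k=2)
print(matches)
```

Standardizing first matters because the features are on very different scales. Without it, Song Duration (in seconds) or Release Year would dominate the 0-to-1 familiarity and popularity scores in the distance.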

This interactive approach emphasizes how similarity can vary depending on which features are prioritized. For example, selecting popularity-based features may highlight songs by widely recognized artists, while choosing duration or release year may return songs that are similar only in length or era. This demonstrates that similarity depends on how it is defined, further reinforcing the idea that multiple factors contribute to how songs relate to one another over time. Furthermore, this application builds on the takeaways from the Tableau density plot: songs matched on popularity and familiarity often reflect the broader pattern of moderate clustering shown in the density plot, while also showing how difficult it is to find close matches for songs at the highest levels of popularity.

Actual vs Predicted Artist Popularity

While the Music Matchmaker application allowed for interactive exploration of similarity between songs, the prediction plot shifts the focus toward understanding how well key variables in the dataset can be used to predict popularity. This is done by fitting a model in which Artist Popularity is the response variable and Artist Familiarity, Song Duration, and Release Year are the predictors. The predicted popularity values from the model lie on the x-axis, while the actual popularity values lie on the y-axis. The diagonal reference line represents perfect agreement between the predicted and actual values. Points that lie close to this line indicate accurate predictions, while points farther away reflect weaker predictions.
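The report does not state which kind of model was fitted. As one possibility, the sketch below fits an ordinary least squares linear regression in plain Python and measures how far predictions fall from the diagonal using the root mean squared error. The choice of linear regression, the function names, and the data are all assumptions for illustration.

```python
import math


def fit_linear_model(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of predictor rows; an intercept column is added automatically.
    Returns [intercept, b1, b2, ...]. Assumes X'X is not singular.
    """
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    coef = [0.0] * p
    for i in reversed(range(p)):
        known = sum(A[i][j] * coef[j] for j in range(i + 1, p))
        coef[i] = (A[i][p] - known) / A[i][i]
    return coef


def predict(coef, X):
    """Predicted response for each predictor row."""
    return [coef[0] + sum(b * v for b, v in zip(coef[1:], r)) for r in X]


def rmse(predicted, actual):
    """Root mean squared distance between predictions and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Made-up data that follows y = 1 + 2*x1 + 3*x2 exactly.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
coef = fit_linear_model(X, y)
print(coef, rmse(predict(coef, X), y))
```

On real data with predictors on very different scales, such as Release Year and Song Duration, standardizing the predictors first would make the normal equations better conditioned. A point lying exactly on the diagonal corresponds to a residual of zero.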

Scatterplot comparing predicted and actual song popularity values. The plot shows a positive relationship, with moderate popularity values clustering near the center of the reference line.

The concentration of points near the diagonal reference line suggests that the model is able to capture general trends in the data, specifically for songs with moderate popularity levels. However, the increasing spread of points above and below the line shows that the model becomes less accurate for more extreme popularity values. This pattern indicates that extreme values are more difficult to predict, which aligns with the earlier finding that highly popular songs are relatively rare.

These results connect to the main findings throughout the report, which highlight the relationship between Artist Popularity and Artist Familiarity, as well as the uneven distribution of popularity across the observations in the dataset. The prediction plot builds on this by showing that while these relationships can explain popularity to some extent, they do not fully explain the variation in popularity. Additionally, the dispersion of points around the diagonal line suggests that important factors influencing popularity are not fully captured in the model. This means that song success is shaped by a combination of the characteristics studied here and other variables not present in the dataset.

In contrast to the density plot, which showcased the overall distribution of songs, this prediction plot evaluates the accuracy of predictions based on those patterns. The prediction plot solidifies the analysis by showing both the strengths and weaknesses of the predictive approach, supporting general trends while also acknowledging the challenge of predicting extreme outcomes.

Conclusion

The main purpose of this project was to explore which factors shape artist and song popularity and how these factors have changed over time. Through a combination of different plots, Shiny applications, a Tableau dashboard, and an animation, several key patterns were identified to help explain the structure of popularity within the music industry.

One consistent finding across all analyses is the strong relationship between Artist Familiarity and Artist Popularity. Both the scatterplot and correlation heatmap demonstrated a clear positive association, suggesting that better-known artists tend to achieve higher levels of popularity. However, the analyses also revealed variability in this relationship, which suggests that other external factors may be involved.

The visualizations dedicated to examining trends over time revealed that both Artist Popularity and song characteristics have evolved across decades. The boxplots and line chart suggested that Artist Popularity and Song Duration have changed substantially, with the boxplots showing more variability in recent decades. Additionally, Song Duration tends to be longer for modern songs. Taken together, the results suggest a shift toward higher popularity levels alongside changes in artist recognition and song length. These changes suggest that the modern music landscape allows for more flexibility in terms of song characteristics and popularity levels.

The Shiny applications demonstrated that while general patterns exist, the relationships between variables can vary depending on the selected time periods and how similarity is defined, suggesting that popularity and song relationships are influenced by a combination of interacting factors rather than a single variable. The Tableau dashboard revealed that most songs are concentrated within moderate levels of familiarity and popularity, showing that successful artists are not evenly distributed. Lastly, the prediction plot demonstrated that while the model captures general trends in song popularity, its accuracy is reduced at extreme values, which highlights the difficulty of predicting outcomes driven by multiple factors.

Overall, this analysis shows that while factors such as Artist Familiarity play a significant role in Artist Popularity, the broader structure of the music industry is shaped by complex and competitive dynamics. Popularity is influenced by many interacting factors and remains unevenly distributed across the music industry. Moreover, these findings support the idea that the music industry has evolved over time, reinforcing the importance of examining both relationships and changes over time in order to fully understand these dynamics.