Introduction

Music has played a significant role for much of my life. I grew up playing the piano and later learned how to DJ in college. Music has helped me foster creativity and find inspiration during stressful times, which is why I wanted to dedicate this project to exploring one central question: which factors make a song popular, and how have these factors changed over time?

To analyze this question, I am using the Million Song Dataset Subset, which is a metadata dataset that was created through a collaboration between The Echo Nest and LabROSA at Columbia University. The dataset includes thousands of observations along with both qualitative and quantitative variables related to the songs and artists. For this project, I cleaned the dataset by removing missing values and songs with unknown release years in order to ensure the data could be accurately used for analysis. The main variables of interest are song title, artist name, album, release year, song duration, artist familiarity, artist popularity, and a newly created variable called decade of release, which was created to analyze music over time. Song title, artist name, album, and decade of release are qualitative variables, as they represent categories rather than numerical units of measurement. The quantitative variables are release year, song duration, artist familiarity, and artist popularity. Artist familiarity and artist popularity are scored on a scale of 0 to 1, generated from aggregated listening and metadata patterns. While these two variables do not have traditional units of measurement, they serve as useful indices for comparing relative levels of artist recognition and popularity in the dataset.
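The cleaning step and the decade variable described above can be sketched as follows. This is a minimal Python illustration, not the project's actual code; the field names ("title", "year", "duration") and the convention that an unknown release year is stored as 0 or None are assumptions made for the example.

```python
def decade_of_release(year):
    """Map a release year to a decade label, e.g. 1987 -> '1980s'."""
    return f"{(year // 10) * 10}s"


def clean_songs(rows):
    """Drop rows with a missing value or unknown year, then add a decade label."""
    cleaned = []
    for row in rows:
        year = row.get("year")
        # Assumption: an unknown release year is stored as 0 or None.
        if not year:
            continue
        # Drop the row if any field is missing.
        if any(value is None for value in row.values()):
            continue
        cleaned.append({**row, "decade": decade_of_release(year)})
    return cleaned


# Hypothetical example rows.
songs = [
    {"title": "A", "year": 1987, "duration": 215.0},
    {"title": "B", "year": 0, "duration": 180.0},    # unknown year
    {"title": "C", "year": 2004, "duration": None},  # missing value
]
print(clean_songs(songs))
```

Treating the decade as a label rather than a number matches its use as a qualitative grouping variable.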

The goal of this project is to explore how song characteristics and artist popularity have evolved over time and to examine relationships between the main variables of interest to identify meaningful patterns. A variety of visualizations will be used, including Tableau dashboards, animations, Shiny applications, and ggplot graphs. These visualizations will incorporate all of the variables and help illustrate the broader story of how songs and artists have changed over time.

Artist Familiarity vs Artist Popularity

This visualization shows the relationship between artist familiarity and artist popularity using a scatterplot. Due to the large number of observations in the dataset, a random sample of 5,000 songs was used for this scatterplot in order to reduce overplotting and clutter while still preserving the overall trends and relationships observed in the data. A scatterplot was chosen because it is an effective way to evaluate the relationship between two quantitative variables and identify any potential correlation between them.
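As an illustration of the sampling step, here is a small Python sketch of drawing a reproducible random sample without replacement. The function name, the seed value, and the fallback for small inputs are assumptions for the example and are not taken from the project.

```python
import random


def sample_rows(rows, k=5000, seed=42):
    """Return up to k rows chosen at random without replacement."""
    if len(rows) <= k:
        # Nothing to thin out: keep every row.
        return list(rows)
    rng = random.Random(seed)  # fixed seed so the same sample is drawn each run
    return rng.sample(rows, k)


# Hypothetical stand-in for the cleaned song rows.
all_rows = list(range(20000))
subset = sample_rows(all_rows)
print(len(subset))
```

Fixing the seed keeps the plotted points identical between runs, which makes the figure reproducible.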

According to the graph, there is a positive relationship between the two variables: artists who are better known also tend to be more popular. The upward trend line further supports this finding. One disparity in this visualization is the large concentration of points where artist popularity is equal to 0, which suggests that many artists in the dataset have low measured popularity compared to a smaller group of highly popular artists. Additionally, these variables are algorithmically generated scores rather than direct measurements, so interpretations should focus on relative relationships instead of absolute values. This visualization begins to answer the main question of the project by showing that artist familiarity is strongly associated with artist popularity.

Popularity & Duration by Decade

These visualizations show the distributions of artist popularity and song duration across decades using boxplots. Boxplots were used because they make it easier to observe differences and disparities between time periods by comparing the median, spread, and outliers in the data.
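To make the median, spread, and outliers concrete, the sketch below computes the quantities a boxplot typically draws. It is a Python illustration under stated assumptions: the quartile method ("inclusive") and the common 1.5 x IQR outlier rule are choices made here, and the plotting software used in the project may compute its hinges slightly differently.

```python
import statistics


def box_stats(values):
    """Compute quartiles, IQR fences, and outliers for one group of values."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: points beyond 1.5 * IQR from the quartiles count as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical values for a single decade.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

Running this once per decade gives the numbers behind each box, which makes differences in median and spread easy to compare directly.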

The artist popularity boxplot shows clear disparities across decades in median popularity and variability. Earlier decades show lower median popularity values and less variability, while more recent decades such as the 1990s, 2000s, and 2010s show greater variability and more outliers. This may suggest that artist popularity has become more varied over time, with the range of values widening in more recent decades. These changes could be influenced by shifts in the music industry, technology, and streaming platforms.

The song duration boxplot shows similar patterns across decades. Earlier decades show shorter song durations and less variability, while later decades show longer median song durations and wider ranges of values. There are also more outliers in later decades, indicating that song duration has changed over time and that modern music may accommodate more variation in song length than earlier decades did.

Together, these two visualizations show that both artist popularity and song duration have changed over time. The similar patterns across decades in both boxplots further support the idea that music trends have evolved, which helps address the main question of this project.

Average Artist Popularity Over Time

This line chart shows the average artist popularity over time. This chart is effective because it clearly illustrates trends and changes over time, allowing for easy comparison across different years.

From 1920 to the 1960s, average artist popularity fluctuates more. From the 1960s onward, there is less variation and therefore less change from year to year. This could suggest that artist popularity was less stable during earlier periods. The smoothed trend line indicates that artist popularity generally increases from the early years into the mid-1900s, followed by a slight decline and more stability later. One disparity is the difference in stability between earlier and more recent years: the earlier years show more volatility in average artist popularity, while the recent years exhibit more consistency.
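The yearly averages and a smoothed trend can be sketched as below. This Python example is illustrative only: the smoothing method used for the chart is not stated in this report, so a simple centered moving average is swapped in as a stand-in, and the input format of (year, popularity) pairs is an assumption.

```python
from collections import defaultdict


def yearly_means(rows):
    """Average the value for each year; rows are (year, value) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [running sum, count]
    for year, value in rows:
        totals[year][0] += value
        totals[year][1] += 1
    return {year: s / n for year, (s, n) in sorted(totals.items())}


def moving_average(values, window=3):
    """Centered moving average; the window shrinks at both ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical (year, popularity) pairs.
rows = [(1990, 0.2), (1990, 0.4), (1991, 0.5), (1992, 0.3)]
means = yearly_means(rows)
print(means)
print(moving_average(list(means.values())))
```

A year with only a handful of songs produces a noisier average than a year with many, which is worth keeping in mind when comparing volatility between early and recent periods.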

The overall trend in this line chart supports the finding that artist popularity has become more stable over time, possibly due to better technology and more musical resources.

Correlation Heatmap

This correlation heatmap displays the correlations between key quantitative variables in the dataset. The heatmap allows for quick comparisons of the relationships between the chosen variables. The color scale on the right represents the strength of the correlation, with darker shades indicating stronger positive relationships and lighter shades indicating weaker relationships.

The strongest relationship is between artist familiarity and artist popularity, which have a correlation of 0.73. This is a strong positive relationship, which means that more familiar artists also tend to be more popular. In contrast, song duration and release year each have a correlation of zero with artist popularity, which indicates that there is little to no linear relationship between these variables and artist popularity.
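For reference, the values in a correlation heatmap are typically Pearson correlation coefficients, assuming that is the measure used here. The Python sketch below computes one from scratch on made-up numbers (not values from the dataset) and also shows why a correlation of zero only rules out a linear relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up illustrative data.
print(pearson([1, 2, 3], [2, 4, 6]))         # perfectly linear, increasing
print(pearson([1, 2, 3, 4], [1, -1, -1, 1])) # U-shaped: related, but not linearly
```

The second call returns zero even though the values clearly follow a pattern, which is why a zero in the heatmap should be read as "no linear relationship" rather than "no relationship".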

A key takeaway from this visualization is the disparity in correlation strength across variables. Although there is a strong positive association between artist familiarity and popularity, other variables such as song duration and release year exhibit much weaker relationships. This emphasizes that not all factors contribute equally to artist popularity. Overall, the correlation heatmap contributes to the main question of the project by further identifying artist familiarity as the variable most strongly associated with artist popularity.

Distribution of Artist Popularity Over Time

The animation shows how the distribution of artist popularity changes over time using a series of histograms. An animated histogram was chosen to illustrate this because it allows for the visualization of how the distribution of one quantitative variable evolves over the course of many years. Each frame represents a different year, showing how artist popularity is distributed during that time period.
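Conceptually, each animation frame is a histogram computed from one year's songs. The Python sketch below shows that grouping and binning step; the bin count, the 0 to 1 range, and the (year, popularity) input format are assumptions for the example rather than the settings used in the animation.

```python
def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Count values falling into equal-width bins over [lo, hi]."""
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:
            # Scale to a bin index; the top edge goes into the last bin.
            index = min(int((v - lo) / (hi - lo) * bins), bins - 1)
            counts[index] += 1
    return counts


def frames_by_year(rows, bins=10):
    """Build one histogram per year from (year, popularity) pairs."""
    by_year = {}
    for year, popularity in rows:
        by_year.setdefault(year, []).append(popularity)
    return {year: histogram(vals, bins) for year, vals in sorted(by_year.items())}


# Hypothetical (year, popularity) pairs.
rows = [(1999, 0.05), (1999, 0.12), (2000, 0.95), (2000, 1.0), (2000, 0.05)]
for year, counts in frames_by_year(rows).items():
    print(year, counts)
```

Keeping the same bins and range for every year is what makes the frames comparable as the animation advances.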

From the animation, it is clear that artist popularity is heavily concentrated at lower values up to the 1990s, indicating that a large proportion of artists have relatively low popularity. From 2000 onward, however, there is slightly more spread and a greater presence of higher popularity values. This suggests that while most artists remain less popular, an increasing number of artists reach higher popularity levels in recent years.

One key disparity in this animation is the consistent imbalance between the low-popularity and high-popularity artists. Throughout the entire animated histogram, the majority of artists remain at lower popularity values, while only a smaller portion achieves higher popularity. This suggests that popularity is not evenly distributed and remains concentrated among a limited number of artists.

Overall, this animation supports the main question of the project by illustrating how the distribution of artist popularity has evolved over time, and it also highlights disparities in how popularity is distributed among artists.

Interactive Time Series and Distribution Analysis of Music Data

This Shiny application explores how Artist Popularity, Artist Familiarity, and Song Duration change over time. The top plot displays the average value of the selected variable grouped by year or decade, with the option of adding a smoothed trend line to highlight overall patterns. The bottom visualization is a histogram that shows the distribution of the selected variable across the selected years or decades.

Explore the Shiny app here: Interactive Time Series and Distribution Analysis of Music Data

This application serves a different purpose than the other visualizations because it combines a time series graph and a distribution plot in one interface, so the user can examine trends and value distributions simultaneously. While earlier plots explored these ideas separately, this application unifies them and lets the user go deeper through interactivity: switching between variables, changing time groupings, and filtering specific decades. In particular, the ability to switch between custom time periods sets this application apart, letting the user view the data through a different lens and draw more targeted insights that may not be accessible through the other visualizations.

Decade Comparison Across Variables

This Shiny dashboard lets the user compare musical characteristics (Artist Popularity, Artist Familiarity, and Song Duration) across two decades of their choice. The dashboard updates in real time to show the distribution of the chosen variable (via density plot or boxplot) as well as the top-performing artists within the two decades. The bar chart can be customized by setting the number of top artists shown and a minimum number of songs per artist. At the bottom, a summary statistics table provides the mean, median, standard deviation, minimum value, and maximum value of the chosen variable, as well as the number of songs included in each decade after filtering.
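The summary table and the top-artist ranking can be sketched in a few lines. This Python example is a hedged illustration rather than the dashboard's code: the function names, the (artist, value) input format, and the use of the sample standard deviation are all assumptions.

```python
import statistics
from collections import defaultdict


def summarize(values):
    """Summary statistics for one variable within one decade."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Assumption: sample standard deviation; undefined for a single value.
        "sd": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def top_artists(rows, top_n=5, min_songs=1):
    """Rank artists by average value, keeping those with at least min_songs songs."""
    by_artist = defaultdict(list)
    for artist, value in rows:
        by_artist[artist].append(value)
    averages = [
        (artist, statistics.mean(vals))
        for artist, vals in by_artist.items()
        if len(vals) >= min_songs
    ]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical (artist, song duration in seconds) pairs for one decade.
rows = [("A", 200.0), ("A", 300.0), ("B", 400.0), ("C", 100.0)]
print(summarize([value for _, value in rows]))
print(top_artists(rows, top_n=2, min_songs=1))
print(top_artists(rows, top_n=2, min_songs=2))
```

Raising the minimum song count removes artists whose average rests on a single track, which can change who appears at the top of the ranking.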

Explore the Shiny app here: Decade Comparison Across Variables

With the default decades selected, the density plot compares artist popularity for the 1920s and the 2010s, with the number of top artists set to 5 and the minimum songs per artist set to 1, and reveals that modern artists tend to have higher and more varied popularity scores. Using the same filtering, the comparison of Artist Familiarity between the 1920s and 2010s shows that modern artists tend to be more widely recognized by listeners: the distribution for the 2010s is shifted further to the right and appears more spread out. For song duration, the distribution plot reveals that songs in the 2010s tend to be longer than those in the 1920s. The top artist bar chart also supports this, showing that the leading artist in the 2010s has a significantly higher average song duration than the top artist in the 1920s.

This application differentiates itself from the other visualizations by being multi-layered and allowing for the comparison of different time periods between each variable. Additionally, it incorporates important statistical analyses (distributional analysis, ranking, and summary statistics) in one interface, which gives a more comprehensive understanding of the data. Ultimately, this application enables deeper exploration by letting the user control what they want to see or investigate in their analyses of the data.

Distribution of Songs Across Artist Familiarity and Popularity

This Tableau dashboard presents a density-based view of the relationship between Artist Familiarity and Artist Popularity, highlighting where songs are most concentrated within the dataset. Unlike earlier visualizations that emphasize individual relationships or trends over time, this plot shows the overall structure of the data by highlighting areas with the highest density of observations.
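One simple way to think about a density view is as a count of songs per cell on a grid of familiarity and popularity. The Python sketch below uses plain two-dimensional binning as a stand-in; Tableau's density marks use their own rendering, so this is an assumption-laden illustration of the idea, and the grid size and sample points are made up.

```python
def density_grid(points, bins=4):
    """Count (familiarity, popularity) points per cell on a bins x bins grid over [0, 1]."""
    grid = [[0] * bins for _ in range(bins)]
    for familiarity, popularity in points:
        if 0 <= familiarity <= 1 and 0 <= popularity <= 1:
            col = min(int(familiarity * bins), bins - 1)  # x axis: familiarity
            row = min(int(popularity * bins), bins - 1)   # y axis: popularity
            grid[row][col] += 1
    return grid


def densest_cell(grid):
    """Return (row, col, count) for the first cell with the highest count."""
    best = (0, 0, grid[0][0])
    for r, row in enumerate(grid):
        for c, count in enumerate(row):
            if count > best[2]:
                best = (r, c, count)
    return best


# Made-up illustrative points.
points = [(0.55, 0.4), (0.6, 0.45), (0.52, 0.3), (0.9, 0.9), (0.1, 0.0)]
grid = density_grid(points)
print(densest_cell(grid))
```

The cell with the highest count corresponds to the darkest region of a density plot, while sparse cells at high popularity values reflect how few songs reach that range.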

Explore the Tableau Dashboard here: Distribution of Songs Across Artist Familiarity and Popularity

This dashboard reveals that the densest region of the plot lies at moderate levels of both familiarity and popularity, indicating that most songs fall within an average range. In contrast, relatively few songs appear at the higher levels of popularity, demonstrating a disparity in the distribution of successful artists. This suggests that while many artists achieve moderate recognition, only a small percentage of them reach the highest levels of popularity. This aligns with earlier findings from the scatterplot and correlation analysis, which showed a positive relationship between familiarity and popularity but also indicated that this relationship does not necessarily extend to the highest values, or outliers. The elliptical shape of the density pattern also supports the positive association between familiarity and popularity. However, the spread of the density plot indicates variability, which suggests that other influential factors are not captured in the dataset. Another key insight from this dashboard is that regardless of the time period, the overall distribution of popularity remains uneven, and the rarity of extreme popularity persists as a consistent pattern.

This visualization differs from the other visuals in the project by focusing on data density rather than individual relationships or trends. While previous graphs explored the interaction of variables over time, this plot provides a broader perspective on the overall structure of the dataset. As the final visualization of the project, it reinforces the main findings by illustrating that popularity is not evenly distributed and is concentrated among a limited number of artists, a pattern shaped by the explored factors and possibly by other variables not captured here.

Conclusion

The main purpose of this project was to explore which factors shape artist and song popularity and how these factors have changed over time. Through a combination of different plots, Shiny applications, a Tableau dashboard and an animation, several key patterns were identified to help explain the structure of popularity within the music industry.

One consistent finding across all analyses is the strong relationship between Artist Familiarity and Artist Popularity. Both the scatterplot and correlation heatmap demonstrated a clear positive association, suggesting that more well-known artists tend to achieve higher levels of popularity. However, the relationship also showed variability, which suggests that other external factors may be involved.

The visualizations dedicated to examining trends over time revealed that both Artist Popularity and song characteristics have evolved across decades. The boxplots suggested that Artist Popularity and song duration have changed substantially and vary more widely within recent decades, while the line chart showed that average Artist Popularity has become more stable from year to year. Additionally, modern songs tend to be longer. These changes suggest that the modern music landscape allows for more flexibility in terms of song characteristics and popularity levels. The Shiny applications demonstrated that while general patterns exist, the relationship between variables can vary depending on the selected time periods, suggesting that popularity is influenced by a combination of factors. Finally, the Tableau dashboard revealed that most songs are concentrated within moderate levels of familiarity and popularity, showing that success is not evenly distributed across artists and songs.

Overall, this analysis shows that while factors such as Artist Familiarity play a significant role in determining Artist Popularity, the broader structure of the music industry depends on complex and competitive dynamics. Popularity is influenced by many interacting factors and remains unevenly distributed across the music industry. Moreover, these findings support the idea that the music industry has evolved over time and continues to face challenges associated with achieving success.