With the rise of social media, many people experience FOMO, the fear of missing out (69% of people in the U.S. report having experienced it), which leads them to choose music based on its popularity. Although the creative process is inherently subjective, is there a “formula” for a hit song? In this project, we aim to provide a comprehensive analytical report that not only helps readers understand “market trends” but also highlights “specific opportunities” within the existing data set. With this report, creators can align their sound with market trends while maintaining their unique voice, allowing them to make more informed, data-driven decisions on what to promote.
While our data set spans from 1960 to 2020, the rise of social media, particularly since 2010, has significantly reshaped listening behavior through phenomena such as FOMO. By comparing song trends before and after the social media boom, we aim to examine how popularity-driven consumption has influenced musical characteristics, offering insights into how creators can adapt in a socially amplified market.
We obtained the Spotify data set from GitHub. We will identify the variables that relate to our problem statement, then manipulate the data and build visualizations with ggplot (histograms and other plots) to analyze the information we are seeking. These graphs will show the most popular music artists, as well as what the audience is looking for in music, using danceability, liveness, and energy as our dimensions.
Regression analysis allows us to explore the relationship between each variable and a song’s popularity, helping identify which features have the strongest impact.
Artists and producers can use these findings to make strategic choices that increase the reach of their music, while talent scouts can identify artists with high commercial potential. Even music lovers can discover hidden gems that have long been overlooked in their playlists.
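The following packages are needed for the code in this report; more “packages” can be added in the future:
library(dplyr)    # %>%, mutate(), group_by(), summarise(), slice_head()
library(ggplot2)  # ggplot() charts
# lubridate and kableExtra are called with :: below, so they only need to be installed
Import the data set into RStudio: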
spotify <- read.csv("C:/Users/samc8/OneDrive - Xavier University/Data Wrangling/Week 4/spotify_songs (2).csv")
View structure of the data set:
str(spotify)
View summary statistics of the data set:
summary(spotify)
We removed the ID columns and the playlist subgenre to streamline the data set:
spotify$playlist_id <- NULL
spotify$track_album_id <- NULL
spotify$track_id <- NULL
spotify$playlist_subgenre <- NULL
# Rename the 19 remaining columns. speech_ratio and positivity are our names
# for the features referred to later as speechiness and valence.
colnames(spotify) <- c("track_name", "track_artist", "track_popularity", "track_album_name",
                       "track_album_release_date", "playlist_name", "playlist_genre", "danceability",
                       "energy", "key", "loudness", "mode", "speech_ratio",
                       "acousticness", "instrumentalness", "liveness", "positivity",
                       "tempo", "duration_ms")
colSums(is.na(spotify))
spotify_clean <- na.omit(spotify)
We wanted to convert the release date of the track album into a proper date format.
spotify_clean <- spotify_clean %>%
mutate(track_album_release_date = as.Date(track_album_release_date))
str(spotify_clean$track_album_release_date)
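One thing to watch for in this step: if any release dates are stored as a year only (for example "2012") or as year and month, `as.Date()` will not parse them as full dates. The sketch below shows one possible way to pad such values first. It assumes partial dates of exactly these two shapes; `parse_release_date` is a hypothetical helper name, and the example values are made up.

```r
# Hypothetical helper: pad partial dates ("2012", "1998-07") to a full
# "YYYY-MM-DD" string before converting, so they are not lost as NA.
parse_release_date <- function(x) {
  x <- trimws(as.character(x))
  year_only  <- grepl("^[0-9]{4}$", x)
  year_month <- grepl("^[0-9]{4}-[0-9]{2}$", x)
  x[year_only]  <- paste0(x[year_only], "-01-01")  # assume January 1st
  x[year_month] <- paste0(x[year_month], "-01")    # assume first of the month
  as.Date(x, format = "%Y-%m-%d")
}

# Made-up example values, not taken from the data set
example_dates <- c("2019-06-14", "2012", "1998-07")
parsed <- parse_release_date(example_dates)
print(parsed)
```

Padding to the first day of the year or month is a simplification; it is harmless when only `release_year` is used later, but it would distort any analysis at the day or month level.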
spotify_clean <- spotify_clean %>%
  mutate(
    release_year = lubridate::year(track_album_release_date),
    duration_min = duration_ms / 60000
  )
boxplot(spotify_clean$duration_min,
        main = "Boxplot of Song Duration (min)",
        ylab = "Duration (minutes)")
To avoid excluding valid songs with unusually long or short durations, we apply an asymmetric threshold: 4 × IQR above the third quartile and 2 × IQR below the first quartile. This approach broadens the acceptable range while still filtering extreme values, helping preserve meaningful variation in the data set without misclassifying legitimate entries as outliers.
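# Quartiles of song duration in minutes
Q1 <- quantile(spotify_clean$duration_min, 0.25, na.rm = TRUE)
Q3 <- quantile(spotify_clean$duration_min, 0.75, na.rm = TRUE)
# Named iqr_duration so it is not confused with the built-in IQR() function
iqr_duration <- Q3 - Q1
upper_bound <- Q3 + 4 * iqr_duration
lower_bound <- Q1 - 2 * iqr_duration
# Keep only songs inside the asymmetric fences
spotify_clean_2 <- spotify_clean[
  spotify_clean$duration_min >= lower_bound &
    spotify_clean$duration_min <= upper_bound, ]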
boxplot(spotify_clean_2$duration_min,
        main = "Boxplot of Song Duration (min, no outliers)",
        ylab = "Duration (minutes)")
length(spotify_clean_2$duration_min)
After data cleaning, 32 songs were identified as outliers and removed from the data set.
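The asymmetric rule and the count of removed songs can be double-checked with a small helper. The sketch below is illustrative only: `asymmetric_iqr_bounds` is a hypothetical function name, and the durations are toy values rather than rows from the Spotify data set.

```r
# Hypothetical helper: asymmetric IQR fences.
# Defaults mirror the 2 x IQR (lower) and 4 x IQR (upper) rule described above.
asymmetric_iqr_bounds <- function(x, lower_mult = 2, upper_mult = 4) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - lower_mult * spread,
    upper = q[2] + upper_mult * spread)
}

# Toy durations in minutes (made up for illustration)
toy_duration <- c(2.9, 3.1, 3.3, 3.4, 3.6, 3.8, 4.0, 4.2, 0.1, 15.0)

bounds <- asymmetric_iqr_bounds(toy_duration)
is_outlier <- toy_duration < bounds["lower"] | toy_duration > bounds["upper"]

# Number of songs that would be dropped by the rule
n_removed <- sum(is_outlier)
print(n_removed)
```

On the real data, the same count can be obtained by comparing the number of rows before and after filtering, for example `nrow(spotify_clean) - nrow(spotify_clean_2)`.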
We grouped tracks by artist and calculated each artist’s total popularity, “average popularity”, and song count. We then ranked the artists by total popularity and extracted the top 10 for focused analysis.
artist_popularity <- spotify_clean %>%
  group_by(track_artist) %>%
  summarise(total_popularity = sum(track_popularity, na.rm = TRUE),
            avg_popularity = mean(track_popularity, na.rm = TRUE),
            song_count = n()) %>%
  arrange(desc(total_popularity))
Select the top 10 artists by slicing the first 10 rows of the ranked table:
top10 <- artist_popularity %>% slice_head(n = 10)
Visualize the total popularity of the top 10 artists using “ggplot”:
ggplot(top10, aes(x = reorder(track_artist, total_popularity),
                  y = total_popularity)) +
  geom_col(fill = "#1DB954") +
  coord_flip() +
  labs(title = "Top 10 Artists by Total Popularity",
       x = NULL, y = "Total Popularity") +
  theme_minimal()
At this stage, categorical variables such as genre have not been converted into dummy variables, as the current analysis does not require it. This transformation may be considered in future modeling steps if needed.
kableExtra::scroll_box(
  kableExtra::kable_paper(
    kableExtra::kbl(head(spotify_clean_2, 10))
  ),
  width = "700px",
  height = "300px"
)
| track_name | track_artist | track_popularity | track_album_name | track_album_release_date | playlist_name | playlist_genre | danceability | energy | key | loudness | mode | speech_ratio | acousticness | instrumentalness | liveness | positivity | tempo | duration_ms | release_year | duration_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 | 2019 | 3.245900 |
| Memories - Dillon Francis Remix | Maroon 5 | 67 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 | 2019 | 2.710000 |
| All the Time - Don Diablo Remix | Zara Larsson | 70 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 | 2019 | 2.943600 |
| Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 | 2019 | 2.818217 |
| Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 | 2019 | 3.150867 |
| Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 | 2019 | 2.717483 |
| Never Really Over - R3HAB Remix | Katy Perry | 62 | Never Really Over (R3HAB Remix) | 2019-07-26 | Pop Remix | pop | 0.449 | 0.856 | 5 | -4.788 | 0 | 0.0623 | 0.1870 | 0.00e+00 | 0.1760 | 0.152 | 112.648 | 187675 | 2019 | 3.127917 |
| Post Malone (feat. RANI) - GATTÜSO Remix | Sam Feldt | 69 | Post Malone (feat. RANI) [GATTÜSO Remix] | 2019-08-29 | Pop Remix | pop | 0.542 | 0.903 | 4 | -2.419 | 0 | 0.0434 | 0.0335 | 4.80e-06 | 0.1110 | 0.367 | 127.936 | 207619 | 2019 | 3.460317 |
| Tough Love - Tiësto Remix / Radio Edit | Avicii | 68 | Tough Love (Tiësto Remix) | 2019-06-14 | Pop Remix | pop | 0.594 | 0.935 | 8 | -3.562 | 1 | 0.0565 | 0.0249 | 4.00e-06 | 0.6370 | 0.366 | 127.015 | 193187 | 2019 | 3.219783 |
| If I Can’t Have You - Gryffin Remix | Shawn Mendes | 67 | If I Can’t Have You (Gryffin Remix) | 2019-06-20 | Pop Remix | pop | 0.642 | 0.818 | 2 | -4.552 | 1 | 0.0320 | 0.0567 | 0.00e+00 | 0.0919 | 0.590 | 124.957 | 253040 | 2019 | 4.217333 |
summary(spotify_clean_2[,10:19])
table(spotify_clean_2$release_year)
Based on the popularity of the top artists, does an artist’s popularity affect the popularity of their music?
Do audio features (danceability, energy, valence, tempo, loudness, etc.) significantly relate to popularity?
Which audio features (danceability, energy, valence, tempo, loudness, etc.) have the strongest relationship with popularity?
Do instrumental or acoustic songs perform worse than vocal or electronic songs?
Does speechiness (rap-like lyrics) correlate positively or negatively with popularity?
Does release timing (year) influence popularity when comparing two time periods: 2010-2020 versus the years before 2010?
We plan to utilize various tables and plots to visualize our data.
We would like to use a “Bar Chart” to rank the top 10 artists by total popularity. We can also visualize the top 10 least popular artists in the same fashion.
We will also use other charts: a “Correlation Plot” to see how attributes in the data set relate to one another, and “Boxplots” to visualize outliers and to compare two factors with each other. Finally, we plan to use “Heat Maps”, where color encodes each value so that attributes can be easily distinguished from one another.
We want to learn how to use correlation plots and heat maps to better visualize relationships between musical features and song popularity.
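As a starting point, a correlation matrix and a basic heat map can be produced with base R alone. The sketch below uses simulated toy data, not the Spotify data set; the column names match ours, but the values and the built-in relationship between energy and loudness are assumptions made only for illustration.

```r
set.seed(42)  # reproducible toy data, not the Spotify data set
n <- 200
energy <- runif(n)

toy <- data.frame(
  energy = energy,
  loudness = -12 + 8 * energy + rnorm(n, sd = 1),  # built to correlate with energy
  danceability = runif(n),
  track_popularity = round(runif(n, 0, 100))
)

# Pairwise Pearson correlations between all numeric columns
cor_mat <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Base R heat map of the correlation matrix, without reordering or rescaling
heatmap(cor_mat, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        margins = c(10, 10))
```

On the real data, the same calls could be applied to the numeric columns of `spotify_clean_2`. A ggplot-based heat map would fit the rest of our charts better; base R is used here only to keep the example self-contained.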
Handle year and time-related variables effectively, deciding whether to group songs by decade, segment by pre/post social media era, or account for delayed popularity trends.
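Two of these groupings can be sketched in a few lines. The example assumes the year 2010 itself belongs to the later period, which is our reading of the 2010-2020 window; the release years are toy values.

```r
# Toy release years (illustrative only)
release_year <- c(1965, 1987, 1999, 2009, 2010, 2015, 2020)

# Decade label, e.g. 1987 -> "1980s"
decade <- paste0(floor(release_year / 10) * 10, "s")

# Era flag; assumption: 2010 counts as part of the social media era
era <- factor(ifelse(release_year >= 2010, "2010-2020", "pre-2010"),
              levels = c("pre-2010", "2010-2020"))

print(data.frame(release_year, decade, era))
print(table(era))
```

Storing `era` as a factor with the earlier period as the first level means a later regression would report the 2010-2020 effect relative to the pre-2010 baseline.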
Apply clustering methods (e.g., K-means or hierarchical clustering) to group songs based on musical characteristics.
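A minimal K-means sketch, assuming two artificial groups of songs built so that the clusters are easy to recover. The data are simulated, and the choice of two clusters is an assumption for the example rather than a finding.

```r
set.seed(1)  # reproducible toy data

# Two artificial groups: calmer acoustic songs vs energetic electronic songs
toy <- data.frame(
  energy = c(rnorm(50, 0.3, 0.05), rnorm(50, 0.8, 0.05)),
  acousticness = c(rnorm(50, 0.7, 0.05), rnorm(50, 0.1, 0.05)),
  tempo = c(rnorm(50, 90, 5), rnorm(50, 125, 5))
)

# Standardize first so tempo (large scale) does not dominate the distances
scaled <- scale(toy)

# nstart = 25 reruns the algorithm from several random starts
km <- kmeans(scaled, centers = 2, nstart = 25)

print(table(km$cluster))
print(km$centers)
```

With real songs the groups will overlap far more, so the number of clusters would need to be chosen with a diagnostic such as an elbow plot of the within-cluster sum of squares.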
Implement Principal Component Analysis (PCA) to reduce dimensionality and visualize the underlying structure of our data set.
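A minimal PCA sketch using `prcomp()` from base R. The toy data are simulated so that energy and loudness share a common latent factor; this structure is an assumption made for illustration, not a property we have verified in the Spotify data.

```r
set.seed(7)  # reproducible toy data
n <- 150
latent <- rnorm(n)  # shared factor driving energy and loudness

toy <- data.frame(
  energy = latent + rnorm(n, sd = 0.3),
  loudness = latent + rnorm(n, sd = 0.3),
  danceability = rnorm(n),
  tempo = rnorm(n)
)

# Center and scale so every feature contributes on the same footing
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 2))

# Scores on the first two components, ready for a scatter plot
print(head(pca$x[, 1:2]))
```

The loadings in `pca$rotation` show which original features drive each component, which is what we would inspect to interpret the reduced dimensions.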
We are unsure how to measure artist popularity without bias from song count. We need to learn how to design a fair composite or normalized metric.
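One candidate we could explore is a shrunken average, which pulls each artist's mean popularity toward the overall mean, with less pull for artists who have more songs. This is only one possible normalization, not a settled choice; the weight `m` is a hypothetical tuning constant, and the artists and scores below are made up.

```r
# Made-up example: artist B has a single very popular song
toy <- data.frame(
  track_artist = c("A", "A", "A", "A", "B", "C", "C"),
  track_popularity = c(70, 72, 68, 70, 95, 40, 60)
)

global_mean <- mean(toy$track_popularity)
m <- 3  # hypothetical prior weight: acts like m "average" songs added per artist

avg <- tapply(toy$track_popularity, toy$track_artist, mean)
cnt <- tapply(toy$track_popularity, toy$track_artist, length)

artist_stats <- data.frame(
  track_artist = names(avg),
  avg_popularity = as.numeric(avg),
  song_count = as.integer(cnt)
)

# Shrunken average: (n * artist mean + m * global mean) / (n + m)
artist_stats$shrunk_popularity <- with(
  artist_stats,
  (song_count * avg_popularity + m * global_mean) / (song_count + m)
)

print(artist_stats[order(-artist_stats$shrunk_popularity), ])
```

Unlike total popularity, this score does not grow simply because an artist has many tracks, and unlike a plain average it does not let a single hit dominate. A larger `m` shrinks small catalogs more strongly.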
We plan to use linear regression to identify the key factors that influence a song’s popularity, both across different time periods and specifically within the top 10 most popular artists. This focus allows us to uncover general trends in musical success while also analyzing the unique characteristics and strategies of leading artists.
Linear regression provides interpretable coefficients, along with statistical tests and confidence intervals. We can also incorporate other ML algorithms such as K-means, linear vectors, and decision trees to predict and influence consumers’ taste in music.
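A minimal sketch of the kind of model we have in mind, fitted with `lm()` on simulated data. The outcome is generated with a known danceability effect so the output can be sanity-checked; the effect size, the noise level, and the choice of predictors are assumptions for illustration only.

```r
set.seed(123)  # reproducible toy data, not the Spotify data set
n <- 300

toy <- data.frame(
  danceability = runif(n),
  energy = runif(n),
  release_year = sample(1960:2020, n, replace = TRUE)
)

# Era indicator; assumption: 2010 belongs to the later period
toy$era <- factor(ifelse(toy$release_year >= 2010, "2010-2020", "pre-2010"),
                  levels = c("pre-2010", "2010-2020"))

# Simulated outcome: only danceability has a true effect (slope of 20)
toy$track_popularity <- 40 + 20 * toy$danceability + rnorm(n, sd = 10)

fit <- lm(track_popularity ~ danceability + energy + era, data = toy)

# Coefficients with standard errors, t statistics, and p-values
print(summary(fit)$coefficients)

# 95% confidence intervals for each coefficient
print(confint(fit, level = 0.95))
```

On the real data, the same formula could be fitted separately for each era, or restricted to songs by the top 10 artists, to compare how the coefficients change.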