Introduction

1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?

Along with the rise of social media, people increasingly experience FOMO, the fear of missing out (69% of U.S. people report having experienced it), which leads them to listen to music based on its popularity. Although the creative process is inherently subjective, is there a “formula” for a hit song? In this project, we aim to provide an analytical report that not only helps readers understand market trends but also highlights specific opportunities within the existing data set, so that creators can align their sound with those trends while maintaining their unique voice and make more informed, data-driven decisions about what to promote.

While our data set spans from 1960 to 2020, the rise of social media, particularly since 2010, has significantly reshaped listening behavior through phenomena such as FOMO. By comparing song trends before and after the social media boom, we aim to examine how popularity-driven consumption has influenced musical characteristics, offering insights into how creators can adapt in a socially amplified market.

1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed).

We will use a Spotify data set obtained from GitHub. We will identify the variables most relevant to our problem statement, manipulate the data accordingly, and build visualizations (ggplot histograms and other plots) to analyze the information we are seeking. These graphs will help us determine the most popular artists and what audiences look for in music, using danceability, liveness, and energy as our dimensions.

1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.

Regression analysis allows us to explore the relationship between each variable and a song’s popularity, helping identify which features have the strongest impact. We plan to examine the following relationships (a minimal regression sketch follows the list):

  • Audio features ↔︎ popularity
  • Genre ↔︎ popularity
  • Artists ↔︎ popularity
  • Release timing (seasonal effect) ↔︎ popularity
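As an illustration of this technique, a minimal sketch of the audio-features model is shown below; the variable names (spotify_clean_2, speech_ratio, positivity) refer to the cleaned data set constructed in the Data Preparation section.

# A minimal sketch of the planned regression: popularity modeled as a linear
# function of the audio features.
popularity_model <- lm(track_popularity ~ danceability + energy + positivity +
                         tempo + loudness + speech_ratio + acousticness,
                       data = spotify_clean_2)
summary(popularity_model)  # coefficient estimates, tests, and overall fit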

1.4 Explain how your analysis will help the consumer of your analysis.

Artists and producers can make strategic choices to increase the reach of their music, talent scouts can identify artists with high commercial potential, and even music lovers can discover hidden gems that have long been overlooked in their playlists.

Packages Required

2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.

2.2 Messages and warnings resulting from loading the package are suppressed.

2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package).

The following packages are loaded up front (a loading sketch appears after this list); more packages can be added as the analysis evolves:

  • library(tidyverse) - A comprehensive toolkit for data science workflows, including data import, cleaning, transformation, visualization, and integration.
  • library(dplyr) - Data manipulation.
  • library(tidyr) - Reshaping and organizing data.
  • library(ggplot2) - Creating flexible, publication-quality plots.
  • library(lubridate) - Working with dates and times.
  • library(knitr) - Dynamic report generation.
  • library(kableExtra) - Enhanced table styling.
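A sketch of the loading chunk is shown below; in the R Markdown source, the chunk options message = FALSE and warning = FALSE (or the wrapper used here) keep the loading output clean.

# Load every package required to replicate the analysis, suppressing startup messages.
suppressPackageStartupMessages({
  library(tidyverse)   # data import, wrangling, and visualization
  library(dplyr)       # data manipulation verbs
  library(tidyr)       # reshaping and organizing data
  library(ggplot2)     # plotting
  library(lubridate)   # dates and times
  library(knitr)       # dynamic report generation
  library(kableExtra)  # enhanced table styling
})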

Data Preparation

3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.

  • We will use the Spotify data set from the course material, named “spotify_songs.csv”, which comes from the TidyTuesday project (rfordatascience/tidytuesday on GitHub, 2020-01-21 release).

3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.).

  • Origin: Part of the TidyTuesday weekly data project for practicing R skills.
  • Purpose: Designed to help users learn data wrangling and visualization using ggplot2, dplyr, tidyr, and other tidyverse tools.
  • Community: Created by members of the R4DS Online Learning Community, inspired by the “R for Data Science” textbook.
  • Source: Data collected from Spotify via the spotifyr package.
  • Date Created: January 21, 2020.
  • Authors: Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff.
  • Size: 32,833 records and 23 variables.
  • Content: Includes track metadata (e.g., artist, album, genre) and musical features (e.g., danceability, energy, valence).
  • Missing Values: Recorded as NA; no imputation was applied.
  • Use Case: Ideal for exploratory analysis, genre comparison, and building visualizations.

3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.

3.3.1 Import the data set & view its structure in RStudio:

Import the data set:

spotify <- read.csv("C:/Users/samc8/OneDrive - Xavier University/Data Wrangling/Week 4/spotify_songs (2).csv")

View structure of the data set:

str(spotify) 

View summary statistics of the data set:

summary(spotify)

3.3.2 Removal of Unused Variables

  • playlist_subgenre: Contains 24 distinct sub-genres that introduce noise and fragmentation. Removed to avoid overfitting or misleading groupings in genre-based analysis.
  • playlist_id: Unique identifier with no analytical value. Dropped to reduce dimensionality and avoid clutter.
  • track_album_id / track_id: Technical identifiers used for database referencing, not meaningful for visualization or modeling.

These variables are removed to streamline the data set:

spotify$playlist_id <- NULL
spotify$track_album_id <- NULL
spotify$track_id <- NULL
spotify$playlist_subgenre <- NULL

3.3.3 Renaming Variables

colnames(spotify) <- c("track_name", "track_artist", "track_popularity", "track_album_name",
                       "track_album_release_date", "playlist_name", "playlist_genre", "danceability",
                       "energy", "key", "loudness", "mode", "speech_ratio",
                       "acousticness", "instrumentalness", "liveness", "positivity",
                       "tempo", "duration_ms")
  • Renamed “speechiness” to speech_ratio to clarify that the variable reflects the proportion of spoken content in a track.
  • Renamed “valence” to positivity to make the emotional tone more intuitive and easier to interpret.
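Since only two columns actually change name, an equivalent, position-independent alternative is dplyr::rename() (new_name = old_name), assuming the remaining column names are left as-is:

# Equivalent to the colnames() assignment above; rename only the two columns
# whose names change.
spotify <- spotify %>%
  rename(speech_ratio = speechiness,
         positivity   = valence)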

3.3.4 Issues with missing values

Find missing values:

colSums(is.na(spotify))

  • There are 15 missing values in total:
    • 5 missing values from track_name
    • 5 missing values from track_artist
    • 5 missing values from track_album_name

Remove the rows containing missing values:

spotify_clean <- na.omit(spotify)

3.3.5 - Conversion of date format.

We convert the album release date into a proper Date format:

spotify_clean <- spotify_clean %>%
  mutate(track_album_release_date = as.Date(track_album_release_date))
  
str(spotify_clean$track_album_release_date)

Alternative (other code to consider): if some release dates contain only a year or a year-month (a quirk that can occur in this data set), as.Date() returns NA for those entries. A hedged alternative is lubridate::ymd() with truncated = 2, which pads the missing month/day instead of dropping the date:
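spotify_clean <- spotify_clean %>%
  mutate(track_album_release_date = lubridate::ymd(track_album_release_date,
                                                   truncated = 2))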

3.3.6 - Creation of new dimensions.

We create new dimensions for the following variables:
  • release_year - Created the “release_year” column to enable year-based analysis of track trends, allowing for easier aggregation and comparison over time.
  • duration_min - Since song duration was originally stored in milliseconds, we created a new variable “duration_min” to express it in minutes, making comparisons and visualizations more intuitive.
spotify_clean <- spotify_clean %>%
  mutate(
    release_year = lubridate::year(track_album_release_date),
    duration_min = duration_ms / 60000
  )

3.3.7.1 - Finding Outliers.

boxplot(spotify_clean$duration_min,
        main = "Boxplot of Song Duration (min)",
        ylab = "Duration (minutes)")

3.3.7.2 - Removal of Outliers.

To avoid excluding valid songs with unusually long or short durations, we apply an asymmetric threshold: 4 × IQR above the third quartile and 2 × IQR below the first quartile. This approach broadens the acceptable range while still filtering extreme values, helping preserve meaningful variation in the data set without misclassifying legitimate entries as outliers.

Q1 <- quantile(spotify_clean$duration_min, 0.25, na.rm = TRUE)
Q3 <- quantile(spotify_clean$duration_min, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
upper_bound <- Q3 + 4 * IQR
lower_bound <- Q1 - 2 * IQR
spotify_clean_2 <- spotify_clean[
  spotify_clean$duration_min >= lower_bound & spotify_clean$duration_min <= upper_bound, ]
boxplot(spotify_clean_2$duration_min,
        main = "Boxplot of Song Duration (min, no outliers)",
        ylab = "Duration (minutes)")
length(spotify_clean_2$duration_min)

After data cleaning, 32 observations were removed in total (the rows containing missing values together with the duration outliers flagged above).

  • Original data set: 32,833 observations.
  • Cleaned data set: 32,801 observations.
(Figure: Boxplot of Song Duration)
Description:
  • The interquartile range (also known as the midspread) helps identify outliers by giving a clear picture of the data’s spread, making it easier to assess variability and understand the distribution of the whole data set.

3.3.8 - Grouping by artist and calculating popularity.

Grouped tracks by artist and calculated each artist’s total popularity (the sum of track popularity across their songs), along with average popularity and song count, then ranked artists by total popularity and extracted the top 10 for focused analysis.

artist_popularity <- spotify_clean %>%
  group_by(track_artist) %>%                    
  summarise(total_popularity = sum(track_popularity, na.rm = TRUE),
            avg_popularity = mean(track_popularity, na.rm = TRUE),
            song_count = n()) %>%                
  arrange(desc(total_popularity))

Select the top 10 artists from the ranked data set:

top10 <- artist_popularity %>% slice_head(n = 10)

Visualize the top 10 artists using ggplot:

ggplot(top10, aes(x = reorder(track_artist, total_popularity),
                  y = total_popularity)) +
  geom_col(fill = "#1DB954") +
  coord_flip() +
  labs(title = "Top 10 Artists by Total Popularity",
       x = NULL, y = "Total Popularity") +
  theme_minimal()
(Figure: Top 10 Artists by Total Popularity)

3.3.9 - Conversion into dummy variables

At this stage, categorical variables such as genre have not been converted into dummy variables, as the current analysis does not require it. This transformation may be considered in future modeling steps if needed; a possible approach is sketched below.
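A minimal sketch using base R’s model.matrix() (assuming playlist_genre is the categorical variable to encode):

# A sketch only, not yet applied: expand playlist_genre into 0/1 indicator
# columns and bind them back onto the cleaned data set.
genre_dummies <- model.matrix(~ playlist_genre - 1, data = spotify_clean_2)
spotify_model <- cbind(spotify_clean_2, genre_dummies)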

3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.

kableExtra::kbl(head(spotify_clean_2, 10)) %>%
  kableExtra::kable_paper() %>%
  kableExtra::scroll_box(width = "700px", height = "300px")
(Scrollable table output: the first 10 rows of spotify_clean_2, showing all 21 variables, from track_name, track_artist, and track_popularity through the derived release_year and duration_min columns.)

3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.

Summary statistics for the numeric variables in columns 10–19 (key through duration_ms) of the cleaned data set:

summary(spotify_clean_2[,10:19])

Number of tracks per release year:

table(spotify_clean_2$release_year)

Proposed - Exploratory Data Analysis (EDA)

4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

Based on the popularity of the top artists, we want to answer the following questions:

  1. Does the artist’s popularity affect the music’s popularity?

  2. Do audio features (danceability, energy, valence, tempo, loudness, etc.) significantly relate to popularity?

  3. Which audio features (danceability, energy, valence, tempo, loudness, etc.) have the strongest relationship with popularity?

  4. Do instrumental or acoustic songs perform worse than vocal or electronic songs?

  5. Does speechiness (rap-like lyrics) correlate positively or negatively with popularity?

  6. Does release timing (year) influence popularity? (Comparing two periods: 2010–2020 versus everything before 2010; a sketch of this split follows.)
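For question 6, one possible slice is sketched below (assuming spotify_clean_2 and release_year from the Data Preparation section):

# Tag each track as pre- or post-social-media era and compare mean popularity.
era_summary <- spotify_clean_2 %>%
  mutate(era = if_else(release_year >= 2010, "2010-2020", "Pre-2010")) %>%
  group_by(era) %>%
  summarise(mean_popularity = mean(track_popularity, na.rm = TRUE),
            track_count = n())
era_summary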

4.2 What types of plots and tables will help you to illustrate the findings to your questions?

We plan to utilize various tables and plots to visualize our data.

  • We would like to use a “Bar Chart” to rank the top 10 artists by total popularity. We can also visualize the top 10 least popular artists in the same fashion.

  • We will also utilize other charts, such as a “Correlation Plot” to identify related attributes within the data set and “Boxplots” to visualize outliers and compare two factors against each other. Finally, we plan on utilizing “Heat Maps,” where the color of each variable stands out relative to other attributes, making them easy to distinguish (a minimal heat-map sketch follows).
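As a starting point, here is a minimal correlation heat-map sketch using only the packages already loaded (assuming spotify_clean_2 from the cleaning steps above):

# Correlation matrix of popularity and the audio features, drawn as a heat map.
audio_features <- spotify_clean_2 %>%
  select(track_popularity, danceability, energy, loudness, speech_ratio,
         acousticness, instrumentalness, liveness, positivity, tempo)

cor_long <- as.data.frame(as.table(cor(audio_features, use = "complete.obs")))

ggplot(cor_long, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "#1DB954",
                       limits = c(-1, 1)) +
  labs(title = "Correlation Between Audio Features and Popularity",
       x = NULL, y = NULL, fill = "r") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))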

4.3 What do you not know how to do right now that you need to learn to answer your questions?

  • We want to learn how to use correlation plots and heat maps to better visualize relationships between musical features and song popularity.

  • Handle year and time-related variables effectively, whether to group songs by decade, segment by pre/post social media era, or account for delayed popularity trends.

  • Apply clustering methods (e.g., K-means or hierarchical clustering) to group songs based on musical characteristics.

  • Implement Principal Component Analysis (PCA) to reduce dimensionality and visualize the underlying structure of our data set (a starting sketch for the clustering and PCA items appears after this list).

  • We are unsure how to measure artist popularity without bias from song count. We need to learn how to design a fair composite or normalized metric.
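For reference, a minimal starting point for the clustering and PCA items above (a sketch under the assumption that spotify_clean_2 is available, not a final analysis):

# Scale the audio features, run PCA, then K-means on the first two components.
feature_matrix <- spotify_clean_2 %>%
  select(danceability, energy, loudness, speech_ratio, acousticness,
         instrumentalness, liveness, positivity, tempo) %>%
  scale()

pca_fit <- prcomp(feature_matrix)
summary(pca_fit)            # proportion of variance explained per component

set.seed(123)               # reproducible cluster assignments
km_fit <- kmeans(pca_fit$x[, 1:2], centers = 4, nstart = 25)  # centers = 4 is illustrative
table(km_fit$cluster)       # cluster sizes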

4.4 Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?

  • We plan to use linear regression to identify the key factors that influence a song’s popularity, both across different time periods and specifically within the top 10 most popular artists. This focus allows us to uncover general trends in musical success while also analyzing the unique characteristics and strategies of leading artists.

  • Linear regression provides interpretable coefficients along with statistical tests and confidence intervals. Additionally, we can incorporate other ML algorithms, such as K-means clustering, support vector machines, and decision trees, to better predict consumers’ taste in music.