The aim of this project is to develop a machine learning model that predicts whether the most streamed Spotify tracks of 2023 appear on the official Spotify charts. The data for this project is gathered exclusively from Kaggle; the dataset is detailed below along with its citation. Multiple machine learning techniques will be applied to create the most accurate prediction model for this task.
Understanding the factors that contribute to a song’s popularity on streaming platforms like Spotify is invaluable for artists, producers, and record labels. By predicting which tracks will be most streamed, stakeholders in the music industry can make more informed decisions regarding marketing strategies, playlist placements, and promotional efforts. Additionally, this model could help new artists and emerging talents understand the key attributes that contribute to a song’s success, potentially guiding their creative processes. Moreover, for music enthusiasts and data scientists, this project offers a fascinating glimpse into the intersection of music and data analytics, highlighting how machine learning can provide insights into cultural trends.
The data was collected exclusively from Kaggle and contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. This dataset offers a wealth of features beyond what is typically available in similar datasets, providing insights into each song’s attributes, popularity, and presence on various music platforms. The dataset includes information such as: track name, artist(s) name, release date, Spotify playlists and charts, streaming statistics, Apple Music presence, Deezer presence, Shazam charts, and various audio features. The dataset can be downloaded from https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023.
In order to fulfill our goal of achieving a better prediction, we have a detailed plan on how to deal with the data:
1. We will clean the data by removing observations with missing values, as well as any predictor variables that are unnecessary for our analysis.
2. Based on the remaining variables, we will draw graphs and plots to develop a better sense of how they relate to a song’s presence on the official Spotify charts.
3. We will then perform a training/test split on our data, make a recipe, and set folds for the 10-fold cross-validation we will implement. Logistic regression, random forest, gradient-boosted trees, LDA, QDA, and KNN models will all be fit to the training data once the setup is complete.
4. After we have results, we will select the best model based on its roc_auc score, and then fit it to our test data to make a prediction of “Yes” or “No”.
We will load our data first and use clean_names() to standardize every variable name.
## loading the data
spotify_2023_raw<-read_csv("/Users/zhaolei/Desktop/131 final project/spotify-2023.csv")
## cleaning predictor names
spotify_2023_cleaned <- clean_names(spotify_2023_raw)
To better visualize the dataset, I create the table listed below and also provide a codebook that enables readers to gain a better understanding of the data.
# Using DT to view our data in table
datatable(
head(spotify_2023_cleaned,50),
options = list(
pageLength = 10,
autoWidth = TRUE,
scrollX = TRUE
),
class = 'display compact'
)
track_name: Name of the song
artist_name: Name of the artist(s) of the song
artist_count: Number of artists contributing to the song
released_year: Year when the song was released
released_month: Month when the song was released
released_day: Day of the month when the song was released
in_spotify_playlists: Number of Spotify playlists the song is included in
in_spotify_charts: Presence and rank of the song on Spotify charts
streams: Total number of streams on Spotify
in_apple_playlists: Number of Apple Music playlists the song is included in
in_apple_charts: Presence and rank of the song on Apple Music charts
in_deezer_playlists: Number of Deezer playlists the song is included in
in_deezer_charts: Presence and rank of the song on Deezer charts
in_shazam_charts: Presence and rank of the song on Shazam charts
bpm: Beats per minute, a measure of song tempo
key: Key of the song
mode: Mode of the song (major or minor)
danceability_%: Percentage indicating how suitable the song is for dancing
valence_%: Positivity of the song’s musical content
energy_%: Perceived energy level of the song
acousticness_%: Amount of acoustic sound in the song
instrumentalness_%: Amount of instrumental content in the song
liveness_%: Presence of live performance elements
speechiness_%: Amount of spoken words in the song
### Visually Checking Missing Data
vis_miss(spotify_2023_cleaned)
spotify_2023_updated<-spotify_2023_cleaned%>%
select(-in_shazam_charts)
With 5 percent of the data missing in the variable in_shazam_charts, we do have to remove it. Since Shazam is not a major platform, we don’t have to worry too much about its influence on our results. Also, since we are not concerned with the value of key, we will remove it later when we do variable selection.
Data Transformation:
Converts in_spotify_charts to a factor with levels “No” and “Yes”.
Converts in_apple_charts and in_deezer_charts to binary factors (0 and 1).
Converts mode to a binary factor where “Major” is 0 and any other value is 1.
Scales streams down by dividing by 1,000,000, so it is measured in millions of streams.
Data Cleaning:
Removes rows with any missing values.
Variable Selection:
Selects a specific set of predictor variables for further analysis and creates a new data frame, spotify_2023_updated.
## creating dummy variable for predictor and response
spotify_2023_cleaned <- spotify_2023_cleaned %>%
mutate(
in_spotify_charts = as.factor(ifelse(in_spotify_charts == 0, "No", "Yes")),
in_apple_charts = as.factor(ifelse(in_apple_charts == 0, 0, 1)),
in_deezer_charts = as.factor(ifelse(in_deezer_charts == 0, 0, 1)),
mode = as.factor(ifelse(mode == "Major", 0, 1)),
streams = as.numeric(streams) / 1000000 # scale streams to millions
) %>%
filter(complete.cases(.))
## Variable Selection
predictor_of_spotifychart <- c("artist_count", "in_spotify_charts", "in_spotify_playlists",
                               "streams", "in_apple_playlists", "in_apple_charts",
                               "in_deezer_playlists", "in_deezer_charts", "bpm", "mode",
                               "danceability_percent", "valence_percent",
                               "energy_percent", "liveness_percent")
spotify_2023_updated <- spotify_2023_cleaned %>%
select(any_of(predictor_of_spotifychart))
# Using DT to view our data in table
datatable(
head(spotify_2023_updated,50),
options = list(
pageLength = 10,
autoWidth = TRUE,
scrollX = TRUE
),
class = 'display compact'
)
To gain a clearer understanding of the distribution of our response variable and predictors, we will:
1. Generate a plot to visualize the response variable.
2. Create a correlation matrix to identify potential correlations among the predictor variables.
3. Produce visualization plots to explore the impact of selected predictor variables on the response variable.
Before we delve deeper into building our models, it’s important to note that songs appearing on the Spotify charts somewhat outnumber those that do not. This imbalance is expected, since the most streamed songs have a very high likelihood of getting into the charts.
spotify_2023_updated%>%
ggplot(aes(x=in_spotify_charts))+
geom_bar()+
labs(x="whether in spotify charts", y="number of songs")
As we can see, the “Yes” group is slightly larger than the “No” group. Therefore, our assumption about these Spotify songs is reasonable.
In this part, we’ll make a correlation matrix of the numeric predictors and then visualize it as a heat map.
# making a correlation matrix and heat map of the predictors
spotify_2023_updated %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = 'full', diag = FALSE,
method = 'number')
In this heatmap, it’s not surprising that the upper-left corner has the strongest correlation values, since most of those playlist and chart variables are intuitively related to each other. Additionally, the correlations in the lower-right corner are also quite significant. However, I did not expect the correlations between streams and the audio-feature percentages below it to be so weak, especially since those features are crucial elements for making music more popular.
The dataset shows that bpm values between 90-110 and 120-140 are the most frequent, indicating that many songs, regardless of their popularity, tend to fall within these tempo ranges. Both charting and non-charting songs are heavily represented in these bpm ranges, which implies that while these tempos are popular, they are not the sole determining factor for a song’s success on the Spotify charts. The prevalence of these bpm ranges among both charting and non-charting songs suggests that these tempos are characteristic of mainstream genres, such as pop, dance, and other popular styles that often feature bpm within these ranges.
ggplot(spotify_2023_updated, aes(x = bpm, fill = in_spotify_charts)) +
geom_bar() +
labs(x = "bpm", y = "Count of Songs", title = "Distribution of Songs by bpm") +
scale_fill_manual(values = c("#0072B2", "#D55E00")) + # Custom colors for "yes" and "no"
theme_minimal()
The most frequently occurring danceability percentage in the dataset is between 70% and 80%. This range contains the highest count of songs, indicating a preference for this level of danceability in general music production. This characteristic is crucial for songs to appeal to a broader audience and perform well on music charts.
ggplot(spotify_2023_updated, aes(x = danceability_percent, fill = in_spotify_charts)) +
geom_bar() +
labs(x = "danceability", y = "Count of Songs", title = "Distribution of Songs by danceability") +
scale_fill_manual(values = c("#0072B2", "#D55E00")) + # Custom colors for "yes" and "no"
theme_minimal()
The box plot for songs in the Spotify charts (“Yes”) is larger than for songs not in the charts (“No”), indicating higher variability in the number of streams for charting songs. The median (the line inside the box) for the “Yes” category is higher than for the “No” category, showing that charting songs generally have a higher number of streams. There are also more extreme values (outliers) in the “Yes” category: songs with exceptionally high stream counts that significantly exceed the typical range.
ggplot(spotify_2023_updated, aes(x = in_spotify_charts, y = as.numeric(streams), fill = in_spotify_charts)) +
geom_boxplot() +
labs(x = "In Spotify Charts", y = "Streams", title = "Distribution of Streams by Spotify Charts") +
scale_fill_manual(values = c("#0072B2", "#D55E00"))+
theme_minimal()
We can now proceed to building our models. First, we will randomly split our data into training and testing sets. Next, we will set up and create our recipe, and finally, we will establish cross-validation within our models.
Our first step is to split the data into separate datasets. One
dataset will be used for training our models, while the other will be
reserved as the testing set, which will only be used once when we
actually test our models.
First, we need to set a seed to ensure that our random split can be
reproduced every time we train our models.
Next, we perform a training/testing split on our data and stratify
based on our response variable, in_spotify_charts, to ensure that the
distribution of this variable is maintained in both the training and
testing sets.
set.seed(0926)
spotify_split<-initial_split(spotify_2023_updated,prop=0.75,strata = in_spotify_charts)
spotify_train<-training(spotify_split)
spotify_test<-testing(spotify_split)
set.seed(0926)
train_dim<-dim(spotify_train)
train_dim
## [1] 611 14
For the data split, I chose a proportion of 0.75. From the dimensions in the output we know that the training set contains 611 observations, which is reasonably large for our project.
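As a quick sanity check on the stratified split, we can compare the class proportions of the response variable in both sets; because we stratified on in_spotify_charts, they should be nearly identical (a small optional check):
## comparing the response distribution in the training and testing sets
prop.table(table(spotify_train$in_spotify_charts))
prop.table(table(spotify_test$in_spotify_charts))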
We will now bring together our predictors and our response variable to build our recipe, which we will use for all the models. This recipe will ensure that our data is consistently preprocessed for model training and evaluation.
spotify_recipe <- recipe(in_spotify_charts ~ ., data = spotify_train) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_scale(all_numeric_predictors()) %>%
step_center(all_numeric_predictors())
We will stratify our cross-validation on our response variable, in_spotify_charts, and use 10 folds to perform stratified cross-validation. Also, we will save the results to an RDA file. This allows us to load the models later without needing to rebuild them, ensuring efficient use of our time and resources.
set.seed(0926)
spotify_folds <- vfold_cv(spotify_train, v = 10, strata = in_spotify_charts) # folds are built from the training set only
save(spotify_recipe, spotify_folds, spotify_train, spotify_test, file = "/Users/zhaolei/Desktop/131 final project/spotify.rda")
In the last part we will build our prediction models. However, these models take quite a long time to compute, so we will load them from separate R files rather than run them in this R Markdown file, in order to save time. In those R files we have already loaded the data, so we don’t have to run it again in the R Markdown file.
We use six models in total: random forest, gradient-boosted trees, LDA, QDA, KNN, and logistic regression. The last four are quite simple, while the first two are relatively complex; we are especially interested in the first two, since ensemble tree methods are commonly used for classification problems like this one.
The two most useful tools to evaluate performance are accuracy and ROC AUC, and we will focus more on ROC AUC in the later parts. The reason for emphasizing ROC AUC is the unbalanced nature of our dataset: the number of songs that are on the Spotify charts (“Yes”) exceeds the number of songs that are not (“No”). This imbalance can make accuracy a misleading metric, because a model could achieve reasonable accuracy by simply predicting the majority class for all instances, as the sketch below illustrates.
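Here is a minimal illustration of that point, assuming tidymodels (yardstick) is loaded: a baseline that always predicts “Yes” and assigns every song the same probability gets a non-trivial accuracy but an ROC AUC of exactly 0.5.
## illustrative baseline: always predict the majority class ("Yes")
majority_baseline <- spotify_train %>%
  mutate(
    .pred_class = factor("Yes", levels = levels(in_spotify_charts)),
    .pred_No = 0.5 # a constant score carries no ranking information
  )
accuracy(majority_baseline, truth = in_spotify_charts, estimate = .pred_class) # equals the "Yes" share
roc_auc(majority_baseline, truth = in_spotify_charts, .pred_No) # 0.5, no better than chance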
Here we load the fitted models from other R files so we can save time when knitting the HTML file.
load("/Users/zhaolei/Desktop/131 final project/logistic_regression.rda")
load("/Users/zhaolei/Desktop/131 final project/random_forest_spotify.rda")
load("/Users/zhaolei/Desktop/131 final project/Gradient_boosted_trees.rda")
load("/Users/zhaolei/Desktop/131 final project/knn_model.rda")
load("/Users/zhaolei/Desktop/131 final project/LDA.rda")
load("/Users/zhaolei/Desktop/131 final project/QDA.rda")
The model-building process for most machine learning models tends to follow a similar structure. However, models like Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA) are simpler and quicker to train, and thus have a slightly shorter process. Here is the general workflow for building our models:
1. Model Specification: We start by specifying the type of model we want to build, setting the engine, and defining the mode. For our project, the mode is always set to ‘classification’ because our goal is to classify whether a song will be on the Spotify charts or not.
2. Workflow Setup: We create a workflow, add the specified model, and incorporate our established Spotify recipe into this workflow. The recipe includes data preprocessing steps such as dummy variable creation and scaling.
3. Simpler Models: For simpler models like Logistic Regression, LDA, and QDA, we skip the steps involving hyperparameter tuning (steps 4-6), since these models typically do not require extensive tuning.
4. Tuning Grid Setup: For more complex models, we set up a tuning grid with the parameters we want to tune, defining the range and levels for each parameter to explore different combinations.
5. Model Tuning: We tune the model using the specified hyperparameters. This involves training multiple versions of the model with different parameter settings to find the best combination.
6. Model Selection: After tuning, we select the model that performed best based on our chosen evaluation metric (ROC AUC). This step ensures that we are using the most effective model configuration.
7. Finalizing the Workflow: We finalize the workflow with the best tuning parameters identified in the previous step, so that the final model configuration is optimized for performance.
8. Model Fitting: We fit the finalized model to our training dataset using the workflow, training the model on the entire training set with the optimal parameters.
Below is the code for our six models. We don’t run these chunks when we knit the file; we only show how the models are set up, so we set eval=FALSE in the chunk options to save knitting time.
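For reference, this is what such a chunk header looks like in the Rmd source (the chunk label here is illustrative):
```{r random-forest-setup, eval=FALSE}
## model setup code goes here; it is displayed but not executed at knit time
```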
## Logistic Regression
log_reg_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_reg_wf <- workflow() %>%
add_model(log_reg_spec) %>%
add_recipe(spotify_recipe)
log_reg_fit <- fit_resamples(log_reg_wf, resamples = spotify_folds,
control = control_resamples(save_pred = TRUE))
## Random Forest
rf_model <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
rf_wkflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(spotify_recipe)
rf_grid <- grid_regular(mtry(range = c(1, 5)),
trees(range = c(100, 500)),
min_n(range = c(2, 5)),
levels = 5)
rf_res <- tune_grid(
rf_wkflow,
resamples = spotify_folds,
grid = rf_grid,
control = control_grid(save_pred = TRUE)
)
## KNN
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
knn_wflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(spotify_recipe)
knn_grid <- grid_regular(neighbors(range = c(1, 20)), levels = 10)
knn_res <- tune_grid(
knn_wflow,
resamples = spotify_folds,
grid = knn_grid,
control = control_grid(save_pred = TRUE)
)
## Gradient Boosted Trees
bt_class_spec <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
bt_class_wf <- workflow() %>%
add_model(bt_class_spec) %>%
add_recipe(spotify_recipe)
bt_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
learn_rate(range = c(-10, -1)),
levels = 5)
tune_bt_class <- tune_grid(
bt_class_wf,
resamples = spotify_folds,
grid = bt_grid
)
### LDA
lda_mod <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
LDA_wkflow <- workflow() %>%
add_model(lda_mod) %>%
add_recipe(spotify_recipe)
LDA_fit <- fit_resamples(LDA_wkflow, spotify_folds)
## QDA
qda_mod <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
QDA_wkflow <- workflow() %>%
add_model(qda_mod) %>%
add_recipe(spotify_recipe)
QDA_fit <- fit_resamples(QDA_wkflow, spotify_folds)
To summarize the best ROC AUC values from our six models, we will create a tibble to display the estimated final ROC AUC value for each fitted model.
## logistic regression
log_roc_auc_score <- log_reg_fit %>% collect_metrics() %>%
filter(.metric == "roc_auc") # keep only the roc_auc row
## Random Forest
rf_metrics <- collect_metrics(rf_res)
roc_auc_scores_1 <- rf_metrics %>%filter(.metric == "roc_auc") %>%
arrange(desc(mean)) %>%
select(mean)
rf_roc_auc_score<-roc_auc_scores_1$mean[1]
## Gradient Boosted Trees
bt_metrics <- collect_metrics(tune_bt_class)
roc_auc_scores_2 <- bt_metrics %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean)) %>%
select(mean)
bt_roc_auc_score<-roc_auc_scores_2$mean[1]
## KNN
KNN_metrics <- collect_metrics(knn_res)
roc_auc_scores_3 <- KNN_metrics %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean)) %>%
select(mean)
knn_roc_auc_score<-roc_auc_scores_3$mean[1]
## LDA
lda_roc_auc_score <- LDA_fit %>% collect_metrics() %>%
filter(.metric == "roc_auc") # keep only the roc_auc row
## QDA
qda_roc_auc_score <- QDA_fit %>% collect_metrics() %>%
filter(.metric == "roc_auc") # keep only the roc_auc row
## visualize the values
spotify_roc_aucs <- c(log_roc_auc_score$mean,
rf_roc_auc_score,
bt_roc_auc_score,
knn_roc_auc_score,
lda_roc_auc_score$mean,
qda_roc_auc_score$mean)
spotify_mod_names <- c("Logistic Regression",
"Random Forest",
"Boosted Trees",
"KNN",
"LDA",
"QDA"
)
spotify_tibble <- tibble("Model" = spotify_mod_names,
"Values" = spotify_roc_aucs) %>%
dplyr::arrange(desc(Values)) %>% print()
## # A tibble: 6 × 2
## Model Values
## <chr> <dbl>
## 1 Random Forest 0.823
## 2 Boosted Trees 0.815
## 3 Logistic Regression 0.808
## 4 LDA 0.807
## 5 QDA 0.777
## 6 KNN 0.764
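To make the comparison easier to read at a glance, we can also plot these values (a small optional visualization of the tibble above):
## bar chart of cross-validated ROC AUC by model
ggplot(spotify_tibble, aes(x = reorder(Model, Values), y = Values)) +
  geom_col(fill = "#0072B2") +
  coord_flip() +
  labs(x = "Model", y = "ROC AUC", title = "Cross-Validated ROC AUC by Model") +
  theme_minimal()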
As we can see in our tibble, the Random Forest model performed the best overall with a ROC AUC score of 0.823, and gradient-boosted trees had the second highest value at 0.815. Since these values reflect performance on the training data only, we have to evaluate the winning model on our testing data. In the later part, we will use the Random Forest model to predict the testing data.
autoplot(rf_res,metric = "roc_auc")
For the random forest, we tuned the minimal node size, the number of randomly selected predictors, and the number of trees. From the output below we can see that the optimal combination used a minimal node size of 4, with 300 trees and 2 randomly selected predictors.
From the output of show_best below we can see that Random Forest #062 has the best performance of all the random forest models.
show_best(rf_res, metric = "roc_auc") %>%
select(-.estimator, .config) %>%
slice(1)
## # A tibble: 1 × 8
## mtry trees min_n .metric mean n std_err .config
## <int> <int> <int> <chr> <dbl> <int> <dbl> <chr>
## 1 2 300 4 roc_auc 0.823 10 0.0108 Preprocessor1_Model062
Here we compute the test-set ROC AUC and plot the ROC curve. The closer the curve is to the top-left corner, the better the model’s AUC score. Although our ROC curve does not perfectly reach the top-left corner, it trends in that direction, indicating good model performance. This is consistent with the AUC score we calculated earlier and shows that our model performs well.
best_rf_class <- select_best(rf_res, metric = "roc_auc")
final_rf_wkflow <- finalize_workflow(rf_wkflow, best_rf_class)
final_rf_fit <- fit(final_rf_wkflow, data = spotify_train)
rf_predict_augmented <- augment(final_rf_fit, new_data = spotify_test, type = 'prob')
rf_roc_predict_auc_score <- rf_predict_augmented %>%
roc_auc(truth = in_spotify_charts, .pred_No) %>%
select(.estimate)
print(rf_roc_predict_auc_score)
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.822
augment(final_rf_fit, new_data = spotify_test, type = 'prob')%>%
roc_curve(in_spotify_charts, .pred_No) %>%
autoplot()
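Because we set importance = "impurity" in the ranger engine, we can also inspect which predictors drove the random forest’s decisions. A minimal sketch, assuming the vip package is installed:
## variable importance from the fitted random forest
library(vip)
final_rf_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)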
Now we will test how useful our random forest model is at predicting whether a song is in the Spotify charts or not.
In this part, we create a new dataset with specific values for the predictors to simulate a song with high potential to be on the Spotify charts. Next, we select the best parameters for our Random Forest model based on the previous tuning results, finalize our workflow with these parameters, and fit the model using our training data. Then we use the fitted model to predict whether our simulated song would appear on the Spotify charts.
predictor_of_Yes <- c("artist_count", "in_spotify_playlists", "streams", "in_apple_playlists",
                      "in_apple_charts", "in_deezer_playlists", "in_deezer_charts", "bpm",
                      "mode", "danceability_percent", "valence_percent", "energy_percent",
                      "liveness_percent")
values_of_Yes <- c(2,2651,304.118600,21,1,32,1,94,0,89,61,66,36)
predicting_Yes <- as_tibble(as.data.frame(matrix(values_of_Yes, nrow = 1)))
names(predicting_Yes) <- predictor_of_Yes
predicting_Yes <- predicting_Yes %>%
mutate(
in_apple_charts = factor(in_apple_charts, levels = c(0, 1)),
in_deezer_charts = factor(in_deezer_charts, levels = c(0, 1)),
mode = factor(mode, levels = c(0, 1))
)
best_rf_class <- tibble(
mtry = 2,
trees = 300,
min_n = 4
)
final_rf_wkflow <- finalize_workflow(rf_wkflow, best_rf_class)
final_rf_fit <- fit(final_rf_wkflow, data = spotify_train)
predict(final_rf_fit, new_data = predicting_Yes, type = "class")
## # A tibble: 1 × 1
## .pred_class
## <fct>
## 1 Yes
Similarly, we create another tibble (predicting_No) with values that suggest a lower likelihood of charting. This dataset undergoes the same formatting process as the first one.
predictor_of_No <- c("artist_count", "in_spotify_playlists", "streams", "in_apple_playlists",
                     "in_apple_charts", "in_deezer_playlists", "in_deezer_charts", "bpm",
                     "mode", "danceability_percent", "valence_percent", "energy_percent",
                     "liveness_percent")
values_of_No <- c(2,4260,1065.580332,113,1,259,0,120,1,65,80,86,19)
predicting_No <- as_tibble(as.data.frame(matrix(values_of_No, nrow = 1)))
names(predicting_No) <- predictor_of_No
predicting_No <- predicting_No %>%
mutate(
in_apple_charts = factor(in_apple_charts, levels = c(0, 1)),
in_deezer_charts = factor(in_deezer_charts, levels = c(0, 1)),
mode = factor(mode, levels = c(0, 1))
)
predict(final_rf_fit, new_data = predicting_No, type = "class")
## # A tibble: 1 × 1
## .pred_class
## <fct>
## 1 No
As we can see, the model correctly predicted whether each song would appear on the Spotify charts. This is very encouraging, as it shows that our model is effective, and it reassures us of the model’s practical utility.
augment(final_rf_fit, new_data = spotify_test, type = 'prob') %>%
accuracy(in_spotify_charts, .pred_class) %>%
select(.estimate)
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.732
Our random forest model was able to predict chart presence in our testing data with about 73.2% accuracy.
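To see how that accuracy breaks down across the two classes, which matters given the class imbalance discussed earlier, a confusion matrix is a quick optional check:
## confusion matrix of test-set predictions
augment(final_rf_fit, new_data = spotify_test) %>%
  conf_mat(truth = in_spotify_charts, estimate = .pred_class) %>%
  autoplot(type = "heatmap")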
The project successfully developed predictive models that can identify songs likely to appear on the Spotify charts based on 2023 data. This predictive capability provides valuable insights into the factors contributing to song popularity on streaming platforms. Stakeholders in the music industry can leverage these models to optimize promotional strategies, playlist placements, and artist collaborations to maximize song visibility and streaming performance. Future iterations of this project could explore more advanced modeling techniques, incorporate real-time data for ongoing predictions, and expand the scope to include more diverse datasets and platforms beyond those considered here.
To further improve the model’s accuracy, incorporating additional data sources such as social media trends and listener demographics could be highly beneficial. Social media trends, including the number of mentions, shares, and likes a song receives across platforms like Twitter, Instagram, and TikTok, can provide real-time indicators of a track’s popularity and potential for virality. Listener demographics, including age, gender, location, and listening habits, can offer deeper insights into the target audience and how different segments engage with music. Additionally, implementing more advanced techniques like deep learning and natural language processing (NLP) for analyzing song lyrics and sentiments could reveal patterns in lyrical content, emotional tone, and thematic elements that resonate with listeners. These methods can help identify the types of lyrics that drive engagement and streaming counts. Furthermore, exploring the impact of marketing strategies and playlist placements, such as the effect of being featured on popular playlists or receiving high-profile endorsements, could offer a more comprehensive understanding of what drives a song’s success on streaming platforms. This holistic approach can provide a richer, multi-dimensional view of the factors contributing to a track’s popularity, enabling more accurate predictions and actionable insights for artists and industry stakeholders.
Source of dataset: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023