The aim of this project is to develop a machine learning model that predicts whether the most streamed Spotify tracks of 2023 appear on the official Spotify charts. The data for this project is gathered exclusively from Kaggle; the dataset is detailed below along with its citation. Multiple machine learning techniques will be applied to create the most accurate prediction model for this task.
Understanding the factors that contribute to a song’s popularity on streaming platforms like Spotify is invaluable for artists, producers, and record labels. By predicting which tracks will be most streamed, stakeholders in the music industry can make more informed decisions regarding marketing strategies, playlist placements, and promotional efforts. Additionally, this model could help new artists and emerging talents understand the key attributes that contribute to a song’s success, potentially guiding their creative processes. Moreover, for music enthusiasts and data scientists, this project offers a fascinating glimpse into the intersection of music and data analytics, highlighting how machine learning can provide insights into cultural trends.
The data was collected exclusively from Kaggle and contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. This dataset offers a wealth of features beyond what is typically available in similar datasets, providing insights into each song’s attributes, popularity, and presence on various music platforms. The dataset includes information such as: track name, artist(s) name, release date, Spotify playlists and charts, streaming statistics, Apple Music presence, Deezer presence, Shazam charts, and various audio features. The dataset can be downloaded from https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023.
In order to fulfill our goal of achieving a better prediction, we have a detailed plan on how to deal with the data:
1. We will clean the data by removing observations with missing values, as well as any predictor variables that are unnecessary for our analysis.
2. Based on the remaining variables, we will draw graphs and plots to develop a better sense of how they relate to a song’s presence on the official Spotify charts.
3. We will then perform a training/test split on our data, make a recipe, and set folds for the 10-fold cross-validation we will implement. Logistic regression, random forest, gradient-boosted trees, LDA, QDA, and KNN models will all be fit to the training data once the setup is complete.
4. After we have results, we will select the best model based on its roc_auc score, and then fit it to our test data to make a prediction of “Yes” or “No”.
We will load our data first and use clean_names() to standardize every variable name.
## loading the data
spotify_2023_raw<-read_csv("/Users/zhaolei/Desktop/131 final project/spotify-2023.csv")
## cleaning predictor names
spotify_2023_cleaned <- clean_names(spotify_2023_raw)
To better visualize the dataset, I create the table listed below and also provide a codebook that enables readers to gain a better understanding of the data.
# Using DT to view our data in table
datatable(
head(spotify_2023_cleaned,50),
options = list(
pageLength = 10,
autoWidth = TRUE,
scrollX = TRUE
),
class = 'display compact'
)
track_name: Name of the song
artist_name: Name of the artist(s) of the song
artist_count: Number of artists contributing to the song
released_year: Year when the song was released
released_month: Month when the song was released
released_day: Day of the month when the song was released
in_spotify_playlists: Number of Spotify playlists the song is included in
in_spotify_charts: Presence and rank of the song on Spotify charts
streams: Total number of streams on Spotify
in_apple_playlists: Number of Apple Music playlists the song is included in
in_apple_charts: Presence and rank of the song on Apple Music charts
in_deezer_playlists: Number of Deezer playlists the song is included in
in_deezer_charts: Presence and rank of the song on Deezer charts
in_shazam_charts: Presence and rank of the song on Shazam charts
bpm: Beats per minute, a measure of song tempo
key: Key of the song
mode: Mode of the song (major or minor)
danceability_%: Percentage indicating how suitable the song is for dancing
valence_%: Positivity of the song’s musical content
energy_%: Perceived energy level of the song
acousticness_%: Amount of acoustic sound in the song
instrumentalness_%: Amount of instrumental content in the song
liveness_%: Presence of live performance elements
speechiness_%: Amount of spoken words in the song
### Visually Checking Missing Data
vis_miss(spotify_2023_cleaned)
spotify_2023_updated<-spotify_2023_cleaned%>%
select(-in_shazam_charts)
With 5 percent of the data missing in the variable in_shazam_charts, we do have to remove it. Since Shazam is not a major platform, we don’t have to worry too much about its influence on our results. Also, since we are not concerned with the value of key, we will remove it later when we do variable selection.
Data Transformation:
Converts in_spotify_charts to a factor with levels “No” and “Yes”.
Converts in_apple_charts and in_deezer_charts to binary factors (0 and 1).
Converts mode to a binary factor where “Major” is 0 and any other value is 1.
Scales streams down by dividing by 1,000,000, so it is measured in millions of streams.
Data Cleaning:
Removes rows with any missing values.
Variable Selection:
Selects a specific set of predictor variables for further analysis and creates a new data frame, spotify_2023_updated.
## creating dummy variable for predictor and response
spotify_2023_cleaned <- spotify_2023_cleaned %>%
mutate(
in_spotify_charts = as.factor(ifelse(in_spotify_charts == 0, "No", "Yes")),
in_apple_charts = as.factor(ifelse(in_apple_charts == 0, 0, 1)),
in_deezer_charts = as.factor(ifelse(in_deezer_charts == 0, 0, 1)),
mode = as.factor(ifelse(mode == "Major", 0, 1)),
streams = as.numeric(streams) / 1000000 # scale streams to millions
) %>%
filter(complete.cases(.))
## Variable Selection
predictor_of_spotifychart <- c("artist_count", "in_spotify_charts", "in_spotify_playlists",
                               "streams", "in_apple_playlists", "in_apple_charts",
                               "in_deezer_playlists", "in_deezer_charts", "bpm", "mode",
                               "danceability_percent", "valence_percent",
                               "energy_percent", "liveness_percent")
spotify_2023_updated <- spotify_2023_cleaned %>%
select(any_of(predictor_of_spotifychart))
# Using DT to view our data in table
datatable(
head(spotify_2023_updated,50),
options = list(
pageLength = 10,
autoWidth = TRUE,
scrollX = TRUE
),
class = 'display compact'
)
To gain a clearer understanding of the distribution of our response variable and predictors, we will:
1. Generate a plot to visualize the response variable.
2. Create a correlation matrix to identify potential correlations among the predictor variables.
3. Produce visualization plots to explore the impact of selected predictor variables on the response variable.
Before we delve deeper into building our models, it’s important to note that songs appearing on the Spotify charts somewhat outnumber those that do not. This imbalance is expected, since the most streamed songs have a very high likelihood of getting into the charts.
spotify_2023_updated%>%
ggplot(aes(x=in_spotify_charts))+
geom_bar()+
labs(x="whether in spotify charts", y="number of songs")
As we can see, the “Yes” group is slightly larger than the “No” group. Therefore, our assumption about these Spotify songs is reasonable.
In this part, we’ll make a correlation matrix of the numeric predictors and then visualize it as a heat map.
# making a correlation matrix and heat map of the predictors
spotify_2023_updated %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = 'full', diag = FALSE,
method = 'number')
In this heatmap, it’s not surprising that the upper-left corner has the strongest correlation values, since most of those playlist and chart variables are intuitively related to each other. Additionally, the correlations in the lower-right corner are also quite significant. However, I did not expect the correlations between streams and the audio-feature percentages below it to be so weak, especially since those features are crucial elements for making music more popular.
The dataset shows that bpm values between 90-110 and 120-140 are the most frequent, indicating that many songs, regardless of their popularity, tend to fall within these tempo ranges. Both charting and non-charting songs are heavily represented in these bpm ranges, which implies that while these tempos are popular, they are not the sole determining factor for a song’s success on the Spotify charts. The prevalence of these bpm ranges among both charting and non-charting songs suggests that these tempos are characteristic of mainstream genres, such as pop, dance, and other popular styles that often feature bpm within these ranges.
ggplot(spotify_2023_updated, aes(x = bpm, fill = in_spotify_charts)) +
geom_bar() +
labs(x = "bpm", y = "Count of Songs", title = "Distribution of Songs by bpm") +
scale_fill_manual(values = c("#0072B2", "#D55E00")) + # Custom colors for "yes" and "no"
theme_minimal()
The most frequently occurring danceability percentage in the dataset is between 70% and 80%. This range contains the highest count of songs, indicating a preference for this level of danceability in general music production. This characteristic is crucial for songs to appeal to a broader audience and perform well on music charts.
ggplot(spotify_2023_updated, aes(x = danceability_percent, fill = in_spotify_charts)) +
geom_bar() +
labs(x = "danceability", y = "Count of Songs", title = "Distribution of Songs by danceability") +
scale_fill_manual(values = c("#0072B2", "#D55E00")) + # Custom colors for "yes" and "no"
theme_minimal()
The box plot for songs in the Spotify charts (“Yes”) is larger than for songs not in the charts (“No”), indicating higher variability in the number of streams for charting songs. The median (the line inside the box) for the “Yes” category is higher than for the “No” category, showing that charting songs generally have a higher number of streams. There are also more extreme values (outliers) in the “Yes” category: songs with exceptionally high stream counts that significantly exceed the typical range.
ggplot(spotify_2023_updated, aes(x = in_spotify_charts, y = as.numeric(streams), fill = in_spotify_charts)) +
geom_boxplot() +
labs(x = "In Spotify Charts", y = "Streams", title = "Distribution of Streams by Spotify Charts") +
scale_fill_manual(values = c("#0072B2", "#D55E00"))+
theme_minimal()
We can now proceed to building our models. First, we will randomly split our data into training and testing sets. Next, we will set up and create our recipe, and finally, we will establish cross-validation within our models.
Our first step is to split the data into separate datasets. One
dataset will be used for training our models, while the other will be
reserved as the testing set, which will only be used once when we
actually test our models.
First, we need to set a seed to ensure that our random split can be
reproduced every time we train our models.
Next, we perform a training/testing split on our data and stratify
based on our response variable, in_spotify_charts, to ensure that the
distribution of this variable is maintained in both the training and
testing sets.
set.seed(0926)
spotify_split<-initial_split(spotify_2023_updated,prop=0.75,strata = in_spotify_charts)
spotify_train<-training(spotify_split)
spotify_test<-testing(spotify_split)
set.seed(0926)
train_dim<-dim(spotify_train)
train_dim
## [1] 611 14
For the data split, I chose a proportion of 0.75. From the dimensions in the output we know that the training set contains 611 observations, which is reasonably large for our project.
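As a quick sanity check on the stratified split, we can compare the class proportions of the response variable in both sets; because we stratified on in_spotify_charts, they should be nearly identical (a small optional check):
## comparing the response distribution in the training and testing sets
prop.table(table(spotify_train$in_spotify_charts))
prop.table(table(spotify_test$in_spotify_charts))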
We will now bring together our predictors and our response variable to build our recipe, which we will use for all the models. This recipe will ensure that our data is consistently preprocessed for model training and evaluation.
spotify_recipe <- recipe(in_spotify_charts ~ ., data = spotify_train) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_scale(all_numeric_predictors()) %>%
step_center(all_numeric_predictors())
We will stratify our cross-validation on our response variable, in_spotify_charts, and use 10 folds to perform stratified cross-validation. Also, we will save the results to an RDA file. This allows us to load the models later without needing to rebuild them, ensuring efficient use of our time and resources.
set.seed(0926)
spotify_folds <- vfold_cv(spotify_train, v = 10, strata = in_spotify_charts) # folds are built from the training set only
save(spotify_recipe, spotify_folds, spotify_train, spotify_test, file = "/Users/zhaolei/Desktop/131 final project/spotify.rda")
In the last part we will build our prediction models. However, these models take quite a long time to compute, so we will load them from separate R files rather than run them in this R Markdown file, in order to save time. In those R files we have already loaded the data, so we don’t have to run it again in the R Markdown file.
We use six models in total: random forest, gradient-boosted trees, LDA, QDA, KNN, and logistic regression. The last four are quite simple, while the first two are relatively complex; we are especially interested in the first two, since ensemble tree methods are commonly used for classification problems like this one.
The two most useful tools to evaluate performance are accuracy and ROC AUC, and we will focus more on ROC AUC in the later parts. The reason for emphasizing ROC AUC is the unbalanced nature of our dataset: the number of songs that are on the Spotify charts (“Yes”) exceeds the number of songs that are not (“No”). This imbalance can make accuracy a misleading metric, because a model could achieve reasonable accuracy by simply predicting the majority class for all instances, as the sketch below illustrates.
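Here is a minimal illustration of that point, assuming tidymodels (yardstick) is loaded: a baseline that always predicts “Yes” and assigns every song the same probability gets a non-trivial accuracy but an ROC AUC of exactly 0.5.
## illustrative baseline: always predict the majority class ("Yes")
majority_baseline <- spotify_train %>%
  mutate(
    .pred_class = factor("Yes", levels = levels(in_spotify_charts)),
    .pred_No = 0.5 # a constant score carries no ranking information
  )
accuracy(majority_baseline, truth = in_spotify_charts, estimate = .pred_class) # equals the "Yes" share
roc_auc(majority_baseline, truth = in_spotify_charts, .pred_No) # 0.5, no better than chance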
Here we load the fitted models from other R files so we can save time when knitting the HTML file.
load("/Users/zhaolei/Desktop/131 final project/logistic_regression.rda")
load("/Users/zhaolei/Desktop/131 final project/random_forest_spotify.rda")
load("/Users/zhaolei/Desktop/131 final project/Gradient_boosted_trees.rda")
load("/Users/zhaolei/Desktop/131 final project/knn_model.rda")
load("/Users/zhaolei/Desktop/131 final project/LDA.rda")
load("/Users/zhaolei/Desktop/131 final project/QDA.rda")
The model-building process for most machine learning models tends to follow a similar structure. However, models like Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA) are simpler and quicker to train, and thus have a slightly shorter process. Here is the general workflow for building our models:
1. Model Specification: We start by specifying the type of model we want to build, setting the engine, and defining the mode. For our project, the mode is always set to ‘classification’ because our goal is to classify whether a song will be on the Spotify charts or not.
2. Workflow Setup: We create a workflow, add the specified model, and incorporate our established Spotify recipe into this workflow. The recipe includes data preprocessing steps such as dummy variable creation and scaling.
3. Simpler Models: For simpler models like Logistic Regression, LDA, and QDA, we skip the steps involving hyperparameter tuning (steps 4-6), since these models typically do not require extensive tuning.
4. Tuning Grid Setup: For more complex models, we set up a tuning grid with the parameters we want to tune, defining the range and levels for each parameter to explore different combinations.
5. Model Tuning: We tune the model using the specified hyperparameters. This involves training multiple versions of the model with different parameter settings to find the best combination.
6. Model Selection: After tuning, we select the model that performed best based on our chosen evaluation metric (ROC AUC). This step ensures that we are using the most effective model configuration.
7. Finalizing the Workflow: We finalize the workflow with the best tuning parameters identified in the previous step, so that the final model configuration is optimized for performance.
8. Model Fitting: We fit the finalized model to our training dataset using the workflow, training the model on the entire training set with the optimal parameters.
Below is the code for our six models. We don’t run these chunks when we knit the file; we only show how the models are set up, so we set eval=FALSE in the chunk options to save knitting time.
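For reference, this is what such a chunk header looks like in the Rmd source (the chunk label here is illustrative):
```{r random-forest-setup, eval=FALSE}
## model setup code goes here; it is displayed but not executed at knit time
```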
## Logistic Regression
log_reg_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_reg_wf <- workflow() %>%
add_model(log_reg_spec) %>%
add_recipe(spotify_recipe)
log_reg_fit <- fit_resamples(log_reg_wf, resamples = spotify_folds,
control = control_resamples(save_pred = TRUE))
## Random Forest
rf_model <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
rf_wkflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(spotify_recipe)
rf_grid <- grid_regular(mtry(range = c(1, 5)),
trees(range = c(100, 500)),
min_n(range = c(2, 5)),
levels = 5)
rf_res <- tune_grid(
rf_wkflow,
resamples = spotify_folds,
grid = rf_grid,
control = control_grid(save_pred = TRUE)
)
## KNN
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
knn_wflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(spotify_recipe)
knn_grid <- grid_regular(neighbors(range = c(1, 20)), levels = 10)
knn_res <- tune_grid(
knn_wflow,
resamples = spotify_folds,
grid = knn_grid,
control = control_grid(save_pred = TRUE)
)
## Gradient Boosted Trees
bt_class_spec <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
bt_class_wf <- workflow() %>%
add_model(bt_class_spec) %>%
add_recipe(spotify_recipe)
bt_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
learn_rate(range = c(-10, -1)),
levels = 5)
tune_bt_class <- tune_grid(
bt_class_wf,
resamples = spotify_folds,
grid = bt_grid
)
### LDA
lda_mod <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
LDA_wkflow <- workflow() %>%
add_model(lda_mod) %>%
add_recipe(spotify_recipe)
LDA_fit <- fit_resamples(LDA_wkflow, spotify_folds)
## QDA
qda_mod <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
QDA_wkflow <- workflow() %>%
add_model(qda_mod) %>%
add_recipe(spotify_recipe)
QDA_fit <- fit_resamples(QDA_wkflow, spotify_folds)
To summarize the best ROC AUC values from our six models, we will create a tibble to display the estimated final ROC AUC value for each fitted model.
## logistic regression
log_roc_auc_score <- log_reg_fit %>% collect_metrics() %>%
filter(.metric == "roc_auc") # keep only the roc_auc row
## Random Forest
rf_metrics <- collect_metrics(rf_res)
roc_auc_scores_1 <- rf_metrics %>%filter(.metric == "roc_auc") %>%
arrange(desc(mean)) %>%
select(mean)
rf_roc_auc_score<-roc_auc_scores_1$mean[1]
## Gradient Boosted Trees
bt_metrics <- collect_metrics(tune_bt_class)
roc_auc_scores_2 <- bt_metrics %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean)) %>%
select(mean)
bt_roc_auc_score<-roc_auc_scores_2$mean[1]
## KNN
KNN_metrics <- collect_metrics(knn_res)
roc_auc_scores_3 <- KNN_metrics %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean)) %>%
select(mean)
knn_roc_auc_score<-roc_auc_scores_3$mean[1]
## LDA
lda_roc_auc_score <- LDA_fit %>% collect_metrics() %>%
filter(.metric == "roc_auc") # keep only the roc_auc row
## QDA
qda_roc_auc_score <- QDA_fit %>% collect_metrics() %>%
filter(.metric == "roc_auc") # keep only the roc_auc row
## visualize the values
spotify_roc_aucs <- c(log_roc_auc_score$mean,
rf_roc_auc_score,
bt_roc_auc_score,
knn_roc_auc_score,
lda_roc_auc_score$mean,
qda_roc_auc_score$mean)
spotify_mod_names <- c("Logistic Regression",
"Random Forest",
"Boosted Trees",
"KNN",
"LDA",
"QDA"
)
spotify_tibble <- tibble("Model" = spotify_mod_names,
"Values" = spotify_roc_aucs) %>%
dplyr::arrange(desc(Values)) %>% print()
## # A tibble: 6 × 2
## Model Values
## <chr> <dbl>
## 1 Random Forest 0.823
## 2 Boosted Trees 0.815
## 3 Logistic Regression 0.808
## 4 LDA 0.807
## 5 QDA 0.777
## 6 KNN 0.764
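To make the comparison easier to read at a glance, we can also plot these values (a small optional visualization of the tibble above):
## bar chart of cross-validated ROC AUC by model
ggplot(spotify_tibble, aes(x = reorder(Model, Values), y = Values)) +
  geom_col(fill = "#0072B2") +
  coord_flip() +
  labs(x = "Model", y = "ROC AUC", title = "Cross-Validated ROC AUC by Model") +
  theme_minimal()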
As we can see in our tibble, the Random Forest model performed the best overall with a ROC AUC score of 0.823, and gradient-boosted trees had the second highest value at 0.815. Since these values reflect performance on the training data only, we have to evaluate the winning model on our testing data. In the later part, we will use the Random Forest model to predict the testing data.
autoplot(rf_res,metric = "roc_auc")
For the random forest, we tuned the minimal node size, the number of randomly selected predictors, and the number of trees. From the output below we can see that the optimal combination used a minimal node size of 4, with 300 trees and 2 randomly selected predictors.
From the output of show_best below we can see that Random Forest #062 has the best performance of all the random forest models.
show_best(rf_res, metric = "roc_auc") %>%
select(-.estimator, .config) %>%
slice(1)
## # A tibble: 1 × 8
## mtry trees min_n .metric mean n std_err .config
## <int> <int> <int> <chr> <dbl> <int> <dbl> <chr>
## 1 2 300 4 roc_auc 0.823 10 0.0108 Preprocessor1_Model062
Here we compute the test-set ROC AUC and plot the ROC curve. The closer the curve is to the top-left corner, the better the model’s AUC score. Although our ROC curve does not perfectly reach the top-left corner, it trends in that direction, indicating good model performance. This is consistent with the AUC score we calculated earlier and shows that our model performs well.
best_rf_class <- select_best(rf_res, metric = "roc_auc")
final_rf_wkflow <- finalize_workflow(rf_wkflow, best_rf_class)
final_rf_fit <- fit(final_rf_wkflow, data = spotify_train)
rf_predict_augmented <- augment(final_rf_fit, new_data = spotify_test, type = 'prob')
rf_roc_predict_auc_score <- rf_predict_augmented %>%
roc_auc(truth = in_spotify_charts, .pred_No) %>%
select(.estimate)
print(rf_roc_predict_auc_score)
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.822
augment(final_rf_fit, new_data = spotify_test, type = 'prob')%>%
roc_curve(in_spotify_charts, .pred_No) %>%
autoplot()
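Because we set importance = "impurity" in the ranger engine, we can also inspect which predictors drove the random forest’s decisions. A minimal sketch, assuming the vip package is installed:
## variable importance from the fitted random forest
library(vip)
final_rf_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)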
Now we will test how useful our random forest model is at predicting whether a song is in the Spotify charts or not.
In this part, we create a new dataset with specific values for the predictors to simulate a song with high potential to be on the Spotify charts. Next, we select the best parameters for our Random Forest model based on the previous tuning results, finalize our workflow with these parameters, and fit the model using our training data. Then we use the fitted model to predict whether our simulated song would appear on the Spotify charts.
predictor_of_Yes <- c("artist_count", "in_spotify_playlists", "streams", "in_apple_playlists",
                      "in_apple_charts", "in_deezer_playlists", "in_deezer_charts", "bpm",
                      "mode", "danceability_percent", "valence_percent", "energy_percent",
                      "liveness_percent")
values_of_Yes <- c(2,2651,304.118600,21,1,32,1,94,0,89,61,66,36)
predicting_Yes <- as_tibble(as.data.frame(matrix(values_of_Yes, nrow = 1)))
names(predicting_Yes) <- predictor_of_Yes
predicting_Yes <- predicting_Yes %>%
mutate(
in_apple_charts = factor(in_apple_charts, levels = c(0, 1)),
in_deezer_charts = factor(in_deezer_charts, levels = c(0, 1)),
mode = factor(mode, levels = c(0, 1))
)
best_rf_class <- tibble(
mtry = 2,
trees = 300,
min_n = 4
)
final_rf_wkflow <- finalize_workflow(rf_wkflow, best_rf_class)
final_rf_fit <- fit(final_rf_wkflow, data = spotify_train)
predict(final_rf_fit, new_data = predicting_Yes, type = "class")
## # A tibble: 1 × 1
## .pred_class
## <fct>
## 1 Yes
Similarly, we create another tibble (predicting_No) with values that suggest a lower likelihood of charting. This dataset undergoes the same formatting process as the first one.
predictor_of_No <- c("artist_count", "in_spotify_playlists", "streams", "in_apple_playlists",
                     "in_apple_charts", "in_deezer_playlists", "in_deezer_charts", "bpm",
                     "mode", "danceability_percent", "valence_percent", "energy_percent",
                     "liveness_percent")
values_of_No <- c(2,4260,1065.580332,113,1,259,0,120,1,65,80,86,19)
predicting_No <- as_tibble(as.data.frame(matrix(values_of_No, nrow = 1)))
names(predicting_No) <- predictor_of_No
predicting_No <- predicting_No %>%
mutate(
in_apple_charts = factor(in_apple_charts, levels = c(0, 1)),
in_deezer_charts = factor(in_deezer_charts, levels = c(0, 1)),
mode = factor(mode, levels = c(0, 1))
)
predict(final_rf_fit, new_data = predicting_No, type = "class")
## # A tibble: 1 × 1
## .pred_class
## <fct>
## 1 No
As we can see, the model correctly predicted whether each song would appear on the Spotify charts. This is very encouraging, as it shows that our model is effective, and it reassures us of the model’s practical utility.
augment(final_rf_fit, new_data = spotify_test, type = 'prob') %>%
accuracy(in_spotify_charts, .pred_class) %>%
select(.estimate)
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.732
Our random forest model was able to predict chart presence in our testing data with about 73.2% accuracy.
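To see how that accuracy breaks down across the two classes, which matters given the class imbalance discussed earlier, a confusion matrix is a quick optional check:
## confusion matrix of test-set predictions
augment(final_rf_fit, new_data = spotify_test) %>%
  conf_mat(truth = in_spotify_charts, estimate = .pred_class) %>%
  autoplot(type = "heatmap")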
The project successfully developed predictive models that can identify songs likely to appear on the Spotify charts based on 2023 data. This predictive capability provides valuable insights into the factors contributing to song popularity on streaming platforms. Stakeholders in the music industry can leverage these models to optimize promotional strategies, playlist placements, and artist collaborations to maximize song visibility and streaming performance. Future iterations of this project could explore more advanced modeling techniques, incorporate real-time data for ongoing predictions, and expand the scope to include more diverse datasets and platforms beyond those considered here.
To further improve the model’s accuracy, incorporating additional data sources such as social media trends and listener demographics could be highly beneficial. Social media trends, including the number of mentions, shares, and likes a song receives across platforms like Twitter, Instagram, and TikTok, can provide real-time indicators of a track’s popularity and potential for virality. Listener demographics, including age, gender, location, and listening habits, can offer deeper insights into the target audience and how different segments engage with music. Additionally, implementing more advanced techniques like deep learning and natural language processing (NLP) for analyzing song lyrics and sentiments could reveal patterns in lyrical content, emotional tone, and thematic elements that resonate with listeners. These methods can help identify the types of lyrics that drive engagement and streaming counts. Furthermore, exploring the impact of marketing strategies and playlist placements, such as the effect of being featured on popular playlists or receiving high-profile endorsements, could offer a more comprehensive understanding of what drives a song’s success on streaming platforms. This holistic approach can provide a richer, multi-dimensional view of the factors contributing to a track’s popularity, enabling more accurate predictions and actionable insights for artists and industry stakeholders.
Source of dataset: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023