What makes a song a 'hit'? The generic answer from the public tends to be, "The song must get into the right hands!"
That was true, and it still is. However, in the latter part of the past decade, the transformation of the music industry opened Pandora's box. The chances of your music being heard have jumped, thanks to online music streaming platforms. Even if you are an unsigned artist, these platforms have democratised the music-making process in the 2010s, making a potential worldwide audience available to anyone with a cheap recording device. More officially, aggregators will get your music onto the platforms for you.
The next question is: once your music is up, what can boost its chances of being widely listened to? The algorithm.
Yes, streaming platforms use mathematics and build algorithms that curate playlists based on generated audio features. Tracing back to the music-making process, any artist can use this as an opportunity to cook up something that fits the market, moving from the question "What shaped the market?" to "Which elements should we fuse into our music?"
It is rarely only one element that causes a song to rise above the competition. Why not use all the tools at your command, including Machine Learning?
In this project, we will examine the characteristics of a hit song in the 2010s using Machine Learning classification algorithms, and build a model that can predict whether a song has hit or flop potential.
But first, as a music enthusiast myself, I have a preliminary hypothesis: whatever sits on the top charts has become homogeneous and formulaic. Hits share the same characteristics: energetic, loud, and anything that gets a crowd dancing.
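For context, here is a minimal sketch of the import and first-look step that would produce the output below. The package list and file name are assumptions (they are not shown in the original chunks); adjust them to your environment.
# assumed package list for this analysis (not shown in the original chunks)
library(tidyverse)     # dplyr, tidyr, ggplot2, glimpse()
library(GGally)        # ggcorr()
library(recipes)       # recipe(), juice(), bake()
library(yardstick)     # conf_mat(), accuracy_vec(), sens_vec(), ...
library(rpart)         # decision tree
library(rattle)        # fancyRpartPlot()
library(caret)         # train(), trainControl(), varImp()
library(randomForest)
library(ROCR)

# file name is an assumption (the Kaggle "Spotify Hit Predictor" 2010s file)
spotify <- read.csv("dataset-of-10s.csv", stringsAsFactors = TRUE)

anyNA(spotify)    # any missing values? -> FALSE
glimpse(spotify)  # structure of the raw data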
#> [1] FALSE
#> Rows: 6,398
#> Columns: 19
#> $ track <fct> Wild Things, Surfboard, Love Someone, Music To My ...
#> $ artist <fct> Alessia Cara, Esquivel!, Lukas Graham, Keys N Krat...
#> $ uri <fct> spotify:track:2ZyuwVvV6Z3XJaXIFbspeE, spotify:trac...
#> $ danceability <dbl> 0.741, 0.447, 0.550, 0.502, 0.807, 0.482, 0.533, 0...
#> $ energy <dbl> 0.6260, 0.2470, 0.4150, 0.6480, 0.8870, 0.8730, 0....
#> $ key <int> 1, 5, 9, 0, 1, 0, 0, 2, 7, 8, 1, 2, 5, 0, 1, 2, 8,...
#> $ loudness <dbl> -4.826, -14.661, -6.557, -5.698, -3.892, -3.145, -...
#> $ mode <int> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,...
#> $ speechiness <dbl> 0.0886, 0.0346, 0.0520, 0.0527, 0.2750, 0.0853, 0....
#> $ acousticness <dbl> 0.020000, 0.871000, 0.161000, 0.005130, 0.003810, ...
#> $ instrumentalness <dbl> 0.00e+00, 8.14e-01, 0.00e+00, 0.00e+00, 0.00e+00, ...
#> $ liveness <dbl> 0.0828, 0.0946, 0.1080, 0.2040, 0.3910, 0.4090, 0....
#> $ valence <dbl> 0.7060, 0.2500, 0.2740, 0.2910, 0.7800, 0.7370, 0....
#> $ tempo <dbl> 108.029, 155.489, 172.065, 91.837, 160.517, 165.08...
#> $ duration_ms <int> 188493, 176880, 205463, 193043, 144244, 214320, 26...
#> $ time_signature <int> 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
#> $ chorus_hit <dbl> 41.18681, 33.18083, 44.89147, 29.52521, 24.99199, ...
#> $ sections <int> 10, 9, 9, 7, 8, 12, 14, 10, 11, 9, 10, 13, 12, 8, ...
#> $ target <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,...
Variable explanations:
* Energy : Represents the perceived intensity of a track. The higher the value, the more energetic the track. Typically, tracks with loud noise and fast tempo, such as metal, have high energy, while on the opposite end of the spectrum a Bach prelude scores low on the scale.
* Danceability : Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
* Loudness : The overall loudness of a track, stated in decibels (dB).
* Liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
* Valence : Describes the musical positiveness conveyed by a track. A high score means happy, cheerful, euphoric, etc.; a low score means sad, depressed, etc.
* Acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence that the track is acoustic.
* Speechiness : Detects the presence of spoken words in a track. The range goes from 0.0 (non-speech-like, entirely music) to 1.0 (exclusively speech-like). Typically, the value attributed to a song is below 0.33, unless it is rap music, which may reach 0.66.
* Duration_ms : The duration of the track in milliseconds.
* Time_signature : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
* Chorus_hit : The author's best estimate of when the chorus starts, i.e. the timestamp of the start of the third section of the track. This feature was extracted from the Audio Analysis API call for that particular track.
* Sections : The number of sections the particular track has. This feature was extracted from the Audio Analysis API call for that particular track.
* Target : The target variable for the track. It can be either '0' or '1'. '1' implies that the song featured at least once in the weekly Billboard Hot-100 list in that decade and is therefore a 'hit'. '0' implies that the track is a 'flop'. The author's conditions for a track being a 'flop' are as follows:
  - The track must not appear in the 'hit' list of that decade.
  - The track's artist must not appear in the 'hit' list of that decade.
  - The track must belong to a genre that could be considered non-mainstream and/or avant-garde.
  - The track's genre must not have a song in the 'hit' list.
  - The track must have 'US' as one of its markets.
# data cleaning: drop duplicate track/artist pairs, remove uri,
# keep target + the 15 audio features, and relabel the target
spotify_clean <- spotify[!duplicated(spotify[, c('track','artist')]), ] %>%
  mutate(target = as.factor(target)) %>%
  select(-uri) %>%
  select(18, 3:17) %>%   # column 18 = target, columns 3:17 = audio features
  mutate(target = recode(target, "1" = "Hit",
                         "0" = "Flop"))
spotify_clean
The proportion of each target class is balanced enough to proceed to the next step.
#>
#> Flop Hit
#> 3059 3199
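The class counts above can be reproduced with a simple call (a sketch):
# class balance of the cleaned data
table(spotify_clean$target)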
Let's check which variables are going to be useful from a statistical and business perspective.
# data cleaning and manipulation
spotify_clean %>%
select(target, names(.)[c(2:16)]) %>%
pivot_longer(2:16) %>%
ggplot(aes(x = value)) +
geom_density(aes(color = target)) +
facet_wrap(~name, ncol = 3, scales="free") +
labs(title = "Audio Features - Hit or Flop", x = "", y = "density") +
theme_minimal() +
theme(axis.text.y = element_blank())
In the 2010s, songs show broadly similar feature distributions regardless of whether they made it or not. The trend leaned towards intense energy and tempo, high danceability, loudness, and happiness; anything that makes the room more lively.
In this case, we may also note that instrumentalness is unlikely to help much in our project. That makes sense, since our data consists of songs, which score near zero on this feature almost across the board.
ggcorr(NULL, cor_matrix = cor(spotify_clean[,c(2:16)]),
label = T, geom = "blank", hjust = 0.90) +
geom_point(size = 10, aes(color = coefficient > 0,
alpha = abs(coefficient) > 0.5)) +
scale_alpha_manual(values = c("TRUE" = 0.25, "FALSE" = 0)) +
guides(color = FALSE, alpha = FALSE)
Our correlation matrix above also shows that most of the audio features are fairly independent, except for a strong positive correlation between energy and loudness (one of which we will have to drop), as well as negative correlations involving acousticness, speechiness, and loudness, which are understandable.
Now, it is time to find the best classification model to know whether audio features can help identifying a success of a song and finding the relevancy of each audio features in the process.
Classification algorithms that may help us achieve our goal while offering good interpretability are the Decision Tree and the Random Forest.
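The modelling data frame spotify_clean_model used below is not defined in the chunks shown. Judging from the discussion above and the later model summaries (which list 12 predictors), it most likely drops loudness, instrumentalness, and sections; the following is a sketch under that assumption.
# assumption: build the modelling set by dropping loudness (collinear with
# energy), instrumentalness (near zero for most songs), and sections,
# leaving the 12 predictors that appear in the model summaries below
spotify_clean_model <- spotify_clean %>%
  select(-loudness, -instrumentalness, -sections)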
Before we build our model, we should split the dataset into training and test data. In this step, we will split the data into 80% training and 20% test proportion with set.seed(100).
RNGkind(sample.kind = "Rounding")
set.seed(100)
# your code here
idx <- sample(nrow(spotify_clean_model), nrow(spotify_clean_model)*0.8)
spotify_train<- spotify_clean_model[idx,]
spotify_test <- spotify_clean_model[-idx,]
spotify_recipe <- recipe(target ~ ., spotify_train) %>%
step_scale(all_numeric()) %>%
prep()
spotify_train <- juice(spotify_recipe)
spotify_test <- bake(spotify_recipe, spotify_test)
#>
#> Flop Hit
#> 0.4882141 0.5117859
#>
#> Flop Hit
#> 0.4912141 0.5087859
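The two proportion tables above presumably come from checking the class balance of the training and test sets; a sketch:
# class proportions in the training and test splits
prop.table(table(spotify_train$target))
prop.table(table(spotify_test$target))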
# model building
spotify_dt <- rpart(target ~ ., spotify_train)
fancyRpartPlot(spotify_dt, sub = NULL)
From the plot we notice that the most important predictor of a song's success in the past decade is danceability, followed by duration, acousticness, and energy.
# model fitting
pred_dt <- predict(spotify_dt, newdata = spotify_test, type = "class")
prob_dt <- predict(spotify_dt, newdata = spotify_test, type = "prob")
# result
spotify_dt_table <- select(spotify_test, target) %>%
bind_cols(target_pred = pred_dt) %>%
bind_cols(target_eprob = round(prob_dt[,1],4)) %>%
bind_cols(target_pprob = round(prob_dt[,2],4))
# performance evaluation - confusion matrix
spotify_dt_table %>%
conf_mat(target, target_pred) %>%
autoplot(type = "heatmap")
spotify_dt_table %>%
summarise(
accuracy = accuracy_vec(target, target_pred),
sensitivity = sens_vec(target, target_pred),
specificity = spec_vec(target, target_pred),
precision = precision_vec(target, target_pred))
# ROC
spotify_dt_roc <- data.frame(prediction=round(prob_dt[,1],4),
trueclass=as.numeric(spotify_dt_table$target=="Flop"))
spotify_dt_roc
spotify_dt_roc <- ROCR::prediction(spotify_dt_roc$prediction,
spotify_dt_roc$trueclass)
auc_ROCR_dt <- performance(spotify_dt_roc, measure = "auc")
auc_ROCR_dt <- auc_ROCR_dt@y.values[[1]]
final_dt <- spotify_dt_table %>%
summarise(accuracy = accuracy_vec(target, target_pred),
sensitivity = sens_vec(target, target_pred),
specificity = spec_vec(target, target_pred),
precision = precision_vec(target, target_pred)) %>%
cbind(AUC = auc_ROCR_dt)
Unfortunately, single trees are known to be less accurate than other classical methods and to have high variance. To improve their performance, there are methods such as Bagging, Boosting, and Random Forests. Since Bagging is a particular case of Random Forests (when m = p) and bagged trees tend to be correlated, Bagging alone is unlikely to give us the best ensemble. Hence, we are going to use Random Forests only.
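As a side note, the claim that bagging is the special case m = p can be illustrated with randomForest by setting mtry to the number of predictors; this sketch was not part of the original analysis and is shown only for illustration.
# illustration only: bagging = a random forest with mtry = number of predictors
n_pred <- ncol(spotify_train) - 1                      # all columns except target
bagged <- randomForest(target ~ ., data = spotify_train,
                       mtry = n_pred, ntree = 100)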
Let’s see whether the accuracy is better by using Random Forest.
# model building
RNGkind(sample.kind = "Rounding")
set.seed(417)
ctrl <- trainControl(method="repeatedcv", number=4, repeats=4) # k-fold cross validation
forest <- train(target ~ ., data=spotify_train, method="rf", trControl = ctrl)
forest
#> Random Forest
#>
#> 5006 samples
#> 12 predictor
#> 2 classes: 'Flop', 'Hit'
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold, repeated 4 times)
#> Summary of sample sizes: 3755, 3754, 3754, 3755, 3754, 3754, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.7860076 0.5702489
#> 7 0.7842107 0.5669088
#> 12 0.7806152 0.5597047
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
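The second summary below comes from a randomForest model fitted directly, outside caret; the corresponding chunk is not shown, but the printed call suggests something like the sketch below. The seed is an assumption, so the exact OOB numbers may differ.
# direct randomForest fit matching the printed call below (seed assumed)
set.seed(100)
spotify_rf <- randomForest(target ~ ., data = spotify_train,
                           ntree = 100, importance = TRUE)
spotify_rf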
#>
#> Call:
#> randomForest(formula = target ~ ., data = spotify_train, ntree = 100, importance = TRUE)
#> Type of random forest: classification
#> Number of trees: 100
#> No. of variables tried at each split: 3
#>
#> OOB estimate of error rate: 21.07%
#> Confusion matrix:
#> Flop Hit class.error
#> Flop 1762 682 0.2790507
#> Hit 373 2189 0.1455894
From the model summary, we know that the optimum number of variables considered for splitting at each tree node is 2. We can also inspect the importance of each variable that was used in our random forest using varImp().
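A sketch of that call on the fitted caret object:
# variable importance from the trained caret random forest
varImp(forest)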
#> rf variable importance
#>
#> Overall
#> acousticness 100.000
#> danceability 93.208
#> energy 84.775
#> duration_ms 80.778
#> valence 63.008
#> speechiness 45.247
#> tempo 39.505
#> chorus_hit 37.733
#> liveness 36.055
#> key 19.655
#> time_signature 2.186
#> mode 0.000
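The summary that follows is the final model stored inside the caret object, which can be printed directly (a sketch):
# the final randomForest model selected and refit by caret (mtry = 2)
forest$finalModel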
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 2
#>
#> OOB estimate of error rate: 21.27%
#> Confusion matrix:
#> Flop Hit class.error
#> Flop 1719 725 0.2966448
#> Hit 340 2222 0.1327088
# model fitting
pred_forest <- predict(forest, newdata = spotify_test, type = "raw")
prob_forest <- predict(forest, newdata = spotify_test, type = "prob")
# result
spotify_forest_table <- select(spotify_test, target) %>%
bind_cols(target_pred = pred_forest) %>%
bind_cols(target_eprob = round(prob_forest[,1],4)) %>%
bind_cols(target_pprob = round(prob_forest[,2],4))
# performance evaluation - confusion matrix
spotify_forest_table %>%
conf_mat(target, target_pred) %>%
autoplot(type = "heatmap")
spotify_forest_table %>%
summarise(
accuracy = accuracy_vec(target, target_pred),
sensitivity = sens_vec(target, target_pred),
specificity = spec_vec(target, target_pred),
precision = precision_vec(target, target_pred))
# ROC
spotify_forest_roc <- data.frame(prediction=round(prob_forest[,1],4),
trueclass=as.numeric(spotify_forest_table$target=="Flop"))
spotify_forest_roc
spotify_forest_roc <- ROCR::prediction(spotify_forest_roc$prediction,
spotify_forest_roc$trueclass)
# ROC curve
plot(performance(spotify_forest_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
# AUC
auc_ROCR_f <- performance(spotify_forest_roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
final_f <- spotify_forest_table %>%
summarise(
accuracy = accuracy_vec(target, target_pred),
sensitivity = sens_vec(target, target_pred),
specificity = spec_vec(target, target_pred),
precision = precision_vec(target, target_pred)) %>%
cbind(AUC = auc_ROCR_f)
Comparing the metric summaries of both models (combined in the sketch below), the predictive model built with the Random Forest algorithm gave the best result. It reached the highest accuracy at 79% while maintaining sensitivity, specificity, and precision above 70%, and it also gave the highest AUC at 86%. Therefore, the best model for Hit Song Prediction based on audio features is the Random Forest model.
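For reference, a sketch of how the two metric summaries can be stacked into one comparison table, using the objects defined above:
# side-by-side comparison of the decision tree and random forest metrics
rbind(decision_tree = final_dt, random_forest = final_f)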
We know that by using Random Forest, our model can classify whether a song is a hit or not with roughly 78% accuracy. However, what does that mean? What actually embodies a hit track?
# usefulness in splitting the data
importance_dt <- data.frame(importance = spotify_dt$variable.importance)
importance_dt$feature <- row.names(importance_dt)
# mean decrease in impurity
importance_rf <- varImp(forest)$importance
importance_rf$feature <- row.names(importance_rf)
importance_rf <- importance_rf %>%
rename_at("Overall", ~"importance")
importance_comb <- importance_dt %>%
select(feature, importance) %>%
right_join(importance_rf, by = "feature") %>%
mutate_at(2, ~replace(., is.na(.), 0)) %>%
rename_at("importance.x", ~"decision_tree") %>%
rename_at("importance.y", ~"random_forest") %>%
mutate_if(is.numeric, scale, center = TRUE) %>%
pivot_longer(cols = c('decision_tree', 'random_forest')) %>%
dplyr::rename('model' = 'name') %>%
ggplot(aes(x = reorder(feature, value, mean, na.rm = TRUE), y = value, color = model)) +
geom_point(size = 2) +
coord_flip() +
labs(title = 'Variable Importance by Model',
subtitle = 'Scaled for comparison',
y = 'Scaled value', x = '') +
theme_minimal()
importance_comb
This project will be further developed by incorporating PCA (Principal Component Analysis) of the audio features and by trying to predict the chance that hit tracks from previous decades would still be hits in this time and period.