Project Details

Description

The aim of this project was to predict song ratings using audio features as predictors. The algorithms selected for this purpose were Linear Regression, Tree Model (default, maximal, and tuned), Bootstrap Aggregation (BAG) Model, and Random Forest Model (tuned and default, with the ranger and randomForest libraries). The evaluation metric is RMSE (root mean square error). What makes some songs rate higher than others? The dataset describes rated songs in terms of auditory features, such as key and loudness.

Goal

Construct a model using a dataset of songs to predict ratings based on auditory features of the songs included in scoringData.csv.

Metric

Submissions will be evaluated based on RMSE. The lower the RMSE, the better the model.
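
As a minimal sketch of the metric, RMSE can be computed by hand in base R. The function name `rmse_manual` and the predicted values below are made up for illustration and are not part of the original analysis.

```r
# Root mean square error between observed and predicted values.
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy example: the predictions are invented so that every error is +/- 4.
actual    <- c(36, 16, 70, 64)
predicted <- c(40, 20, 66, 60)
rmse_manual(actual, predicted)  # every squared error is 16, so the RMSE is 4
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.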

Packages Used

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(gbm)

## Loaded gbm 2.1.8.1

library(knitr)

1. Exploratory Data Analysis

In this phase, I tried to understand the case and the data structure. I divided this process into three steps: reading the literature to learn about the variables and how they might affect rating, examining the data itself in depth, and analyzing the data with descriptive statistics.

#Reading the data
songs = read.csv('analysisData.csv')
str(songs)

## 'data.frame':    19485 obs. of  19 variables:
##  $ id              : int  94500 64901 28440 19804 83560 16501 58033 67048 48848 95622 ...
##  $ performer       : chr  "Andy Williams" "Sandy Nelson" "Britney Spears" "Taylor Swift" ...
##  $ song            : chr  "......And Roses And Roses" "...And Then There Were Drums" "...Baby One More Time" "...Ready For It?" ...
##  $ genre           : chr  "['adult standards', 'brill building pop', 'easy listening', 'mellow gold']" "['rock-and-roll', 'space age pop', 'surf music']" "['dance pop', 'pop', 'post-teen pop']" "['pop', 'post-teen pop']" ...
##  $ track_duration  : num  166106 172066 211066 208186 182080 ...
##  $ track_explicit  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ danceability    : num  0.154 0.588 0.759 0.613 0.45 0.57 0.612 0.253 0.575 0.615 ...
##  $ energy          : num  0.185 0.672 0.699 0.764 0.294 0.629 0.542 0.232 0.434 0.497 ...
##  $ key             : int  5 11 0 2 7 9 5 0 5 7 ...
##  $ loudness        : num  -14.06 -17.28 -5.75 -6.51 -12.02 ...
##  $ mode            : int  1 0 0 1 1 0 1 1 1 1 ...
##  $ speechiness     : num  0.0315 0.0361 0.0307 0.136 0.0318 0.0331 0.0264 0.0318 0.0312 0.439 ...
##  $ acousticness    : num  0.911 0.00256 0.202 0.0527 0.832 0.593 0.0781 0.805 0.735 0.016 ...
##  $ instrumentalness: num  2.67e-04 7.45e-01 1.31e-04 0.00 3.53e-05 1.36e-04 0.00 1.80e-04 6.59e-05 0.00 ...
##  $ liveness        : num  0.112 0.145 0.443 0.197 0.108 0.77 0.0763 0.0939 0.105 0.312 ...
##  $ valence         : num  0.15 0.801 0.907 0.417 0.146 0.308 0.433 0.307 0.348 0.769 ...
##  $ tempo           : num  84 122 93 160 141 ...
##  $ time_signature  : int  4 4 4 4 4 4 4 3 4 3 ...
##  $ rating          : int  36 16 70 64 19 34 44 34 47 26 ...

#Data structure and distribution
library(skimr)
skim(songs)

Data summary
Name	songs
Number of rows	19485
Number of columns	19
_______________________
Column type frequency:
character	3
logical	1
numeric	15
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
performer	0	1.00	1	113	6687
song	0	1.00	1	75	16542
genre	108	0.99	2	319	2937

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
track_explicit	0	1	0.12	FAL: 17203, TRU: 2282

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	50208.92	29047.37	3.00	24812.00	50487.00	75544.00	99999.00	▇▇▇▇▇
track_duration	1	220873.01	68749.76	29688.00	175173.00	214733.00	253306.00	3079157.00	▇▁▁▁▁
danceability	1	0.60	0.15	0.00	0.50	0.61	0.71	0.99	▁▂▆▇▂
energy	1	0.62	0.20	0.00	0.48	0.63	0.78	1.00	▁▃▆▇▅
key	1	5.23	3.56	0.00	2.00	5.00	8.00	11.00	▇▃▃▅▆
loudness	1	-8.67	3.61	-28.03	-11.04	-8.21	-5.86	2.29	▁▁▅▇▁
mode	1	0.73	0.44	0.00	0.00	1.00	1.00	1.00	▃▁▁▁▇
speechiness	1	0.07	0.08	0.00	0.03	0.04	0.07	0.92	▇▁▁▁▁
acousticness	1	0.29	0.28	0.00	0.05	0.19	0.51	0.99	▇▃▂▂▁
instrumentalness	1	0.03	0.14	0.00	0.00	0.00	0.00	0.98	▇▁▁▁▁
liveness	1	0.19	0.16	0.01	0.09	0.13	0.25	1.00	▇▂▁▁▁
valence	1	0.60	0.24	0.00	0.41	0.62	0.80	0.99	▂▅▇▇▇
tempo	1	120.24	27.92	0.00	99.08	119.00	136.39	241.01	▁▃▇▂▁
time_signature	1	3.93	0.32	0.00	4.00	4.00	4.00	5.00	▁▁▁▇▁
rating	1	36.69	16.55	0.00	24.00	36.00	50.00	91.00	▃▇▇▃▁

Understand and deep dive into the data itself

Based on the output, the data has 19,485 observations and 19 variables, most of which have right-skewed distributions. The dataset contains information about each song: the song details (performer, title, genre, and id) and its audio features. The audio features fall into three broad categories:

1. Confidence measures - acousticness, liveness, speechiness, and instrumentalness
1. Perceptual measures - loudness, energy, danceability, and valence
1. Music descriptors - tempo, duration, key, mode, and time signature

Descriptive statistics

library(psych)
describe(songs)

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

In this phase, I examined the characteristics of the data through its distributions. The main metric, rating, averages 36.7 (range 0-91). The mean values also reveal interesting characteristics of the audio features of the tracks in the dataset. Most tracks are quite danceable (m: 0.60, sd: 0.15) and show high levels of energy (m: 0.62, sd: 0.20). Most tracks are also relatively quiet (m: -8.67, sd: 3.61), have little instrumental content (m: 0.03, sd: 0.14), and were not recorded live (m: 0.19, sd: 0.16). Finally, the majority of tracks have a positive, cheerful tone, as shown by high valence (m: 0.60, sd: 0.24).

Nonetheless, the descriptive statistics above do not cover the categorical variables, especially genre. The data cleaning process that comes next is therefore important for gaining insight into genre, a predictor that I hypothesized would have high predictive power.

2. Data Cleaning

In the preliminary data cleaning phase, I divided the process into four steps: checking for duplicates, converting several variables into factors, finding and filling NAs, and cleaning the genre column.

#Check Duplication
dim(songs)

## [1] 19485    19

length(unique(songs$id))

## [1] 19485

The output above shows that there are no duplicates in the dataset: the number of rows equals the number of unique ids, so each row represents a different observation.

After checking for duplicates, I converted the mode, key, and time_signature variables into factors, since they are categorical. Converting them lets the models treat them as categories rather than as continuous numbers when I construct the models later.

#Transforming Variables
songs <- songs %>% 
  mutate(mode = as.factor(mode),
         key = as.factor(key), 
         time_signature = as.factor(time_signature))

#Check NA
colSums(is.na(songs))

##               id        performer             song            genre 
##                0                0                0              108 
##   track_duration   track_explicit     danceability           energy 
##                0                0                0                0 
##              key         loudness             mode      speechiness 
##                0                0                0                0 
##     acousticness instrumentalness         liveness          valence 
##                0                0                0                0 
##            tempo   time_signature           rating 
##                0                0                0

According to the result, the only column with missing values is genre, with 108 missing values out of 19,485 rows. Although the proportion of missing data is small (0.55%), missing data can break some of the machine learning and statistical procedures performed later. It is therefore important to decide how to handle missing values before modeling.

The next step is to clean the genre column by splitting out the individual genres of each observation and assessing the predictive power of the most popular genres.
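
The missing-value share can be checked directly. The sketch below uses a small made-up data frame (`toy`) rather than analysisData.csv, and then repeats the arithmetic for the counts reported above.

```r
# Toy data frame with two missing genres out of four rows (illustrative only).
toy <- data.frame(
  genre  = c("['pop']", NA, "['rock']", NA),
  rating = c(50, 40, 30, 20),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column.
na_share <- colMeans(is.na(toy)) * 100
na_share

# The same arithmetic for the counts reported above: 108 of 19,485 rows.
genre_na_pct <- round(108 / 19485 * 100, 2)
genre_na_pct  # 0.55
```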

#Cleaning Genre column
songs$genre<-gsub("\\[|\\]","",as.character(songs$genre))
songs$genre<-gsub("'","",as.character(songs$genre))
songs$genre<-gsub(" ","",as.character(songs$genre))
songs <- songs %>% mutate(genre_clean = genre) %>% separate_rows(genre_clean, sep = ",")

Given the untapped predictive potential of the genre column, I took the conservative decision to keep the rows with a missing genre and filled those NAs with “No Genre” using the code below:

#Fill the NA
songs <- songs %>% mutate(pa = 1)
songs$genre_clean[is.na(songs$genre_clean) | songs$genre_clean == ""] <- "No Genre"
str(songs)

## tibble [95,120 × 21] (S3: tbl_df/tbl/data.frame)
##  $ id              : int [1:95120] 94500 94500 94500 94500 64901 64901 64901 28440 28440 28440 ...
##  $ performer       : chr [1:95120] "Andy Williams" "Andy Williams" "Andy Williams" "Andy Williams" ...
##  $ song            : chr [1:95120] "......And Roses And Roses" "......And Roses And Roses" "......And Roses And Roses" "......And Roses And Roses" ...
##  $ genre           : chr [1:95120] "adultstandards,brillbuildingpop,easylistening,mellowgold" "adultstandards,brillbuildingpop,easylistening,mellowgold" "adultstandards,brillbuildingpop,easylistening,mellowgold" "adultstandards,brillbuildingpop,easylistening,mellowgold" ...
##  $ track_duration  : num [1:95120] 166106 166106 166106 166106 172066 ...
##  $ track_explicit  : logi [1:95120] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ danceability    : num [1:95120] 0.154 0.154 0.154 0.154 0.588 0.588 0.588 0.759 0.759 0.759 ...
##  $ energy          : num [1:95120] 0.185 0.185 0.185 0.185 0.672 0.672 0.672 0.699 0.699 0.699 ...
##  $ key             : Factor w/ 12 levels "0","1","2","3",..: 6 6 6 6 12 12 12 1 1 1 ...
##  $ loudness        : num [1:95120] -14.1 -14.1 -14.1 -14.1 -17.3 ...
##  $ mode            : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 1 1 ...
##  $ speechiness     : num [1:95120] 0.0315 0.0315 0.0315 0.0315 0.0361 0.0361 0.0361 0.0307 0.0307 0.0307 ...
##  $ acousticness    : num [1:95120] 0.911 0.911 0.911 0.911 0.00256 0.00256 0.00256 0.202 0.202 0.202 ...
##  $ instrumentalness: num [1:95120] 0.000267 0.000267 0.000267 0.000267 0.745 0.745 0.745 0.000131 0.000131 0.000131 ...
##  $ liveness        : num [1:95120] 0.112 0.112 0.112 0.112 0.145 0.145 0.145 0.443 0.443 0.443 ...
##  $ valence         : num [1:95120] 0.15 0.15 0.15 0.15 0.801 0.801 0.801 0.907 0.907 0.907 ...
##  $ tempo           : num [1:95120] 84 84 84 84 122 ...
##  $ time_signature  : Factor w/ 5 levels "0","1","3","4",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ rating          : int [1:95120] 36 36 36 36 16 16 16 70 70 70 ...
##  $ genre_clean     : chr [1:95120] "adultstandards" "brillbuildingpop" "easylistening" "mellowgold" ...
##  $ pa              : num [1:95120] 1 1 1 1 1 1 1 1 1 1 ...

After separating the genres, I continued the EDA by finding the 20 most frequent genres.

#Top 20 Genres
library(dplyr)
top20 <- songs %>% group_by(genre_clean) %>%
  summarize(total = n()) %>%
  arrange(desc(total)) %>%
  head(20)
top20

The top 20 genres, according to the results, were mellowgold, softrock, adultstandards, rock, dancepop, pop, brillbuildingpop, soul, motown, folkrock, poprap, albumrock, rap, classicrock, quietstorm, hiphop, bubblegumpop, country, funk, and rock-and-roll.

The next step was to turn each genre into its own indicator column so that the genres could be used as predictors in the model, with the code below:

#Spread genres into one indicator column each
songs <- songs %>% pivot_wider(names_from = genre_clean, values_from = pa, values_fill = 0)
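
To make the reshaping concrete, here is a base-R sketch of the same idea on three made-up songs. It does not use the tidyr calls above; the data frame `toy` and the helper names are assumptions for illustration only.

```r
# Three made-up songs with genre strings in the same format as the raw data.
toy <- data.frame(
  id    = c(1, 2, 3),
  genre = c("['dance pop', 'pop']", "['rock']", "[]"),
  stringsAsFactors = FALSE
)

# Strip brackets, quotes, and spaces, mirroring the gsub() calls above.
cleaned <- gsub("\\[|\\]|'| ", "", toy$genre)
cleaned[is.na(cleaned) | cleaned == ""] <- "No Genre"

# One character vector of genres per song.
genre_list <- strsplit(cleaned, ",", fixed = TRUE)
all_genres <- sort(unique(unlist(genre_list)))

# 0/1 indicator matrix: one row per song, one column per genre.
indicators <- t(sapply(genre_list, function(g) as.integer(all_genres %in% g)))
colnames(indicators) <- all_genres

toy_wide <- cbind(toy["id"], indicators)
toy_wide
```

Each song ends up on a single row again, with a 1 in every genre column it belongs to and 0 elsewhere.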

Overall, the data cleaning process was quite straightforward in terms of finding missing values, checking for duplicates, and transforming variables. Turning the genre variable into columns, however, made both cleaning and processing harder, since the data keeps growing with every additional column. I also realized that removing outliers might have improved predictive power. On the other hand, the extreme values were not errors, so removing them might have discarded data that the model was supposed to learn from. This is something I need to explore in future projects.

3. Modeling Technique

In the Analysis and Techniques part, I split the process into several phases:

1. Feature selection
1. Data splitting
1. Parameters tuning and model evaluation
1. Ensembling the best performing model

I curated the most important processes for this step and left the rest in the preliminary RMD file, both to reduce computational complexity and for simplicity.

Feature Selection

I acknowledged that a good predictor is related to the outcome variable (relevant) and has little or no multicollinearity with the other predictors (non-redundant). For that reason, I started this phase with a correlation analysis (bivariate filter) to assess likely variable importance across all candidate predictors. I then compared those results against feature selection with Subset Selection and Shrinkage.

#Correlation Analysis
library(corrplot)

## corrplot 0.92 loaded

songs_corr <- songs %>% select(-c(id, performer, genre, song)) %>%
                              mutate(track_explicit = case_when(track_explicit == "FALSE" ~ 0,
                                                                track_explicit == "TRUE" ~1 )) %>%
                              mutate(key = as.numeric(key), mode = as.numeric(mode), time_signature = as.numeric(time_signature), rating = as.numeric(rating))

O = cor(songs_corr)
corrplot(O, method = 'color', order = 'alphabet')

With every column included, the correlation plot above does not produce a readable visualization. Therefore, I dropped most of the genre columns and kept only the top 10 genres.

#Correlation Analysis: Top 10 Genres
songs_clean <- songs %>% mutate(track_explicit = case_when(track_explicit == "FALSE" ~ 0,
                                            track_explicit == "TRUE" ~1 )) %>%

                          select(rating, track_duration , track_explicit , danceability , energy , key , loudness , mode , speechiness , acousticness , instrumentalness, liveness, valence , tempo,  time_signature, mellowgold , softrock , adultstandards , rock , dancepop , pop , brillbuildingpop , soul , motown , folkrock) %>% mutate(key = as.numeric(key), mode = as.numeric(mode), time_signature = as.numeric(time_signature), rating = as.numeric(rating))

M = cor(songs_clean)
corrplot(M, method = 'color', order = 'alphabet')

From the analysis, the predictors with the strongest correlation to rating were pop, dancepop, track_duration, and track_explicit. The next step was to validate the results through the Subset Selection method.

#Feature selection (Forward)
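
Reading the strongest correlations off a colour plot can be error-prone, so a numeric ranking may help. The sketch below is an illustration on synthetic data (the data frame `toy` and its coefficients are invented), not a reproduction of the results above.

```r
# Synthetic data: rating depends strongly on pop, weakly on danceability,
# and not at all on tempo (all values are invented for illustration).
set.seed(1)
n <- 200
toy <- data.frame(
  pop          = rbinom(n, 1, 0.3),
  danceability = runif(n),
  tempo        = rnorm(n, 120, 28)
)
toy$rating <- 30 + 10 * toy$pop + 5 * toy$danceability + rnorm(n, sd = 5)

# Correlation of every predictor with rating, ranked by absolute value.
cors   <- cor(toy)[, "rating"]
cors   <- cors[names(cors) != "rating"]
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```

On the real data, the same two lines after `cor()` could be applied to the matrix `M` computed above.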
start_mod = lm(rating~1,data=songs)
empty_mod = lm(rating~1,data=songs)
full_mod = lm(rating~track_duration + track_explicit + danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness+ liveness+ valence + tempo+  time_signature+ mellowgold + softrock + adultstandards + rock + dancepop + pop + brillbuildingpop + soul + motown + folkrock
, data=songs)
forwardStepwise = step(start_mod,
                       scope=list(upper=full_mod,lower=empty_mod),
                       direction='forward')
summary(forwardStepwise)

#Feature selection (Stepwise)
start_mod = lm(rating~1,data=songs)
empty_mod = lm(rating~1,data=songs)
full_mod = lm(rating~track_duration + track_explicit + danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness+ liveness+ valence + tempo+  time_signature+ mellowgold + softrock + adultstandards + rock + dancepop + pop + brillbuildingpop + soul + motown + folkrock
, data=songs)
hybridStepwise = step(start_mod,
                      scope=list(upper=full_mod,lower=empty_mod),
                      direction='both')
summary(hybridStepwise)

For the Subset Selection method, I used all main predictors along with the top 10 genres. I applied two methods: Forward Selection and Stepwise (hybrid) Variable Selection. Both gave similar results: 3 of the 24 predictors were removed, leaving 21 predictors:

pop + acousticness + rock + track_explicit + track_duration + loudness + energy + danceability + softrock + valence + instrumentalness + time_signature + adultstandards + dancepop + brillbuildingpop + liveness + key + tempo + soul + mellowgold + folkrock
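
The kept and dropped predictors can be read from the fitted object instead of from the printed summary. The sketch below shows this on the built-in mtcars data, since it has to run without analysisData.csv; the variable names are illustrative.

```r
# Forward selection on the built-in mtcars data (stand-in for the songs data).
empty_fit <- lm(mpg ~ 1, data = mtcars)
full_fit  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

fwd <- step(empty_fit,
            scope = list(lower = ~ 1, upper = ~ wt + hp + qsec + drat),
            direction = "forward",
            trace = 0)  # trace = 0 suppresses the step-by-step log

# Predictors that survived selection, and those that were left out.
kept    <- attr(terms(fwd), "term.labels")
dropped <- setdiff(attr(terms(full_fit), "term.labels"), kept)
kept
dropped
```

Applied to `forwardStepwise` and `hybridStepwise` above, the same two lines would list the 21 kept and 3 dropped predictors.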

As a second validation, I also applied Lasso shrinkage with the same set of predictors (all main predictors with top 10 genres):

#Shift into new data with top 10 genres
songs_clean <- songs %>% mutate(track_explicit = case_when(track_explicit == "FALSE" ~ 0,
                                            track_explicit == "TRUE" ~1 )) %>%  
  
                          select(id, rating, track_duration , track_explicit , danceability , energy , key , loudness , mode , speechiness , acousticness , instrumentalness, liveness, valence , tempo,  time_signature, mellowgold , softrock , adultstandards , rock , dancepop , pop , brillbuildingpop , soul , motown , folkrock)

#Lasso Feature Selection
library(glmnet)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-4

x = model.matrix(rating~.-1,data=songs_clean)
y = songs$rating
set.seed(617)
cv_lasso = cv.glmnet(x = x, 
                     y = y, 
                     alpha = 1,
                     type.measure = 'mse')

coef(cv_lasso, s = cv_lasso$lambda.1se) %>%
  round(4)

## 40 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)      33.4600
## id                .     
## track_duration    0.0000
## track_explicit    3.0080
## danceability     10.2208
## energy           -1.9381
## key0              .     
## key1              0.3116
## key2             -0.3311
## key3              .     
## key4              .     
## key5              .     
## key6              .     
## key7             -0.0357
## key8              .     
## key9              .     
## key10            -0.0160
## key11             .     
## loudness          0.4755
## mode1             .     
## speechiness       .     
## acousticness     -3.2775
## instrumentalness -4.6853
## liveness         -2.4505
## valence          -5.5397
## tempo             0.0023
## time_signature1   .     
## time_signature3  -1.2142
## time_signature4   0.8757
## time_signature5   .     
## mellowgold        1.1472
## softrock          1.5924
## adultstandards    1.5265
## rock              5.5877
## dancepop          2.4181
## pop               8.4897
## brillbuildingpop -0.9052
## soul             -0.3763
## motown            .     
## folkrock          .

According to the results, a number of coefficients were forced to exactly zero. Lasso kept fewer predictors than the Subset Selection methods, and not the same ones. I explored the modeling techniques using the features from each method and assessed the RMSE. Although feature selection eases the computational complexity, using all potential predictors (all main predictors with top 10 genres) without removing any of them produced a lower RMSE, as shown in the final model at the end of this report.
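
One way to see why Lasso can set coefficients to exactly zero is the soft-thresholding rule. The sketch below is a simplification: it assumes orthonormal predictors, which is not how glmnet fits the model above, and the coefficient values are made up.

```r
# Soft-thresholding: shrink each coefficient toward zero by lambda,
# and set it to exactly zero once its magnitude falls below lambda.
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

# Hypothetical unpenalized coefficients (not taken from the model above).
unpenalized <- c(pop = 8.5, rock = 5.6, motown = 0.2, folkrock = -0.1)

shrunk <- soft_threshold(unpenalized, lambda = 0.5)
shrunk  # pop and rock shrink by 0.5; motown and folkrock become 0
```

Small coefficients vanish entirely while large ones are only shrunk, which matches the pattern of `.` entries in the output above.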

Data Splitting

To assess the potential RMSE, I split the main dataset into training and test sets. These subsets were used to evaluate the models in subsequent phases of the analysis using RMSE. The data was split with a 75:25 ratio: 75% of the data was assigned to the train set and the remaining 25% to the test set. The splitting code is below:

songs_og = read.csv('analysisData.csv')
songs_og <- songs_og %>% 
  mutate(mode = as.factor(mode),
         key = as.factor(key), 
         time_signature = as.factor(time_signature))

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(caret)

## Loading required package: lattice

set.seed(1031)
split = createDataPartition(y = songs_og$rating, p = 0.75, list = F,groups = 10)
train = songs_og[split,]
test = songs_og[-split,]
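
For intuition, the `groups = 10` argument asks caret to stratify a numeric outcome by quantile bins before sampling. The base-R sketch below approximates that idea on simulated ratings; it is not caret's internal code, and all names are illustrative.

```r
# Simulated ratings in the same 0-91 range as the data (illustrative only).
set.seed(1031)
rating <- sample(0:91, 1000, replace = TRUE)

# Bin the outcome into (up to) 10 quantile groups.
breaks <- unique(quantile(rating, probs = seq(0, 1, length.out = 11)))
bins   <- cut(rating, breaks = breaks, include.lowest = TRUE)

# Sample about 75% of the row indices within each bin.
train_idx <- unlist(lapply(split(seq_along(rating), bins), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE)

train_rating <- rating[train_idx]
test_rating  <- rating[-train_idx]
c(train = length(train_rating), test = length(test_rating))
```

Sampling within bins keeps the rating distribution similar in both subsets, which makes the train and test RMSE easier to compare.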

Parameters tuning and model evaluation

To find the best-performing model, I worked from exploring all of the modeling techniques to choosing the model with the lowest RMSE. In the preliminary phase, I ran every technique without genre to limit computational complexity. However, the RMSE plateaued: the best RMSE I could reach without genre was about 15.1. Given that limitation, I was confident that incorporating genre into the model had untapped potential, so I cleaned and transformed the data so that every genre could be separated out and included in the model. The details of the process are below:

1. Model exploration

In this phase, I ran approximately 17 models across several modeling techniques. I used Regression Tree (default and tuned), Bootstrap Aggregation (BAG), Random Forest (default and tuned), and Boost, with all variables except genre as predictors. All of the steps and missteps are included in the Preliminary Perfect Tune RMD file that I submitted separately. I curated some of the models; below are the top 3 models with the lowest RMSE:

1. Regression Tree (rpart) with cp = 0.001, minsplit = 150, minbucket = 30, maxdepth = 20 (RMSE: 15.4)
1. Tuned Boost with gbm package (RMSE: 15.2)
1. Tuned ranger (RMSE: 15.0)

#Data Pre-Processing
train_boost <- train %>% mutate(track_explicit_factor2 =
                                  case_when(track_explicit == "FALSE" ~ 0,
                                            track_explicit == "TRUE" ~1)) %>%
                        mutate(track_explicit_factor2 = as.factor(track_explicit_factor2))

#1. Regression Tree (rpart) with cp = 0.001, minsplit = 150, minbucket = 30, maxdepth = 20 (RMSE: 15.47)
library(rpart); library(rpart.plot) ; library(Metrics)
set.seed(100)
tree5 = rpart(rating~track_duration + track_explicit_factor2  + danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness+liveness + valence + tempo +  time_signature,data = train_boost, method = 'anova', control = rpart.control(cp = 0.001, minsplit = 150, minbucket = 30, maxdepth = 20))
pred16 = predict(tree5)
rmse(actual = train_boost$rating,
     predicted = pred16)

#2. Tuned Boost with gbm package (RMSE: 15.2)

#Grid Search
library(caret)
set.seed(1031)
trControl = trainControl(method="cv",number=5)
tuneGrid = expand.grid(n.trees = 500,
                       interaction.depth = c(1,2,3),
                       shrinkage = (1:100)*0.001,
                       n.minobsinnode=c(5,10,15))
garbage = capture.output(cvModel <- train(rating~track_duration + track_explicit_factor2  + danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness+liveness + valence + tempo +  time_signature,
                                          data=train_boost,
                                          method="gbm",
                                          trControl=trControl,
                                          tuneGrid=tuneGrid))

#Modelling
set.seed(1031)
boost_tuned = gbm(rating~track_duration + track_explicit_factor2  + danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness+liveness + valence + tempo +  time_signature,
              data=train_boost,
              distribution="gaussian",
              n.trees=500,
              interaction.depth=cvModel$bestTune$interaction.depth,
              shrinkage=cvModel$bestTune$shrinkage,
              n.minobsinnode = cvModel$bestTune$n.minobsinnode)

pred_train11 = predict(boost_tuned, n.trees=500)
rmse_train_cv_boost = sqrt(mean((pred_train11 - train_boost$rating)^2)); rmse_train_cv_boost
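
The size of the gbm grid search above can be checked before running it. This sketch rebuilds the same grid and counts the model fits implied by 5-fold cross-validation; the variable names are illustrative.

```r
# Same tuning grid as the gbm search above.
tune_grid <- expand.grid(n.trees = 500,
                         interaction.depth = c(1, 2, 3),
                         shrinkage = (1:100) * 0.001,
                         n.minobsinnode = c(5, 10, 15))

n_candidates <- nrow(tune_grid)          # 1 * 3 * 100 * 3 = 900 settings
n_folds      <- 5
n_fits       <- n_candidates * n_folds   # 4500 cross-validated fits
c(candidates = n_candidates, fits = n_fits)
```

A coarser `shrinkage` sequence would cut the run time roughly in proportion to the number of values removed.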

#3. Tuned ranger (RMSE: 15.0)

#Grid Search
library(caret)
set.seed(1031)
trControl_b=trainControl(method="cv",number=5)
tuneGrid_b = expand.grid(mtry=1:10,
                       #'beta' is left out: ranger only supports it for outcomes between 0 and 1
                       splitrule = c('variance','extratrees','maxstat'),
                       min.node.size = c(2,5,10,15,20,25))
cvModel2 = train(rating ~ acousticness + loudness + energy + track_duration +
    danceability + valence + instrumentalness + time_signature +
    liveness + key + track_explicit + tempo + speechiness,
                data=train,
                method="ranger",
                num.trees=1000,
                trControl=trControl_b,
                tuneGrid=tuneGrid_b)

# set.seed(1031)
library(ranger)
cv_forest_ranger = ranger(rating~track_duration + track_explicit  + danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness+liveness + valence + tempo +  time_signature,
                          data=train,
                          num.trees = 1000,
                          mtry=cvModel2$bestTune$mtry,
                          min.node.size = cvModel2$bestTune$min.node.size,
                          splitrule = cvModel2$bestTune$splitrule)

pred_train9 = predict(cv_forest_ranger, data = train, num.trees = 1000)
rmse_train_cv_forest_ranger = sqrt(mean((pred_train9$predictions - train$rating)^2))
rmse_train_cv_forest_ranger

In this phase I learned two things. First, running predictive models without genre showed me the achievable range of RMSE: whichever model I chose, the lowest RMSE without genre was only about 15. Second, the Random Forest model (ranger/randomForest libraries) gave the best predictive results for this dataset, so I focused on it going forward.

2. Incorporating genre as predictors and performing Random Forest with both ranger and randomForest library
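
The three snippets above compute RMSE on the training rows. A hold-out check uses the same formula on rows the model has not seen. The sketch below shows the pattern with a linear model on synthetic data; the data, the model, and the names are assumptions for illustration.

```r
# Synthetic data standing in for the songs (values are invented).
set.seed(1031)
n <- 400
toy <- data.frame(energy = runif(n), loudness = rnorm(n, -8.7, 3.6))
toy$rating <- 40 + 5 * toy$energy + 0.5 * toy$loudness + rnorm(n, sd = 15)

# Roughly 75:25 split into training and test rows.
in_train  <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.75, 0.25))
train_toy <- toy[in_train, ]
test_toy  <- toy[!in_train, ]

fit  <- lm(rating ~ energy + loudness, data = train_toy)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# predict(fit) returns fitted values for the training rows;
# newdata = test_toy gives predictions for the held-out rows.
rmse_train <- rmse(train_toy$rating, predict(fit))
rmse_test  <- rmse(test_toy$rating, predict(fit, newdata = test_toy))
c(train = rmse_train, test = rmse_test)
```

For flexible models such as random forests, the training RMSE is often much lower than the test RMSE, so the test figure is the one to compare across models.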

Having identified the right model, I ran the Random Forest technique with both the ranger and randomForest libraries and compared the results across a total of 10 simulations (details in Final Modelling Perfect Tune.Rmd). I started with ranger and tried different subsets of genres, ranging from the top 10 to the top 200. For simplicity, only the top 10 and top 200 genre models built with the ranger library are displayed in this report.
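
Typing out hundreds of genre names in a formula is error-prone, so the formula could instead be built from the genre counts. The sketch below uses base R's `reformulate()`; the counts and the short list of audio features are made up for illustration.

```r
# Made-up genre counts; in the report these would come from the top-N table.
genre_counts <- c(mellowgold = 5, softrock = 4, `rock-and-roll` = 3, funk = 1)

# Names of the N most frequent genres.
top_n      <- 3
top_genres <- names(sort(genre_counts, decreasing = TRUE))[seq_len(top_n)]

# A few audio features (a subset, for brevity).
audio <- c("acousticness", "loudness", "energy")

# Backticks protect names such as rock-and-roll that are not valid R symbols.
model_formula <- reformulate(c(audio, sprintf("`%s`", top_genres)),
                             response = "rating")
model_formula
```

The resulting formula can be passed to `train()` or `ranger()` in place of a hand-written one, and changing `top_n` switches between the top 10 and top 200 variants.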

#Formatting Test Data
scoringData = read.csv('scoringData.csv')
scoringData$genre<-gsub("\\[|\\]","",as.character(scoringData$genre))
scoringData$genre<-gsub("'","",as.character(scoringData$genre))
scoringData$genre<-gsub(" ","",as.character(scoringData$genre))
scoringData <- scoringData %>% mutate(genre_clean = genre) %>% separate_rows(genre_clean, sep = ",")
scoringData <- scoringData %>% mutate(pa = 1)
scoringData$genre_clean[is.na(scoringData$genre_clean) | scoringData$genre_clean == ""] <- "No Genre"
scoringData <- scoringData %>% pivot_wider(names_from = genre_clean, values_from = pa, values_fill = 0)
scoringData_clean <- scoringData %>% mutate(track_explicit = case_when(track_explicit == "FALSE" ~ 0,
                                                                        track_explicit == "TRUE" ~1 ))

1st Approach: Using top 10 Genres (RMSE: 14.77724)

#Shift into new data with top 10 genres
songs_clean <- songs %>% mutate(track_explicit = case_when(track_explicit == "FALSE" ~ 0,
                                            track_explicit == "TRUE" ~1 )) %>%  
  
                          select(id, rating, track_duration , track_explicit , danceability , energy , key , loudness , mode , speechiness , acousticness , instrumentalness, liveness, valence , tempo,  time_signature, mellowgold , softrock , adultstandards , rock , dancepop , pop , brillbuildingpop , soul , motown , folkrock)

#Grid Search
library(caret)
library(ranger)
library(randomForest)
set.seed(1031)
trControl_c= trainControl(method="cv",number=5)
tuneGrid_c = expand.grid(mtry=1:14,
                       splitrule = c('variance','extratrees','maxstat'),
                       min.node.size = c(2,5,10,15,20,25))

cvModel3 = train(rating ~ acousticness + loudness + energy + track_duration +
    danceability + valence + instrumentalness + time_signature +
    liveness + track_explicit + tempo + mellowgold + softrock + adultstandards + rock  + dancepop + pop  + brillbuildingpop + soul ,
                data=songs_clean,
                method="ranger",
                num.trees=1000,
                trControl=trControl_c,
                tuneGrid=tuneGrid_c)

#Fit ranger with the tuned parameters
set.seed(1031)
library(ranger)
cv_forest_ranger = ranger(rating ~ acousticness + loudness + energy + track_duration +
    danceability + valence + instrumentalness + time_signature +
    liveness + track_explicit + tempo + mellowgold + softrock + adultstandards + rock  + dancepop + pop  + brillbuildingpop + soul,
                          data=songs,
                          num.trees = 1000,
                          mtry=cvModel3$bestTune$mtry,
                          min.node.size = cvModel3$bestTune$min.node.size,
                          splitrule = cvModel3$bestTune$splitrule)

# #Predict
# pred_submit_1 = predict(cv_forest_ranger, data = scoringData_clean,num.trees = 1000 )
# submissionFile_1 = data.frame(id = scoringData_clean$id, rating = pred_submit_1$predictions)
# write.csv(submissionFile_1, 'submission_1.csv', row.names = F)

Final Approach: Top 200 Genres (RMSE: 14.50943)

songs_clean2 <- songs %>% mutate(track_explicit = case_when(track_explicit == "FALSE" ~ 0,
                                            track_explicit == "TRUE" ~1 )) %>%

                          select(id, rating, track_duration , track_explicit , danceability , energy , key , loudness , mode , speechiness , acousticness , instrumentalness, liveness, valence , tempo,  time_signature, mellowgold , softrock , adultstandards , rock , dancepop , pop , brillbuildingpop , soul , motown , folkrock, poprap , albumrock , rap , classicrock , quietstorm , hiphop , bubblegumpop , country , funk , 'rock-and-roll')

#Grid Search
library(caret)
library(ranger)
library(randomForest)
set.seed(1031)
trControl_d= trainControl(method="cv",number=5)
tuneGrid_d = expand.grid(mtry=1:10,
                       splitrule = c('variance','extratrees','maxstat'),
                       min.node.size = c(2,5,10,15,20,25))

#Model Construction
cvModel4 = train(rating ~ acousticness + loudness + energy + track_duration +
    danceability + valence + instrumentalness + time_signature +
    liveness + track_explicit + tempo + mellowgold + softrock + adultstandards + rock  + dancepop + pop  + brillbuildingpop + soul + motown+ rap+ quietstorm +country + `rock-and-roll`,
                data=songs_clean2,
                method="ranger",
                num.trees=1000,
                trControl=trControl_d,
                tuneGrid=tuneGrid_d)

#Fit ranger with the tuned parameters
# set.seed(1031)
library(ranger)
cv_forest_ranger7 = ranger(rating ~ acousticness + loudness + energy + track_duration +
    danceability + valence + instrumentalness + time_signature +
    liveness + track_explicit + tempo + mellowgold + softrock + adultstandards + rock  + pop  + brillbuildingpop + soul + motown+ rap+ quietstorm +country +classicrock + hiphop + bubblegumpop +  blues + jazz + folkrock + hardrock + focus + urbancontemporary + funk + contemporarycountry + countryroad + newwavepop + countryrock + trap + southernsoul + poprock + disco +  folk + southernhiphop + lounge + classicsoul + dancepop + rockabilly + artrock + rhythmandblues + merseybeat + psychedelicrock + rootsrock + newjackswing + classicukpop + yachtrock + newromantic + gangsterrap + newwave + easylistening + neosoul + moderncountryrock + heartlandrock + permanentwave + alternativehiphop + vocaljazz +  europop + britishinvasion + countrydawn + alternativemetal  + canadianpop + phillysoul + symphonicrock + hollywood + synthpop + hardcorehiphop + glammetal + traditionalfolk + dancerock + nashvillesound + protopunk + modernrock + pianorock + metal + numetal + northernsoul + jazzfunk + boyband + dirtysouthrap + soulblues + sunshinepop + glamrock + lilith + gleeclub + classicgirlgroup + melodicrap + jazzblues + oklahomacountry + chicagosoul + edm + progressiverock + eastcoasthiphop + miamihiphop +  electricblues + girlgroup + countrypop + viralpop + canadianhiphop + chicagorap + torontorap + freestyle + classicgaragerock + tropicalhouse + classiccanadianrock + hiphouse + powerpop + queenshiphop + redneck + bluesrock + ukpop + westcoastrap + southernrock + alternativerock +
      classiccountrypop + latin + conscioushiphop + electropop + latinpop + outlawcountry + indiepop + eastcoasthiphop + poppunk + minneapolissound + albumrock + deepadultstandards + artrock + stompandholler + swing + detroithiphop + neomellow +  arkansascountry + beachmusic + indierock + phillyrap + tropical + neworleansrap + britishblues,
                          data=songs,
                          num.trees = 1000,
                          mtry=cvModel4$bestTune$mtry,
                          min.node.size = cvModel4$bestTune$min.node.size,
                          splitrule = cvModel4$bestTune$splitrule)

The insight gained from these simulations was that the genre variables do have high predictive power. As the number of genres in the model increases, the RMSE decreases, although the incremental improvement from each additional genre gradually shrinks. Additionally, although the ranger library improved the RMSE, it required a considerable amount of time to run. For that reason, I switched to the randomForest library to explore another implementation of Random Forest in the next steps.
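The formulas in this report were typed out by hand, which makes it tedious to compare models as the number of genres grows. A minimal sketch of building the formula for the top k genres programmatically is shown below. It relies on base R's reformulate(); the helper name build_genre_formula and the short example vectors are illustrative assumptions, not objects defined elsewhere in this report.

```r
# Hypothetical helper (not part of the original analysis): build a model
# formula from the audio features plus the first k genre columns.
# `genres` is assumed to be sorted from most to least frequent.
build_genre_formula <- function(audio_features, genres, k, response = "rating") {
  k <- min(k, length(genres))
  # unique() drops accidental duplicates while keeping the first occurrence
  predictors <- unique(c(audio_features, head(genres, k)))
  # Backtick non-syntactic column names such as rock-and-roll
  needs_quote <- make.names(predictors) != predictors
  predictors[needs_quote] <- paste0("`", predictors[needs_quote], "`")
  reformulate(predictors, response = response)
}

# Small illustrative vectors; the real lists would be much longer
audio <- c("acousticness", "loudness", "energy")
genres_by_freq <- c("mellowgold", "softrock", "rock-and-roll", "pop")

f2 <- build_genre_formula(audio, genres_by_freq, k = 2)
f4 <- build_genre_formula(audio, genres_by_freq, k = 4)
print(f2)
print(f4)
```

Each resulting formula could then be passed to ranger() or randomForest() in a loop over k to see where the RMSE improvement levels off.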

#randomForest library with top 30 genres (RMSE: 16.4038)
library(randomForest)
set.seed(617)
cvForest2 = randomForest(rating ~ acousticness + loudness + energy + track_duration +
    danceability + valence + instrumentalness + time_signature +
    liveness + track_explicit + tempo + mellowgold + softrock + adultstandards + rock + dancepop + pop + brillbuildingpop + soul + motown + rap + quietstorm + country + albumrock + classicrock + hiphop + bubblegumpop + funk + blues + jazz + folkrock + hardrock + focus + urbancontemporary + contemporarycountry + countryroad + newwavepop + countryrock + trap + southernsoul,
                         data = songs, ntree = 1000, mtry = 9)

# pred_submit_final9 = predict(cvForest2, newdata = scoringData_clean,ntree= 1000 )
# submissionFile_final9 = data.frame(id = scoringData_clean$id, rating = pred_submit_final9)
# write.csv(submissionFile_final9, 'submission_final9.csv', row.names = F)

In the simulations above, the randomForest library both produced a lower RMSE and drastically reduced the time needed to run the Random Forest model. Therefore, I used this library to build the best predictive model I could for this dataset.

3. Choosing the best model that incorporates the most popular genres

At this point I had already identified the right modeling technique (Random Forest with the randomForest library), run a grid search to find the best parameters, and confirmed that genre has high predictive power. The next step was to build the best model by adding the top 200 genres as predictors.

# RMSE: 14.44328
library(randomForest)
set.seed(617)
cvForest3 = randomForest(rating~ acousticness + loudness + energy + track_duration +
    danceability + valence + instrumentalness + time_signature +
    liveness + track_explicit + tempo + mellowgold + softrock + adultstandards + rock  + pop  + brillbuildingpop + soul + motown+ rap+ quietstorm +country +classicrock + hiphop + bubblegumpop +  blues + jazz + folkrock + hardrock + focus + urbancontemporary + funk + contemporarycountry + countryroad + newwavepop + countryrock + trap + southernsoul + poprock + disco +  folk + southernhiphop + lounge + classicsoul + dancepop + rockabilly + artrock + rhythmandblues + merseybeat + psychedelicrock + rootsrock + newjackswing + classicukpop + yachtrock + newromantic + gangsterrap + newwave + easylistening + neosoul + moderncountryrock + heartlandrock + permanentwave + alternativehiphop + vocaljazz +  europop + britishinvasion + countrydawn + alternativemetal  + canadianpop + phillysoul + symphonicrock + hollywood + synthpop + hardcorehiphop + glammetal + traditionalfolk + dancerock + nashvillesound + protopunk + modernrock + pianorock + metal + numetal + northernsoul + jazzfunk + boyband + dirtysouthrap + soulblues + sunshinepop + glamrock + lilith + gleeclub + classicgirlgroup + melodicrap + jazzblues + oklahomacountry + chicagosoul + edm + progressiverock + eastcoasthiphop + miamihiphop +  electricblues + girlgroup + countrypop + viralpop + canadianhiphop + chicagorap + torontorap + freestyle + classicgaragerock + tropicalhouse + classiccanadianrock + hiphouse + powerpop + queenshiphop + redneck + bluesrock + ukpop + westcoastrap + southernrock + alternativerock +
      classiccountrypop + latin + conscioushiphop + electropop + latinpop + outlawcountry + indiepop + eastcoasthiphop + poppunk + minneapolissound + albumrock + deepadultstandards + artrock + stompandholler + swing + detroithiphop + neomellow +  arkansascountry + beachmusic + indierock + phillyrap + tropical + neworleansrap + britishblues,data=songs,ntree = 1000,mtry=14)

# pred_submit_final10 = predict(cvForest3, newdata = scoringData_clean,ntree= 1000 )
# submissionFile_final10 = data.frame(id = scoringData_clean$id, rating = pred_submit_final10)
# write.csv(submissionFile_final10, 'submission_final10.csv', row.names = F)

The model above was my final model and the one with the lowest RMSE. It achieved a private score of RMSE 14.3587 and a public score of RMSE 14.44328.
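For reference, the RMSE reported throughout is the square root of the mean squared difference between actual and predicted ratings. A minimal sketch of that calculation follows; the function name rmse and the three example values are made up for illustration and are not taken from the dataset.

```r
# Root mean square error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up ratings, for illustration only
actual    <- c(30, 40, 50)
predicted <- c(33, 36, 50)

# Errors are -3, 4 and 0, so the mean squared error is 25 / 3
rmse(actual, predicted)  # about 2.89
```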

Overall Assessment

What you did right with the analysis

Taking time to explore all prediction techniques before choosing the one with the lowest RMSE, since different data structures require different prediction techniques
Dividing the data into Train and Test sets in the exploration phase to assess RMSE, then training the best-performing model on all of the data
Carrying out feature engineering in the preliminary modeling step, then using all predictors once the best model was known
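The split-then-refit workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the data frame toy is simulated, a 70/30 split is assumed, and lm() stands in for the Random Forest models so that the example runs with base R alone.

```r
set.seed(617)

# Simulated stand-in for the songs data (not the real dataset)
n <- 200
toy <- data.frame(loudness = rnorm(n), energy = runif(n))
toy$rating <- 35 + 4 * toy$loudness + 10 * toy$energy + rnorm(n, sd = 5)

# Exploration phase: hold out 30% of the rows to estimate RMSE
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit <- lm(rating ~ loudness + energy, data = train)
test_rmse <- rmse(test$rating, predict(fit, newdata = test))
print(test_rmse)

# Final phase: refit the chosen model on all available rows
final_fit <- lm(rating ~ loudness + energy, data = toy)
```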

Where you went wrong

Spending the majority of my time running a grid search for every model, which is a slow process
Not using genre as a predictor during feature engineering and while exploring prediction techniques
Using the ranger library instead of the randomForest function in the model tuning phase. In this case, ranger tended to be computationally intensive, although it gave a lower RMSE

What you would do differently

Use all genres as predictors as early as possible, starting from the feature selection phase
Another option is to apply the Pareto principle: for instance, the top 20% of genres might already cover 80% of the cases
Remove outliers in predictors with high predictive power (high variable importance)
Although it is computationally expensive, I would explore the Best Subset Selection method further to pick the best-performing model
Run grid search only a limited number of times, and use the result as a benchmark when trying predictive models
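The Pareto idea mentioned above could be checked with a cumulative coverage calculation like the one sketched here. The genre counts are invented for illustration; with the real data they would come from tabulating the genre tags (for example with table()).

```r
# Invented genre frequencies, for illustration only
genre_counts <- c(pop = 40, rock = 25, soul = 15, rap = 8, country = 4,
                  funk = 3, jazz = 2, blues = 1, disco = 1, folk = 1)

# Sort from most to least frequent and compute cumulative share of all tags
sorted_counts <- sort(genre_counts, decreasing = TRUE)
coverage <- cumsum(sorted_counts) / sum(sorted_counts)

# Smallest number of genres whose tags cover at least 80% of the total
n_needed <- which(coverage >= 0.80)[1]
top_genres <- names(sorted_counts)[seq_len(n_needed)]

print(round(coverage, 2))
print(top_genres)
```

In this toy example three of the ten genres already reach the 80% mark, which is the kind of cut-off that could be used to limit the number of genre predictors.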

Perfect Tune Modeling