Code is embedded in the body of the report for illustration and ease of understanding.

When I started the competition, my immediate mindset was to understand the linear relationships between the provided predictors and the predicted variable, rating. After reviewing the structure of the songs and scoring datasets, I created a correlation matrix with the usable variables, looking for strong relationships with rating. Without manipulating the datasets, the usable non-character variables were columns 5 through 18.

# Library Loading
library(tidyr); library(dplyr); library(ggplot2); library(lattice); library(caret)
library(broom); library(car); library(leaps); library(glmnet); library(randomForest)
library(mgcv); library(stringr)

# Loading the data
songs <- read.csv('/Users/jacobbrandt/Downloads/University/Columbia/CU Y5S1/APAN5200 - Framework Methods/Predictive Competition/lalasongs22/analysisData.csv', header = TRUE)
str(songs)

# Splitting Data
set.seed(1031)
split = createDataPartition(y=songs$rating,p = 0.7,list = F,groups = 100)
train = songs[split,]
nrow(train)
test = songs[-split,]
nrow(test)

# Correlation Matrix
library(ggcorrplot)
# Heat map of pairwise correlations among the usable predictors
ggcorrplot(cor(train[,c(5:18)]),
           type = 'lower',
           show.diag = F,
           colors = c('red','white','darkgreen'))
# Rebuild the correlation table in long form for a labeled heat map
corMatrix = as.data.frame(cor(train[,c(5:19)]))
corMatrix$var1 = rownames(corMatrix)
correlation_plot <- corMatrix %>%
  gather(key=var2,value=r,1:15)%>%
  arrange(var1,desc(var2))%>%
  ggplot(aes(x=var1,y=reorder(var2, order(var2,decreasing=F)),fill=r))+
  geom_tile()+
  geom_text(aes(label=round(r,2)),size=3)+
  scale_fill_gradientn(colours = c('#d7191c','#fdae61','#ffffbf','#a6d96a','#1a9641'))+
  theme(axis.text.x=element_text(angle=75,hjust = 1))+xlab('')+ylab('')
correlation_plot # print the plot; the assignment alone does not display it

Although I did not record my first attempt, I recorded each attempt after it. In my second attempt, fitting a basic linear regression model, lm(), with some of the correlated predictors (track_explicit, speechiness, track_duration, time_signature, loudness, energy, mode, and tempo), I got an RMSE score of 15.77155. After that attempt, I became more strategic and started testing the predictors recommended by the three stepwise regression models we learned in class (forward, backward, and hybrid), as well as best subset selection with nvmax set to 6 and without nvmax, lasso, principal component analysis, and ridge regression. After testing each of these approaches, I applied them to the generalized additive model, gam(), once with the REML method and once with the GACV.Cp method. Applying a smoothing term to the most highly correlated variables, and including almost every variable from columns 5 through 18, the gam() achieved an RMSE score of 15.40908.
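
For illustration, here is a minimal sketch of two of those selection approaches, forward stepwise selection and the lasso. It assumes, as above, that columns 5 through 18 hold the candidate predictors; the exact calls are reconstructions, not my recorded attempts.

# Forward stepwise selection and lasso (sketch)
sel_cols = unique(c(5:18, which(names(train) == 'rating'))) # predictors plus the outcome
start_mod = lm(rating ~ 1, data = train[, sel_cols]) # intercept-only starting point
full_mod  = lm(rating ~ ., data = train[, sel_cols]) # all candidate predictors
forward_mod = step(start_mod,
                   scope = list(lower = formula(start_mod), upper = formula(full_mod)),
                   direction = 'forward')

# glmnet needs a numeric model matrix and a response vector
x = model.matrix(rating ~ . - 1, data = train[, sel_cols])
y = train$rating
cv_lasso = cv.glmnet(x, y, alpha = 1)   # alpha = 1 is the lasso penalty
coef(cv_lasso, s = cv_lasso$lambda.min) # predictors retained at the best lambda

The gam() that reached 15.40908, fit with the REML method: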

model <- gam(rating ~ s(loudness) + s(track_duration) + danceability + valence +
               s(acousticness) + s(instrumentalness) + time_signature + s(energy) +
               track_explicit + s(liveness) + tempo + speechiness + mode,
             data = train, method = 'REML')
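
The GACV.Cp run mentioned above differs only in the method argument; a sketch with the same formula:

model_gacv <- gam(rating ~ s(loudness) + s(track_duration) + danceability + valence +
                    s(acousticness) + s(instrumentalness) + time_signature + s(energy) +
                    track_explicit + s(liveness) + tempo + speechiness + mode,
                  data = train, method = 'GACV.Cp')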

Through trial and error, this was the best RMSE score I could attain without major data manipulation, using only a 70-30 split of the songs dataset into train and test. Continuing from this, I attempted a randomForest with 500 trees, again using only columns 5 through 18, simply substituting randomForest() where the gam() model had been. This resulted in a score of 15.22329. It was not until after this that I realized the songs dataset was the training dataset and the scoringData dataset was the test dataset. I then researched how to handle the missing data in the genre column and how to separate strings of text so that the genres could be turned into dummy variables, which, as we learned in class, is best for modeling purposes.
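
A minimal sketch of that randomForest attempt, reusing sel_cols from the stepwise sketch above (the exact call I ran was not recorded):

# randomForest with 500 trees on the numeric predictors (sketch)
set.seed(1031)
rf_model = randomForest(rating ~ ., data = train[, sel_cols], ntree = 500)
pred_rf  = predict(rf_model, newdata = test[, sel_cols])
sqrt(mean((pred_rf - test$rating)^2)) # holdout RMSE estimate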

In this new version of my model, I used a tuned ranger random forest with 50 trees on cleaned data. First, I took care of missing values in the genre column by inserting the value "No Genre"; then I separated the genre types into dummy variables using string detection (str_detect()) for the more common types, such as pop, rock, rap, hip hop, country, folk, adult standards, and funk. After doing this, I realized the same strategy could be applied to the performer column, which created columns for Glee Cast, Taylor Swift, Drake, and The Beatles, the performers with the most songs in the songs dataset. The new train dataset contains 31 columns, and the test dataset contains 30. This was my best attempt, achieving a score of 14.91537.

# Scored in competition. This was my best attempt. This is the entire model.
### Tuned Ranger Forest 50 Trees, Separating Performer and Genre

## Cleaning the training data
library(tidyr); library(dplyr); library(ggplot2); library(lattice); library(caret); library(broom); library(car); library(leaps); library(glmnet)
library(mgcv); library(stringr); library("randomForest"); library(gbm); library(ranger)

songs <- read.csv('/Users/jacobbrandt/Downloads/University/Columbia/CU Y5S1/APAN5200 - Framework Methods/Predictive Competition/lalasongs22/analysisData.csv', header = TRUE)
str(songs)
songs[10,]

# Replace empty-bracket genres with an explicit "No Genre" value
songs$genre[songs$genre == '[]'] <- "No Genre"
songs[10,]
sum(is.na(songs$genre))

songs$genre[is.na(songs$genre)] <- "No Genre" 
sum(is.na(songs$genre))


songs_build_genre <- songs %>%
  mutate( pop = ifelse(str_detect(genre, 'pop'), 1, 0),
          rock = ifelse(str_detect(genre, 'rock'),1,0),
          rap = ifelse(str_detect(genre, 'rap'),1,0),
          hip_hop = ifelse(str_detect(genre, 'hip hop'),1,0),
          country = ifelse(str_detect(genre, 'country'),1,0),
          folk = ifelse(str_detect(genre, 'folk'),1,0),
          adult_standards = ifelse(str_detect(genre, 'adult standards'),1,0),
          funk = ifelse(str_detect(genre, 'funk'),1,0),
          Glee_Cast = ifelse(str_detect(performer, 'Glee Cast'),1,0),
          T_Swift = ifelse(str_detect(performer, 'Taylor Swift'),1,0),
          Drake = ifelse(str_detect(performer, 'Drake'),1,0),
          Beatles = ifelse(str_detect(performer, 'The Beatles'),1,0),
          explicit = ifelse(str_detect(genre, 'TRUE'),1,0)) # note: genre never contains 'TRUE', so this flag is always 0; track_explicit was likely the intended source
str(songs_build_genre)
songs_build_genre[10,]
str(songs_build_genre[,c(5, 7:32)])

# To see the most common performers in the dataset:
artist_counts <- songs_build_genre %>%
  count(performer, sort = TRUE) # one row per performer, sorted by number of songs
artist_counts
##

## Cleaning the scoring (test) data
scoringData = read.csv('/Users/jacobbrandt/Downloads/University/Columbia/CU Y5S1/APAN5200 - Framework Methods/Predictive Competition/lalasongs22/scoringData.csv')

# Mirror the genre cleaning applied to the songs data
scoringData$genre[scoringData$genre == '[]'] <- "No Genre"
sum(is.na(scoringData$genre))

scoringData$genre[is.na(scoringData$genre)] <- "No Genre" 
sum(is.na(scoringData$genre))

scoringData_build_genre <- scoringData %>%
  mutate( pop = ifelse(str_detect(genre, 'pop'), 1, 0),
          rock = ifelse(str_detect(genre, 'rock'),1,0),
          rap = ifelse(str_detect(genre, 'rap'),1,0),
          hip_hop = ifelse(str_detect(genre, 'hip hop'),1,0),
          country = ifelse(str_detect(genre, 'country'),1,0),
          folk = ifelse(str_detect(genre, 'folk'),1,0),
          adult_standards = ifelse(str_detect(genre, 'adult standards'),1,0),
          funk = ifelse(str_detect(genre, 'funk'),1,0),
          Glee_Cast = ifelse(str_detect(performer, 'Glee Cast'),1,0),
          T_Swift = ifelse(str_detect(performer, 'Taylor Swift'),1,0),
          Drake = ifelse(str_detect(performer, 'Drake'),1,0),
          Beatles = ifelse(str_detect(performer, 'The Beatles'),1,0),
          explicit = ifelse(str_detect(genre, 'TRUE'),1,0)) # note: genre never contains 'TRUE', so this flag is always 0; track_explicit was likely the intended source
nrow(scoringData_build_genre)
str(scoringData_build_genre)
str(scoringData_build_genre[,c(5, 7:31)])
##


## Trying a model
trControl=trainControl(method="cv",number=5)
tuneGrid = expand.grid(mtry=1:4, 
                       splitrule = c('variance','extratrees','maxstat'), 
                       min.node.size = c(2,5,10,15,20,25))
set.seed(617)
cvModel = train(rating~.,
                data=songs_build_genre[,c(5, 7:32)],
                method="ranger",
                num.trees=50,
                trControl=trControl,
                tuneGrid=tuneGrid ) # will take a while to run (computing power)
# Refit a single ranger forest with the hyperparameters chosen by cross-validation
cv_forest_ranger = ranger(rating~.,
                          data=songs_build_genre[,c(5, 7:32)],
                          num.trees = 50, 
                          mtry=cvModel$bestTune$mtry, 
                          min.node.size = cvModel$bestTune$min.node.size, 
                          splitrule = cvModel$bestTune$splitrule)

## Making Predictions

pred = predict(cv_forest_ranger, data= scoringData_build_genre[,c(5, 7:31)], num.trees = 50)
str(pred)
#rmse_cv_forest_ranger = sqrt(mean((pred$predictions-scoringData_build_genre$rating)^2)); rmse_cv_forest_ranger # cannot be computed: scoringData has no rating column


## Writing the submission file
submissionFile = data.frame(id = scoringData_build_genre$id, rating = pred$predictions)
head(submissionFile) # preview a few rows instead of printing the whole frame
write.csv(submissionFile, '/Users/jacobbrandt/Downloads/University/Columbia/CU Y5S1/APAN5200 - Framework Methods/Predictive Competition/Jacob_Brandt_submission.csv',row.names = F)
#row.names = F will eliminate the default R column that labels the number of each row

#14.91537 publicly scored
#14.99145 privately scored

After the competition concluded, I realized a major misconception that could explain why my models, though improving, did not show a corresponding improvement in RMSE. After moving on from the regression models, I neglected the 70-30 split that we were taught helps evaluate model predictions. In my linear model approaches, the initial split allowed for test predictions because I split the songs dataset, which contained the rating figures; the holdout prediction was called "predPRE_test" and it allowed for an estimate of RMSE. When the models became more complicated, I could no longer run that estimate, because the scoring dataset did not contain the rating column. Had I kept that initial split of the songs dataset, it would have been easier to understand the effectiveness of a model before applying it to the scoring data.

Hence, the following should have been included in the workflow:

# Splitting Data
set.seed(1031)
split = createDataPartition(y=songs$rating,p = 0.7,list = F,groups = 100)
train = songs[split,]
nrow(train)
test = songs[-split,]
nrow(test)
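
With that split in place, a holdout RMSE estimate takes two lines. Here fitted_model is a hypothetical placeholder for any model whose predict() returns numeric predictions (for ranger, index into $predictions instead, as in the improved model below):

# Holdout RMSE on the 30% test split (sketch; fitted_model is a placeholder)
predPRE_test = predict(fitted_model, newdata = test)
sqrt(mean((predPRE_test - test$rating)^2))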

What probably reinforced this misconception was my attempt at a model with XGBoost, where I ran into problems with the genre dummy variables. After researching, I found that the following lines can create dummy variables out of all the genres; however, not all genres were shared between the songs and scoring datasets. Additionally, with this method of genre separation, some genres were not grouped into their unique values, so repeated genres created roughly eight thousand total columns in the songs data and about eight hundred columns in the scoring data. To avoid this issue in the future, I would look into how to force recognition of unique values and what type of join, such as an inner join, could be applied across the two datasets (see the sketch after the code below).

scoringData$genre[scoringData$genre == '[]'] <- "No Genre"
sum(is.na(scoringData$genre))

scoringData$genre[is.na(scoringData$genre)] <- "No Genre" 
sum(is.na(scoringData$genre))

scoring_Gcol <- scoringData %>%
  mutate(clean_genre = gsub("\\[|\\]", "", scoringData$genre)) # Removes the brackets from the elements in the genre column
#scoring_Gcol[3,] checking to see if it worked

scoring_build_genre <- scoring_Gcol %>%
  mutate(row = row_number()) %>%
  separate_rows(clean_genre, sep = ',') %>% # splitting on ',' leaves leading spaces, so ' rock' and 'rock' become separate columns below
  pivot_wider(names_from = clean_genre, values_from = clean_genre,
              values_fn = function(x) 1, values_fill = 0) %>% # pivot_wider() spreads each genre into its own 0/1 column; the same works for the 'performer' column
  select(-row) %>%
  mutate(explicit = ifelse(str_detect(genre, 'TRUE'),1,0)) # not needed, but online sources suggested XGBoost's label should be the last column; note genre never contains 'TRUE', so this flag is always 0 (track_explicit was likely intended)
scoring_build_genre[is.na(scoring_build_genre)] <- 0
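
As a sketch of the fix I would pursue: the duplicated genre columns come from the whitespace left by splitting on ',', and the unshared columns can be handled with an inner join on column names. str_trim() and intersect() are my assumed remedies here, not code I ran during the competition, and songs_build_genre stands for a songs-side frame built the same pivot_wider() way:

# Collapse duplicate genres by trimming whitespace before pivoting (sketch)
scoring_Gcol <- scoring_Gcol %>%
  mutate(clean_genre = str_trim(clean_genre)) # ' rock' and 'rock' now pivot into one column

# Keep only the columns present in BOTH datasets (an inner join on names);
# the outcome exists only on the songs side, so keep it separately as the label
shared_cols <- intersect(names(songs_build_genre), names(scoring_build_genre))
songs_aligned   <- songs_build_genre[, shared_cols]
scoring_aligned <- scoring_build_genre[, shared_cols]
label <- songs_build_genre$rating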

Despite these challenges, this competition helped me hone my skills in R, and through a series of 31 trial-and-error attempts, it sharpened my conceptualization of how functions work together, step by step, to build effective and efficient models.

Based on what I learned from this competition and from my mistakes, I revised the code used in my best submission and present it below. Ideally, I would have continued to pursue my methods of genre cleaning and found a way to separate the genres while grouping unique genres together.

### Improved model - Ran in Kaggle Cloud Coding Notebook
library(tidyr); library(dplyr); library(ggplot2); library(lattice); library(caret); library(broom); library(car); library(leaps); library(glmnet)
library(mgcv); library(stringr); library(randomForest); library(gbm); library(ranger)

songs <- read.csv('../input/lalasongs22/analysisData.csv', header = TRUE)
scoringData = read.csv('../input/lalasongs22/scoringData.csv')

## Cleaning the training and scoring data
songs$genre[songs$genre == '[]'] <- "No Genre"
songs[10,]
sum(is.na(songs$genre))

songs$genre[is.na(songs$genre)] <- "No Genre" 
sum(is.na(songs$genre))

songs_build_genre <- songs %>%
  mutate( pop = ifelse(str_detect(genre, 'pop'), 1, 0),
          rock = ifelse(str_detect(genre, 'rock'),1,0),
          rap = ifelse(str_detect(genre, 'rap'),1,0),
          hip_hop = ifelse(str_detect(genre, 'hip hop'),1,0),
          country = ifelse(str_detect(genre, 'country'),1,0),
          folk = ifelse(str_detect(genre, 'folk'),1,0),
          adult_standards = ifelse(str_detect(genre, 'adult standards'),1,0),
          funk = ifelse(str_detect(genre, 'funk'),1,0),
          Glee_Cast = ifelse(str_detect(performer, 'Glee Cast'),1,0),
          T_Swift = ifelse(str_detect(performer, 'Taylor Swift'),1,0),
          Drake = ifelse(str_detect(performer, 'Drake'),1,0),
          Beatles = ifelse(str_detect(performer, 'The Beatles'),1,0),
          explicit = as.integer(as.logical(track_explicit))) # fixed: derive the flag from track_explicit; the earlier str_detect(genre, 'TRUE') never matched

scoringData$genre[scoringData$genre == '[]'] <- "No Genre"
sum(is.na(scoringData$genre))

scoringData$genre[is.na(scoringData$genre)] <- "No Genre" 
sum(is.na(scoringData$genre))

scoringData_build_genre <- scoringData %>%
  mutate( pop = ifelse(str_detect(genre, 'pop'), 1, 0),
          rock = ifelse(str_detect(genre, 'rock'),1,0),
          rap = ifelse(str_detect(genre, 'rap'),1,0),
          hip_hop = ifelse(str_detect(genre, 'hip hop'),1,0),
          country = ifelse(str_detect(genre, 'country'),1,0),
          folk = ifelse(str_detect(genre, 'folk'),1,0),
          adult_standards = ifelse(str_detect(genre, 'adult standards'),1,0),
          funk = ifelse(str_detect(genre, 'funk'),1,0),
          Glee_Cast = ifelse(str_detect(performer, 'Glee Cast'),1,0),
          T_Swift = ifelse(str_detect(performer, 'Taylor Swift'),1,0),
          Drake = ifelse(str_detect(performer, 'Drake'),1,0),
          Beatles = ifelse(str_detect(performer, 'The Beatles'),1,0),
          explicit = as.integer(as.logical(track_explicit))) # fixed: derive the flag from track_explicit; the earlier str_detect(genre, 'TRUE') never matched
##

# Splitting Dataset
set.seed(1031)
split = createDataPartition(y=songs_build_genre$rating,p = 0.7,list = F,groups = 100)
train = songs_build_genre[split,]
nrow(train)
test = songs_build_genre[-split,]
nrow(test)

## Trying a model
trControl=trainControl(method="cv",number=5)
tuneGrid = expand.grid(mtry=1:4, 
                       splitrule = c('variance','extratrees','maxstat'), 
                       min.node.size = c(2,5,10,15,20,25))
set.seed(617)
cvModel = train(rating~.,
                data=train[,c(5, 7:32)],
                method="ranger",
                num.trees=50,
                trControl=trControl,
                tuneGrid=tuneGrid ) # will take a while to run (computing power)
# Refit a single ranger forest with the hyperparameters chosen by cross-validation
cv_forest_ranger = ranger(rating~.,
                          data=train[,c(5, 7:32)],
                          num.trees = 50, 
                          mtry=cvModel$bestTune$mtry, 
                          min.node.size = cvModel$bestTune$min.node.size, 
                          splitrule = cvModel$bestTune$splitrule)

## Making Predictions

predPRE_test = predict(cv_forest_ranger, data= test[,c(5, 7:32)], num.trees = 50)
rmse_cv_forest_ranger_test = sqrt(mean((predPRE_test$predictions-test$rating)^2)); rmse_cv_forest_ranger_test

pred = predict(cv_forest_ranger, data= scoringData_build_genre[,c(5, 7:31)], num.trees = 50)
str(pred)

## Writing the submission file
submissionFile = data.frame(id = scoringData_build_genre$id, rating = pred$predictions)
str(submissionFile)

# row.names = F eliminates the default column R adds to number each row
write.csv(submissionFile, 'Jacob_Brandt_submission.csv', row.names = F) # ../input is read-only in Kaggle notebooks, so write to the working directory