PSTAT 131

Winter 23

Introduction

Before we get started, I’ve embedded a short playlist of some of my favorite songs to listen to in the background while reading. Unfortunately, the video couldn’t be embedded directly because the file was too large, and it couldn’t be hosted on YouTube because of copyright restrictions, so it’s up on Vimeo instead. Enjoy!

library(vembedr)
embed_url("https://vimeo.com/807585775") %>% 
  use_align("center") %>% 
  use_bs_responsive()

Project Sekai: Colorful Stage feat. Hatsune Miku is a Japanese rhythm game played on a phone or tablet. The player taps notes as they descend on the screen, building up a combo count; the higher the combo climbs, the faster the score grows. The game features five difficulty modes: Easy, Normal, Hard, Expert, and Master. Within each difficulty, every song has a numeric rating. For example, the song “Ice Drop” is rated 7, 12, 18, 24, and 27 on its Easy, Normal, Hard, Expert, and Master difficulties, respectively.

The game also features five distinct musical units that perform most of the songs: “Leo/need” (LN), MORE MORE JUMP! (MMJ), Vivid BAD SQUAD (VBS), Wonderlands x Showtime (WXS), and 25-ji, Nightcord de. (25JI). Each group has its own aesthetic and musical style, allowing it to express its characters in a unique way. For example, MMJ is an idol group, so they perform cute, preppy music. VBS, on the other hand, is made up of street performers, so their music leans toward electronic and hip hop styles.

The game features 20 original characters and 6 Vocaloid “virtual singers”. The main objective of the franchise is to combine real singers with virtual singers, creating music in ways that have never reached popular culture before.

The aim of this project is to develop a machine learning model that can accurately predict the number of notes in a chart on Master difficulty. We will do so by building up multiple models, feeding them data to learn with, and ultimately choosing the best one. Let’s begin!

Loading Packages and Data

First, we load the necessary packages and store the song data in the song_data variable. We also use the clean_names() function to make the predictor names more readable.

library(dplyr)
library(magrittr)
library(knitr)
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(patchwork)  # for combining plots
library(corrplot)
library(discrim)
library(kknn)
library(ISLR)
library(ISLR2)
library(glmnet)
library(modeldata)
library(naniar)
library(xgboost)
library(ranger)
library(vip)
library(rpart.plot)
library(themis)
library(janitor)
library(kableExtra)
tidymodels_prefer()

# Loading the raw data and using clean_names to format column names
song_data <- read.csv("data\\song_data.csv")
song_data <- song_data %>% 
  clean_names()

Exploratory Data Analysis (EDA)

Tidying Data

We’ll start by familiarizing ourselves with the data.

dim(song_data)
## [1] 282  16

We have 282 songs to work with and 16 variables per song. Let’s take a look at the first few songs to see what the data looks like:

song_data %>% 
  head() %>% 
  kable() %>% 
  kable_styling(full_width = F) %>% 
  scroll_box(width = "100%", height = "378px")
| number | id | name | difficulty_easy | difficulty_normal | difficulty_hard | difficulty_expert | difficulty_master | length | note_count_easy | note_count_normal | note_count_hard | note_count_expert | note_count_master | bpm | unit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Tell Your World | 5 | 10 | 16 | 22 | 26 | 123.3 | 220 | 492 | 719 | 961 | 1147 | 150 | VS |
| 2 | 2 | Roki | 7 | 11 | 17 | 24 | 28 | 107.0 | 166 | 296 | 635 | 827 | 975 | 150 | LN |
| 3 | 3 | Teo | 9 | 14 | 19 | 27 | 32 | 116.4 | 135 | 477 | 681 | 997 | 1221 | 185 | LN |
| 4 | 6 | Hibana Reloaded | 9 | 14 | 19 | 28 | 32 | 95.0 | 198 | 394 | 568 | 808 | 1060 | 200 | LN |
| 5 | 8 | Time Machine | 6 | 11 | 16 | 23 | 26 | 124.5 | 193 | 268 | 488 | 719 | 903 | 126 | LN |
| 6 | 10 | Happy Synthesizer | 6 | 11 | 18 | 24 | 28 | 133.0 | 127 | 437 | 685 | 859 | 1136 | 127 | MMJ |

We can see that each song has five difficulty levels, ranging from easy to master, a certain number of notes per difficulty, a length (in seconds), and a BPM (beats per minute). Each song is also performed by a musical unit, or marked as OTHER as necessary.

Next, let’s check if we have any missing data:

vis_miss(song_data)

Fortunately for us, our data is 100% complete! This means we won’t have to impute any data later on. Thank you to whoever scraped all of this data manually, it must’ve been a lot of work (it was me).

Next, we’ll turn any non-numeric variables that we’ll be using in our models into factors. This makes them easier to handle as categorical predictors, and it is necessary for when we create the recipe later.

song_data <- song_data %>% 
  mutate(unit = factor(unit, c("VS", "LN", "MMJ", "VBS", "WXS", "25JI", "OTHER")))

Visual EDA

Now that our data is nice and neat, we can start to graph and analyze it using the ggplot library! First, let’s have a look at the distributions of our song difficulties:

diff_hist_easy <- song_data %>% ggplot(aes(x = difficulty_easy)) + 
  geom_histogram(fill = "#00cdba") + 
  xlab("Easy") +
  theme_bw()

diff_hist_normal <- song_data %>% ggplot(aes(x = difficulty_normal)) + 
  geom_histogram(fill = "#00cdba") + 
  xlab("Normal") +
  theme_bw()

diff_hist_hard <- song_data %>% ggplot(aes(x = difficulty_hard)) + 
  geom_histogram(fill = "#00cdba") + 
  xlab("Hard") +
  theme_bw()

diff_hist_expert <- song_data %>% ggplot(aes(x = difficulty_expert)) + 
  geom_histogram(fill = "#00cdba") +
  xlab("Expert") +
  scale_x_continuous(breaks = seq(21, 31, by = 2)) +
  theme_bw()

diff_hist_master <- song_data %>% ggplot(aes(x = difficulty_master)) + 
  geom_histogram(fill = "#00cdba") + 
  xlab("Master") + 
  scale_x_continuous(breaks = seq(26, 38, by = 2)) +
  theme_bw()

diff_hist_easy + diff_hist_normal + diff_hist_hard + diff_hist_expert + diff_hist_master

Even though the ranges of values are fairly small, we can see that most of the difficulties are roughly normally distributed. For Hard, Expert, and Master, it is more apparent that the distribution has a right skew, meaning that the data is concentrated toward the lower values. This should not affect our models, but it is worth noting. The likely reason is that songs are charted to be “playable” for most players, so higher difficulties are less common and the average player can complete most songs.

Next, let’s take a look at the distributions for the number of notes in each difficulty. Since the numbers aren’t as clear-cut as the difficulty values, we’ll use box plots:

note_count_box_easy <- song_data %>% ggplot(aes(x = factor(1), y = note_count_easy)) + 
  geom_boxplot(fill = "white") + 
  geom_jitter(aes(color = unit)) +
  theme_bw() +
  labs(x = "Easy", y = NULL) +
  scale_color_manual(values =c("VS" = "#4455DD", 
                               "LN" = "#4455DD",
                               "MMJ" = "#4455DD",
                               "VBS" = "#4455DD",
                               "WXS" = "#4455DD",
                               "25JI" = "#4455DD",
                               "OTHER"= "#4455DD")) +
    theme(legend.position = "none")

note_count_box_normal <- song_data %>% ggplot(aes(x = factor(1), y = note_count_normal)) + 
  geom_boxplot(fill = "white") + 
  geom_jitter(aes(color = unit)) +
  theme_bw() +
  labs(x = "Normal", y = NULL) +
  scale_color_manual(values =c("VS" = "#6CCB20", 
                               "LN" = "#6CCB20",
                               "MMJ" = "#6CCB20",
                               "VBS" = "#6CCB20",
                               "WXS" = "#6CCB20",
                               "25JI" = "#6CCB20",
                               "OTHER"= "#6CCB20")) +
    theme(legend.position = "none")

note_count_box_hard <- song_data %>% ggplot(aes(x = factor(1), y = note_count_hard)) + 
  geom_boxplot(fill = "white") + 
  geom_jitter(aes(color = unit)) +
  theme_bw() +
  labs(x = "Hard", y = NULL) +
  scale_color_manual(values =c("VS" = "#EE1166", 
                               "LN" = "#EE1166",
                               "MMJ" = "#EE1166",
                               "VBS" = "#EE1166",
                               "WXS" = "#EE1166",
                               "25JI" = "#EE1166",
                               "OTHER"= "#EE1166")) +
    theme(legend.position = "none")

note_count_box_expert <- song_data %>% ggplot(aes(x = factor(1), y = note_count_expert)) + 
  geom_boxplot(fill = "white") + 
  geom_jitter(aes(color = unit)) +
  theme_bw() +
  labs(x = "Expert", y = NULL) +
  scale_color_manual(values =c("VS" = "#FF9900", 
                               "LN" = "#FF9900",
                               "MMJ" = "#FF9900",
                               "VBS" = "#FF9900",
                               "WXS" = "#FF9900",
                               "25JI" = "#FF9900",
                               "OTHER"= "#FF9900")) +
    theme(legend.position = "none")

note_count_box_master <- song_data %>% ggplot(aes(x = factor(1), y = note_count_master)) + 
  geom_boxplot(fill = "white") + 
  geom_jitter(aes(color = unit)) +
  theme_bw() +
  labs(x = "Master", y = NULL) +
  scale_color_manual(values =c("VS" = "#884499", 
                               "LN" = "#884499",
                               "MMJ" = "#884499",
                               "VBS" = "#884499",
                               "WXS" = "#884499",
                               "25JI" = "#884499",
                               "OTHER"= "#884499")) +
  theme(legend.position = "none")

note_count_box_easy + note_count_box_normal + note_count_box_hard + note_count_box_expert + note_count_box_master

Note: Colors are for aesthetic purposes only.

From these graphs, we can see that the note counts for Easy and Master are fairly condensed, with the majority of songs falling within a narrow range of the box plot. However, for Normal, Hard, and Expert, there is more variance in note count. Again, this ties into the idea that the majority of players play within the middle three skill ranges, so those difficulties have the most variety. Additionally, there are limits to how many or how few notes you can put in a song before it becomes too cluttered or too sparse; since Easy and Master sit at the ends of the difficulty spectrum, they are more likely to be bound to narrower ranges.

Next, since we’ll be predicting the note count on Master, let’s check the relationship between note_count_master and bpm:

song_data %>% ggplot(aes(x = bpm, y = note_count_master)) + 
  geom_point(col = "#00cdba") +
  geom_smooth(method = "lm", se = FALSE, col = "#0077DD") +
  theme_bw() +
  labs(x = "BPM", y = "Note Count on Master")

This graph shows an apparent positive correlation: as the BPM of a song increases, the song is more likely to have a higher number of notes on Master. Notably, the note counts skew higher as BPM increases. This is likely because higher-BPM songs are often also higher difficulty, meaning they have more notes as an intended challenge.
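
As a quick numerical check on this claim (a small sketch on my part, not part of the original analysis), we can compute the correlation coefficient between the two variables directly:

# Correlation between BPM and Master note count (both numeric columns)
cor(song_data$bpm, song_data$note_count_master)

A value comfortably above 0 would confirm the positive trend suggested by the fitted line.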

Now let’s visualize the connection between length and note count on master:

song_data %>% ggplot(aes(x = length, y = note_count_master)) + 
  geom_point(col = "#00cdba") +
  geom_smooth(method = "lm", se = FALSE, col = "#0077DD") +
  theme_bw() +
  labs(x = "Length (s)", y = "Note Count on Master")

Similarly to the Note Count on Master vs BPM plot, the number of notes on Master increases as the length of the song increases. This should be a fairly obvious correlation. Having more time means being able to fit more notes within a chart. To address some outliers:

  • The point near (162.5, 750) is “Ichi Ni no San de”, which is Master level 26 and has a BPM of 83. As can be seen from the chart, there are many sections with few or no notes; bars 001, 039, and 050 through the end are examples of this. The low BPM and low difficulty account for the small note count.

Ichi Ni no San de on Master

  • On the other end of the spectrum, the point near (125, 2000) is “Yaminabe!!!!”, one of the newest songs added to the game, specially commissioned to be one of the most difficult charts in the game. Despite being just over 2 minutes long, it is one of only three charts with over 2,000 notes. That’s 15.6 notes per second, for two minutes straight! “Yaminabe!!!!” has a Master difficulty of 37 (the highest possible rating) and a BPM of 237, which explains how so many notes fit into such a short time.

Yaminabe!!!! on Master. It’s a lot to take in.

Finally, let’s take a look at a predictor we haven’t discussed yet: unit. How much does the unit affect the number of notes in a chart? In other words, does the general style of music affect the note count?

song_data %>% ggplot(aes(x = unit, y = note_count_master, group = unit)) + 
  geom_boxplot() + 
  geom_jitter(aes(color = unit)) +
  theme_bw() +
  scale_color_manual(values = c("VS" = "#00cdba",
                                "LN" = "#4455DD",
                                "MMJ" = "#6CCB20",
                                "VBS" = "#EE1166",
                                "WXS" = "#FF9900",
                                "25JI" = "#884499",
                                "OTHER" = "black")) +
  theme(legend.position = "none")

This graph shows a couple of interesting patterns. Let’s look at the Virtual Singer (VS) and OTHER box plots first. They have the largest boxes, indicating more variation in Master note counts. This is because they are both miscellaneous “units”: they have no set musical theme, so they encompass everything not already covered by the five main units. They also have the highest number of Master charts with over 1,500 notes. In the OTHER category, the highest three points are “Endmark ni Kibo Namida wo Soete”, “Don’t Fight the Music”, and “The EmpErroR”. These are special collaboration songs imported from another rhythm game and do not feature any singers; they were added with the intention of being “challenge” songs. Similarly, the VS section has 15 songs with 1,500 or more notes. Those songs have an average Master difficulty of 33, while the average Master difficulty across all songs is 29.

Next, let’s look at the five main units. 25JI is the most visibly spread out, with the highest number of Master charts under 750 notes. This fits with their melancholic, gloomy theme; they are usually the group that takes on slower songs. Conversely, WXS has the highest average number of notes per Master chart, and also the highest minimum. Their theme is circus-like, so they end up with many of the energetic, fast-paced songs.
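
To put some numbers behind these observations, a small summary like the following (a sketch, not part of the original pipeline) reports the number of songs and the center and spread of Master note counts for each unit:

# Per-unit summary of Master note counts
song_data %>% 
  group_by(unit) %>% 
  summarise(songs = n(), 
            median_notes = median(note_count_master), 
            iqr_notes = IQR(note_count_master))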

Setting Up the Models

Great! Now that we have a good understanding of the data we’re working with, it’s time to prepare for the main event: the models.

Splitting the Data

First, we’re going to split our data into two sets: training and testing. This allows us to train our models on some of the data and then test them on new data the models haven’t seen before. Before we split the data, we set a seed; this ensures that when we run the code again, the data is split in exactly the same way, so our results are reproducible. We will also stratify the split on our outcome variable, note_count_master, which ensures that the training and testing sets both get an even range of values. We wouldn’t want all of the high-count charts sitting in the testing set, because the models would then be less prepared to deal with them at testing time.

set.seed(321)  # 321 is the first line of the song "Float Planner"

song_data_split <- initial_split(song_data, prop = 0.75, strata = note_count_master)
song_data_train <- training(song_data_split)
song_data_test <- testing(song_data_split)

Creating the Recipe

Now that our data is split, we can create the recipe for our models! The recipe is the backbone of the project, because all of the models share it (except for polynomial regression, which gets its own recipe). As predictors, we use all of the difficulty variables, all of the note count variables (minus note_count_master, our outcome), length, BPM, and unit. The remaining columns (the song’s number, id, and name) are identifiers with no predictive value, so they are not included in the recipe.

song_data_recipe <- recipe(note_count_master ~ difficulty_easy + difficulty_normal + difficulty_hard + difficulty_expert + difficulty_master + length + note_count_easy + note_count_normal + note_count_hard + note_count_expert + bpm + unit, data = song_data_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_scale(all_predictors()) %>% 
  step_center(all_predictors())  # standardizing predictors

K-Fold Cross Validation

Before we start building our models, it is very important that we first set up k-fold cross validation. Essentially, k-fold cross validation is the process of splitting our training data set into “folds”. Think of it like slicing a pie, where our training set is the whole pie and the slices are our folds. Each slice is independent of the others and there is no overlap, but when put together they form the whole pie again. The purpose of doing this is to further improve the way our models train. Because we are tuning multiple parameters for a couple of our models, splitting the training set into 10 folds allows for more opportunities to test those values. It also allows us to evaluate our models at different points in their parameter tuning, so that we can pick the best one. Having multiple samples to choose from is generally better than just one!

song_data_folds <- vfold_cv(song_data_train, v = 10, strata = note_count_master)

Building the Models

Now it’s time to actually build our models! For this project, we’ll be fitting six models, each of them with their own advantages and disadvantages in prediction:

  • Linear regression: In this project we use multiple linear regression, which relates multiple input variables to our outcome variable and assumes that each relationship is linear. You may be wondering: how can a quantitative variable like note_count_master be compared to a qualitative variable like unit? This is where the step_dummy function in our recipe comes in. To include a qualitative variable in a linear regression, step_dummy creates a set of binary (0/1) dummy variables for the categories of that variable. For example, suppose unit only had three possible categories: LN, MMJ, and VBS. step_dummy would create two additional columns of 0s and 1s, indicating true or false for two of the three possibilities. Take the song “Flyer” by VBS: if the extra columns are given to LN and MMJ, it would show 0 and 0, since by elimination it must be VBS (see the short sketch after this list).

  • K-nearest neighbors: Think of this model as a scatterplot of the predictor variables. Instead of relating the inputs to the response with a fitted line, we compare data points with one another. First, we set our parameter k (or tune it over a range of values). When a new data point is evaluated, it is compared to its k nearest neighbors and assigned a value based on theirs. For example, say we had a scatterplot of bpm vs length and set k = 5. When we insert a brand-new data point, the model finds the five nearest points and assigns the new point the average of their note_count_master values.

  • Polynomial regression: This model is a bit different, because it uses its own recipe while still using the linear_reg() function. Instead of fitting a straight line (a degree-1 model), it can fit a polynomial of any degree; this value is usually tuned. This can be a significant improvement over a strictly linear model, because if the data does not follow a straight line, the polynomial fit can bend to follow it. However, this also brings the risk of overfitting to the data, especially with higher-degree polynomials.

  • Elastic net: The elastic net model is a combination of the lasso and ridge models, which are both extensions of linear regression. Both aim to reduce the complexity of the model by penalizing large coefficients; ridge uses an L2 penalty and lasso an L1 penalty. The elastic net blends the two, with mixture controlling the blend and penalty controlling the amount of regularization.

  • Random forest: The random forest model uses decision trees to predict a variable. In a decision tree, a series of yes/no decisions is made based on criteria for each variable. For example, say a tree first evaluates difficulty_master: if it is over a certain value, the tree moves down one path; if it is under, it moves down another. This process continues with other variables until a prediction is made. Each tree in the forest is grown on a random sample of the training data and considers a random subset of predictors at each split; hundreds of trees are grown per model, and their results are averaged to give a final value.

  • Boosted trees: This model is similar to the random forest model, but instead of growing independent trees, each new tree is built using information from the previous ones, focusing on the errors the ensemble has made so far.
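
To make the dummy-variable idea from the linear regression bullet concrete, here is a minimal sketch (separate from the modeling pipeline) that applies step_dummy to the unit variable on its own and shows the resulting indicator columns:

# Sketch: expanding the unit factor into 0/1 indicator columns
recipe(~ unit, data = song_data) %>% 
  step_dummy(unit) %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  head()

With seven levels, step_dummy produces six 0/1 columns; a VBS song gets a 1 in its unit_VBS column and 0s elsewhere, while a VS song (the reference level) gets 0s in every column.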

  1. Selecting Models

Our first step is to create the initial models. In the case of polynomial regression, we first have to create its own recipe, then fit it to a linear regression model.

# Linear regression
song_data_lm <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

# K-nearest neighbors
song_data_knn <- nearest_neighbor(neighbors = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("regression")

# Polynomial regression
song_data_recipe_poly <- song_data_recipe %>% 
  step_poly(length, bpm, degree = tune())
song_data_poly <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

# Elastic net
song_data_en <- linear_reg(mixture = tune(),
                           penalty = tune()) %>% 
  set_engine("glmnet") %>% 
  set_mode("regression")

# Random forest
song_data_rf <- rand_forest(mtry = tune(), 
                            trees = tune(), 
                            min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

# Boosted trees
song_data_bt <- boost_tree(mtry = tune(), 
                           trees = tune(), 
                           learn_rate = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")
  2. Creating Workflows

Workflows essentially put together all the necessary information required to fit and predict each model.

# Linear regression
song_data_lm_wkflow <- workflow() %>% 
  add_model(song_data_lm) %>% 
  add_recipe(song_data_recipe)

# K-nearest neighbors
song_data_knn_wkflow <- workflow() %>% 
  add_model(song_data_knn) %>% 
  add_recipe(song_data_recipe)

# Polynomial regression
song_data_poly_wkflow <- workflow() %>% 
  add_model(song_data_poly) %>% 
  add_recipe(song_data_recipe_poly)

# Elastic net
song_data_en_wkflow <- workflow() %>% 
  add_model(song_data_en) %>% 
  add_recipe(song_data_recipe)

# Random forest
song_data_rf_wkflow <- workflow() %>% 
  add_model(song_data_rf) %>% 
  add_recipe(song_data_recipe)

# Boosted trees
song_data_bt_wkflow <- workflow() %>% 
  add_model(song_data_bt) %>% 
  add_recipe(song_data_recipe)
  3. Making Grids

Next, we create our tuning grids. These allow us to test a range of values for each parameter that we would like to tune. Because the linear regression model has no parameters to tune, we can instead go straight to fitting it.

  • K-nearest neighbors:
    • k - the number of neighbors each new point is compared to
  • Polynomial regression:
    • degree - the number of degrees of polynomials we will evaluate
  • Elastic net:
    • mixture - determines the type of regularized regression (0 for ridge, 1 for lasso, 0 < mixture < 1 for a mix of both)
    • penalty - determines the amount of regularization to affect complexity
  • Tree based models:
    • mtry - the number of parameters evaluated at each split
    • trees - the number of trees evaluated per model
    • min_n (random forest only) - the minimum number of data points a node must contain for it to be split further
    • learn_rate (boosted trees only) - a value that determines how quickly the model learns from each successive tree
# Linear regression has no tuning parameters, so we do not need to make a grid for it
song_data_lm_fit <- fit_resamples(object = song_data_lm_wkflow, resamples = song_data_folds)

# K-nearest neighbors
song_data_knn_grid <- grid_regular(neighbors(range = c(1, 10)), 
                                   levels = 10)

# Polynomial regression
song_data_poly_grid <- grid_regular(degree(range = c(1, 10)), 
                                    levels = 10)

# Elastic net
song_data_en_grid <- grid_regular(mixture(range = c(0, 1)),
                                  penalty(range = c(0, 1),
                                          trans = identity_trans()),
                                  levels = 10)

# Random forest
song_data_rf_grid <- grid_regular(mtry(range = c(1, 13)), 
                                  trees(range = c(200, 600)), 
                                  min_n(range = c(10, 20)), 
                                  levels = 5)

# Boosted trees
song_data_bt_grid <- grid_regular(mtry(range = c(1, 13)), 
                                  trees(range = c(200, 600)), 
                                  learn_rate(range = c(-10, -1)), 
                                  levels = 5)
  4. Tuning and Fitting Models

Now that the grids are created, we can tune the models. This is by far the longest step to run because of the sheer number of model fits. Between six model types, 10 folds each, and one to three tuned parameters per model, we are fitting several thousand models in one sitting (a rough count is sketched below). This is why, in the next few steps, we save and load our results, so we do not have to wait ~30 minutes every time we want to run the code.
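
As a rough sanity check on that count (based on the grids defined above, with 10 folds each), we can tally the fits:

# Approximate number of model fits across all grids (combinations x 10 folds)
# lm: 1, knn: 10, poly: 10, elastic net: 10 x 10, rf: 5^3, bt: 5^3 combinations
(1 + 10 + 10 + 100 + 125 + 125) * 10  # roughly 3,700 fits

So “several thousand” is no exaggeration.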

# Linear regression has nothing to tune

# K-nearest neighbors
song_data_knn_res <- tune_grid(object = song_data_knn_wkflow, 
                               resamples = song_data_folds, 
                               grid = song_data_knn_grid)

# Polynomial regression
song_data_poly_res <- tune_grid(object = song_data_poly_wkflow,
                                resamples = song_data_folds,
                                grid = song_data_poly_grid)

# Elastic net
song_data_en_res <- tune_grid(object = song_data_en_wkflow,
                              resamples = song_data_folds,
                              grid = song_data_en_grid)

# Random forest (125 models per fold)
song_data_rf_res <- tune_grid(object = song_data_rf_wkflow,
                              resamples = song_data_folds,
                              grid = song_data_rf_grid,
                              control = control_grid(verbose = TRUE))

# Boosted trees
song_data_bt_res <- tune_grid(object = song_data_bt_wkflow,
                              resamples = song_data_folds,
                              grid = song_data_bt_grid,
                              control = control_grid(verbose = TRUE))
  5. Saving and Loading Results

Saving results:

# Linear regression has nothing to save

# K-nearest neighbors
write_rds(song_data_knn_res, file = "data\\tuned_models\\song_data_knn_res.rds")

# Polynomial regression
write_rds(song_data_poly_res, file = "data\\tuned_models\\song_data_poly_res.rds")

# Elastic net
write_rds(song_data_en_res, file = "data\\tuned_models\\song_data_en_res.rds")

# Random forest
write_rds(song_data_rf_res, file = "data\\tuned_models\\song_data_rf_res.rds")

# Boosted trees
write_rds(song_data_bt_res, file = "data\\tuned_models\\song_data_bt_res.rds")

Loading results:

# Linear regression has nothing to load

# K-nearest neighbors
song_data_knn_tuned <- read_rds("data\\tuned_models\\song_data_knn_res.rds")

# Polynomial regression
song_data_poly_tuned <- read_rds("data\\tuned_models\\song_data_poly_res.rds")

# Elastic net
song_data_en_tuned <- read_rds("data\\tuned_models\\song_data_en_res.rds")

# Random forest
song_data_rf_tuned <- read_rds("data\\tuned_models\\song_data_rf_res.rds")

# Boosted trees
song_data_bt_tuned <- read_rds("data\\tuned_models\\song_data_bt_res.rds")
  6. Collecting Metrics

Now that our models are done being tuned, we can finally see how they did!

song_data_lm_best <- select_best(song_data_lm_fit, metric = "rmse")
song_data_lm_rmse <- show_best(song_data_lm_fit, metric = "rmse")$mean

song_data_knn_best <- select_best(song_data_knn_tuned, metric = "rmse")
song_data_knn_rmse <- show_best(song_data_knn_tuned, n = 1, metric = "rmse")$mean

song_data_poly_best <- select_best(song_data_poly_tuned, metric = "rmse")
song_data_poly_rmse <- show_best(song_data_poly_tuned, n = 1, metric = "rmse")$mean

song_data_en_best <- select_best(song_data_en_tuned, metric = "rmse")
song_data_en_rmse <- show_best(song_data_en_tuned, n = 1, metric = "rmse")$mean

song_data_rf_best <- select_best(song_data_rf_tuned, metric = "rmse")
song_data_rf_rmse <- show_best(song_data_rf_tuned, n = 1, metric = "rmse")$mean

song_data_bt_best <- select_best(song_data_bt_tuned, metric = "rmse")
song_data_bt_rmse <- show_best(song_data_bt_tuned, n = 1, metric = "rmse")$mean

metrics_tibble <- tibble(Model = c("Linear Regression", "K-Nearest Neighbors", "Polynomial Regression", "Elastic Net", "Random Forest", "Boosted Trees"), RMSE = c(song_data_lm_rmse, song_data_knn_rmse, song_data_poly_rmse, song_data_en_rmse, song_data_rf_rmse, song_data_bt_rmse))

metrics_tibble %>% 
  arrange(RMSE) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE) %>% 
  scroll_box(width = "100%", height = "100%")
| Model | RMSE |
|---|---|
| Elastic Net | 63.22820 |
| Polynomial Regression | 63.61429 |
| Linear Regression | 64.53431 |
| Boosted Trees | 71.76425 |
| Random Forest | 74.58363 |
| K-Nearest Neighbors | 99.06299 |

Analysis and Model Selection

Now that our models are complete, let’s analyze them to see how each one did individually. We use the autoplot function to graph summaries of our models’ performance. We will be judging the models based on their RMSE (root mean squared error), which tells us the typical difference between the predicted value and the actual value; the lower the RMSE, the better. Take our elastic net model as an example: an RMSE of 63.22820 means that, on average, the model’s prediction was about 63 notes off from the true value. Not bad, considering note counts often run into the thousands! Since the linear regression model had no tuned parameters, there is no selection to make for it, as it only has one model.
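
For reference, RMSE is the square root of the average squared difference between predicted and true values. In R terms (a quick sketch with hypothetical vectors actual and predicted):

# Root mean squared error: typical prediction error, in the units of the outcome
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

The rmse() metric used throughout (from yardstick, part of tidymodels) computes this same quantity.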

K-Nearest Neighbors

show_best(song_data_knn_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 7
##   neighbors .metric .estimator  mean     n std_err .config              
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1         4 rmse    standard    99.1    10    8.03 Preprocessor1_Model04
autoplot(song_data_knn_tuned, metric = "rmse") + theme_bw()

In the k-nearest neighbors model, our tuned parameter was the number of neighbors, and our best model used 4. There was a significant decrease in RMSE from 1 to 2 neighbors, and again from 2 to 3. Past 4 neighbors, the RMSE slowly began to increase. With too few neighbors, each prediction relies on only one or two reference points, so the model chases the noise in the training data. With too many neighbors, the opposite happens: predictions are averaged over points that are no longer very similar to the new song, smoothing away real differences and making it harder to predict new data accurately. Having 4-5 neighbors gives enough reference points without washing out the signal.

Polynomial Regression

show_best(song_data_poly_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 7
##   degree .metric .estimator  mean     n std_err .config              
##    <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1      2 rmse    standard    63.6    10    3.78 Preprocessor02_Model1
autoplot(song_data_poly_tuned, metric = "rmse") + theme_bw()

For the polynomial regression model, we tuned the polynomial degree from 1 to 10. Our best model had degree 2, which suggests our quantitative predictors are mostly linear, possibly mildly quadratic, in their relationship to the outcome. However, degree did not seem to matter much until around degree 7: even including the linear regression RMSE (which was only about 1 above the degree-2 value), the curve stayed relatively flat until a large jump from degree 8 to degree 9. This happens because, past a certain point, the model becomes too complicated for the data; the best degree for our dataset is low because the underlying relationships are fairly simple.

Elastic Net

show_best(song_data_en_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 8
##   penalty mixture .metric .estimator  mean     n std_err .config               
##     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
## 1       1       1 rmse    standard    63.2    10    3.91 Preprocessor1_Model100
autoplot(song_data_en_tuned, metric = "rmse") + theme_bw()

The elastic net model had two tuned parameters: mixture (“Proportion of Lasso Penalty”) and penalty (“Amount of Regularization”). In this case, the highest values of both performed best: the winning model was a pure lasso with the largest penalty we tested. In other words, the most heavily regularized (and therefore simplest) model was the most successful, which suggests that a relatively small set of predictors explains most of the variation in our data.

Random Forest

show_best(song_data_rf_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     7   500    10 rmse    standard    74.6    10    7.69 Preprocessor1_Model0…
autoplot(song_data_rf_tuned, metric = "rmse") + theme_bw()

The random forest model tuned three parameters: mtry, trees, and min_n. Looking at mtry first, we can see that with fewer than 4 predictors considered per split, the RMSE is noticeably higher; from 7 predictors onward, the results flatline. The number of trees did not seem to affect the results, as all of the lines stayed very close to each other. Though the difference is small, min_n did affect the models: a smaller minimum node size did better than a larger one. In other words, the model performed better when nodes were allowed to keep splitting down to fewer data points.

Boosted Trees

show_best(song_data_bt_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 9
##    mtry trees learn_rate .metric .estimator  mean     n std_err .config         
##   <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1    13   500        0.1 rmse    standard    71.8    10    7.11 Preprocessor1_M…
autoplot(song_data_bt_tuned, metric = "rmse") + theme_bw()

The boosted trees model tuned mtry, trees, and learn_rate. It appears that for this model, mtry did not affect the results at all. This is interesting, because it did have an effect on the random forest model; it is likely because many of the predictors are similar to each other (we have five difficulty variables and four note count variables, so there is bound to be some correlation among them). The graphs may look a little strange, because three of the five panels show only one line for the number of trees; this is most likely because the lines for different tree counts overlap almost exactly at those learning rates. In the fourth panel, there is some visible separation between tree counts, where averaging over more trees helped. Lastly, the learning rate was by far the most impactful parameter. The fastest learning rate being the most successful suggests the predictors are strongly related to the outcome, and may also reflect our (relatively) small dataset.

Choosing the Best Model

Congratulations to elastic net #100! With an RMSE of 63.22820, it barely edged out the polynomial regression and linear regression models. Let’s analyze the most important variables for this model.

song_data_en_final <- finalize_workflow(song_data_en_wkflow, song_data_en_best)
song_data_en_final <- fit(song_data_en_final, song_data_train)

song_data_en_final %>% extract_fit_parsnip() %>% 
  vip(aesthetics = c(fill = "#00cdba")) +
  theme_bw()

We can see that by far, note_count_expert was the most important variable. This makes sense, as expert is the next highest difficulty before master, meaning that they would have the strongest correlation. The next two variables are difficulty_master and length. Again, difficulty_master is the most correlated to the response variable of the difficulty variables, so naturally it is also high in importance. Of the three variables that were not difficulty or note count (length, BPM, and unit), length was the most important. This makes sense logically, as having more time in a song means more time to squeeze in notes.

The lowest importance variables were difficulty and note count on easy and normal, as well as unit. This is understandable, because there is less correlation as the gaps in difficulty increase. Having more notes on easy may correlate to having more notes on master, but when the difficulty is that far away, the assumption is less reliable. The unit variable also did not show as much importance. Although we could make out a couple patterns from the graph, ultimately the units (aside from VS and OTHER) were close enough together in range and average that it did not affect the models significantly.

Applying the Model

So we have our best model, now what? We can finally take it out for a spin on our testing data! Here we use the augment() function, which allows us to apply the model to another set of data. We’ll create two tables, for the most and least accurate predictions. We’ll also use the round() function to round off the values to whole numbers. Finally, we’ll calculate the RMSE of our best model applied to the testing set. Here goes!

song_data_en_final %>% augment(song_data_test) %>% 
  select(name, note_count_master, .pred) %>% 
  arrange(abs(note_count_master - .pred)) %>% 
  mutate(.pred = round(.pred)) %>% 
  head() %>%
  kable() %>% 
  kable_styling(full_width = TRUE) %>% 
  scroll_box(width = "100%", height = "100%")
| name | note_count_master | .pred |
|---|---|---|
| the WALL | 1019 | 1020 |
| Kashika | 1186 | 1184 |
| Fixer | 1118 | 1120 |
| Unhappy Refrain | 1353 | 1349 |
| Cantarella | 1063 | 1067 |
| Stella | 1190 | 1194 |
song_data_en_final %>% augment(song_data_test) %>% 
  select(name, note_count_master, .pred) %>% 
  arrange(-abs(note_count_master - .pred)) %>% 
  mutate(.pred = round(.pred)) %>% 
  head() %>% 
  kable() %>% 
  kable_styling(full_width = TRUE) %>% 
  scroll_box(width = "100%", height = "100%")
| name | note_count_master | .pred |
|---|---|---|
| Yaminabe!!!! | 2015 | 1779 |
| Alter Ego | 1557 | 1411 |
| Endmark ni Kibo Namida wo Soete | 1948 | 1806 |
| Kimiiro Marine Snow | 1415 | 1547 |
| Hatsune Creation Myth | 2021 | 1900 |
| Aishite Aishite Aishite | 1173 | 1052 |
song_data_en_final %>% augment(song_data_test) %>% 
  rmse(truth = note_count_master, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        64.0

All in all, our model performed very well! Our final RMSE was 64.0, almost identical to our training RMSE. Looking at our best-predicted songs, they were only off by a few notes. On the other hand, the worst-predicted songs were all off by over 100 notes. In the case of “Yaminabe!!!!”, the prediction was off by 236 notes, about a 12.44% difference. Notably, three of the six worst-predicted songs (“Yaminabe!!!!”, “Endmark ni Kibo Namida wo Soete”, and “Hatsune Creation Myth”) are outlier songs, intentionally charted to be difficult. It’s understandable that the model would be less accurate on these, as they are harder to predict.
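
As an aside, that 12.44% figure matches a percent difference taken relative to the mean of the two values (my assumption about how it was computed; shown here only as a quick check):

# Percent difference between the actual (2015) and predicted (1779) note counts
abs(2015 - 1779) / mean(c(2015, 1779)) * 100  # about 12.44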

We can also view the testing results as a scatterplot:

song_data_en_final %>% augment(song_data_test) %>% 
  ggplot(aes(x = note_count_master, y = .pred)) +
  geom_point(col = "#00cdba") +
  geom_smooth(method = "lm", se = FALSE, col = "#0077DD") +
  theme_bw() +
  labs(x = "Note Count on Master", y = "Predicted Value")

The graph looks pretty good too! The points follow the linear trend closely, with a couple of clear outliers. Data points above the line mean that the model over-predicted, choosing a number higher than the actual value; points below the line mean it under-predicted. Overall, the graph tells us that our predictive model did a great job!

Conclusion

The aim of this project was to build a model that could accurately predict the number of notes in a master difficulty chart when given parameters such as difficulty value, song length, and BPM. We tested six models: linear regression, k-nearest neighbors, polynomial regression, elastic net, random forest, and boosted trees. After running and applying the models, we were able to conclude that the elastic net model was the most successful with an RMSE of 63.22820. Our weakest model ended up being k-nearest neighbors, with an RMSE of 99.06299. Despite having an approximately 50% higher RMSE, the k-nearest neighbors model did not perform all that poorly. In fact, all of our models did very well! Considering the fact that most charts have >1000 notes on master, getting under 100 RMSE for all of our models should be considered a huge success.

New songs are added to the game each month, either commissioned from popular Vocaloid producers or adapted from older songs. There is only room to grow for our models, since one of the limiting factors was the relatively small number of songs in the dataset. If we were to continue building these models in the future, they would be better equipped to predict values given a bigger dataset. Although we did not cover them in this class, other machine learning methods such as naive Bayes or neural networks could be applied and might see better results.

All in all, I had a blast working on this project! Being able to apply machine learning skills to something I’m passionate about has been an amazing experience, and it was a lot of fun to actually see my models predict the data and perform well. I was able to learn a lot about machine learning, and I hope you were too!

Thank you for reading!