Before we get started, I’ve embedded a short playlist of some of my favorite songs to listen to in the background while reading. The video was too large to embed directly, and YouTube wasn’t an option because of copyright restrictions, so it’s hosted on Vimeo. Enjoy!
library(vembedr)
embed_url("https://vimeo.com/807585775") %>%
use_align("center") %>%
use_bs_responsive()
Project Sekai: Colorful Stage feat. Hatsune Miku is a Japanese rhythm game played on a phone or tablet. The player taps notes as they descend on the screen, building up a combo count that boosts their score as it grows. The game features five difficulty levels: Easy, Normal, Hard, Expert, and Master. Within each difficulty level, every song has a numeric rating. For example, the song “Ice Drop” is rated 7, 12, 18, 24, and 27 on its Easy, Normal, Hard, Expert, and Master difficulties, respectively.
The game also features five distinct musical units that perform most of the songs: “Leo/need” (LN), MORE MORE JUMP! (MMJ), Vivid BAD SQUAD (VBS), Wonderlands x Showtime (WXS), and 25-ji, Nightcord de. (25JI). Each group has its own aesthetic and musical style, allowing it to express its characters in a unique way. For example, MMJ is an idol group, so they perform cute, preppy music, while VBS is made up of street performers, so their songs often lean toward electronic or hip hop styles.
The game features 20 original characters and 6 Vocaloid “virtual singers”. The main goal of the franchise is to combine real singers with virtual singers, bringing that blend of music to popular culture in a way that hasn’t been done before.
The aim of this project is to develop a machine learning model that can accurately predict the number of notes in a chart on Master difficulty. We will do so by building up multiple models, feeding them data to learn with, and ultimately choosing the best one. Let’s begin!
First, we load the necessary packages and store the song data in the song_data variable. We also use the clean_names() function to make the column names more readable.
library(dplyr)
library(magrittr)
library(knitr)
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(patchwork) # for combining multiple ggplots into one figure
library(corrplot)
library(discrim)
library(kknn)
library(ISLR)
library(ISLR2)
library(glmnet)
library(modeldata)
library(naniar)
library(xgboost)
library(ranger)
library(vip)
library(rpart.plot)
library(themis)
library(janitor)
library(kableExtra)
tidymodels_prefer()
# Loading the raw data and using clean_names to format column names
song_data <- read.csv("data\\song_data.csv")
song_data <- song_data %>%
clean_names()
We’ll start by familiarizing ourselves with the data.
dim(song_data)
## [1] 282 16
We have 282 songs to work with and 16 variables. Let’s take a look at the first few songs to see what the data looks like:
song_data %>%
head() %>%
kable() %>%
kable_styling(full_width = F) %>%
scroll_box(width = "100%", height = "378px")
| number | id | name | difficulty_easy | difficulty_normal | difficulty_hard | difficulty_expert | difficulty_master | length | note_count_easy | note_count_normal | note_count_hard | note_count_expert | note_count_master | bpm | unit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Tell Your World | 5 | 10 | 16 | 22 | 26 | 123.3 | 220 | 492 | 719 | 961 | 1147 | 150 | VS |
| 2 | 2 | Roki | 7 | 11 | 17 | 24 | 28 | 107.0 | 166 | 296 | 635 | 827 | 975 | 150 | LN |
| 3 | 3 | Teo | 9 | 14 | 19 | 27 | 32 | 116.4 | 135 | 477 | 681 | 997 | 1221 | 185 | LN |
| 4 | 6 | Hibana Reloaded | 9 | 14 | 19 | 28 | 32 | 95.0 | 198 | 394 | 568 | 808 | 1060 | 200 | LN |
| 5 | 8 | Time Machine | 6 | 11 | 16 | 23 | 26 | 124.5 | 193 | 268 | 488 | 719 | 903 | 126 | LN |
| 6 | 10 | Happy Synthesizer | 6 | 11 | 18 | 24 | 28 | 133.0 | 127 | 437 | 685 | 859 | 1136 | 127 | MMJ |
We can see that each song has a rating for each of the five difficulty levels, from Easy to Master, a note count for each difficulty, a length (in seconds), and a BPM (beats per minute). Each song is also performed by one of the musical units, by the virtual singers (VS), or is marked as OTHER where neither applies.
Next, let’s check if we have any missing data:
vis_miss(song_data)
Fortunately for us, our data is 100% complete! This means we won’t have to impute any data later on. Thank you to whoever scraped all of this data manually; it must’ve been a lot of work (it was me).
Next, we’ll convert the non-numeric variables that we’ll be using in our models into factors. This makes the categories easier to work with and is necessary for the recipe we create later.
song_data <- song_data %>%
mutate(unit = factor(unit, c("VS", "LN", "MMJ", "VBS", "WXS", "25JI", "OTHER")))
Now that our data is nice and neat, we can start to graph and analyze
it using the ggplot library! First, let’s have a look at
the distributions of our song difficulties:
diff_hist_easy <- song_data %>% ggplot(aes(x = difficulty_easy)) +
geom_histogram(fill = "#00cdba") +
xlab("Easy") +
theme_bw()
diff_hist_normal <- song_data %>% ggplot(aes(x = difficulty_normal)) +
geom_histogram(fill = "#00cdba") +
xlab("Normal") +
theme_bw()
diff_hist_hard <- song_data %>% ggplot(aes(x = difficulty_hard)) +
geom_histogram(fill = "#00cdba") +
xlab("Hard") +
theme_bw()
diff_hist_expert <- song_data %>% ggplot(aes(x = difficulty_expert)) +
geom_histogram(fill = "#00cdba") +
xlab("Expert") +
scale_x_continuous(breaks = seq(21, 31, by = 2)) +
theme_bw()
diff_hist_master <- song_data %>% ggplot(aes(x = difficulty_master)) +
geom_histogram(fill = "#00cdba") +
xlab("Master") +
scale_x_continuous(breaks = seq(26, 38, by = 2)) +
theme_bw()
diff_hist_easy + diff_hist_normal + diff_hist_hard + diff_hist_expert + diff_hist_master
Even though the ranges of values are fairly small, we can see that most of the difficulties are roughly normally distributed. For Hard, Expert, and Master, there is a more apparent right skew, meaning that the data is concentrated toward the lower values. This should not affect our models, but it is worth noting. The likely reason is that songs are charted to be “playable” for most players, so higher difficulties are less common and the average player can complete most songs.
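To put a rough number on that skew, here is a quick sketch (an addition for illustration, not part of the original write-up) comparing the mean and median of each difficulty column; a mean above the median points to a tail on the right.
# Quick skew check (sketch): mean vs. median for each difficulty column
song_data %>%
  summarise(across(starts_with("difficulty_"),
                   list(mean = mean, median = median)))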
Next, let’s take a look at the distributions for the number of notes in each difficulty. Since the numbers aren’t as clear-cut as the difficulty values, we’ll use box plots:
note_count_box_easy <- song_data %>% ggplot(aes(x = factor(1), y = note_count_easy)) +
geom_boxplot(fill = "white") +
geom_jitter(aes(color = unit)) +
theme_bw() +
labs(x = "Easy", y = NULL) +
scale_color_manual(values =c("VS" = "#4455DD",
"LN" = "#4455DD",
"MMJ" = "#4455DD",
"VBS" = "#4455DD",
"WXS" = "#4455DD",
"25JI" = "#4455DD",
"OTHER"= "#4455DD")) +
theme(legend.position = "none")
note_count_box_normal <- song_data %>% ggplot(aes(x = factor(1), y = note_count_normal)) +
geom_boxplot(fill = "white") +
geom_jitter(aes(color = unit)) +
theme_bw() +
labs(x = "Normal", y = NULL) +
scale_color_manual(values =c("VS" = "#6CCB20",
"LN" = "#6CCB20",
"MMJ" = "#6CCB20",
"VBS" = "#6CCB20",
"WXS" = "#6CCB20",
"25JI" = "#6CCB20",
"OTHER"= "#6CCB20")) +
theme(legend.position = "none")
note_count_box_hard <- song_data %>% ggplot(aes(x = factor(1), y = note_count_hard)) +
geom_boxplot(fill = "white") +
geom_jitter(aes(color = unit)) +
theme_bw() +
labs(x = "Hard", y = NULL) +
scale_color_manual(values =c("VS" = "#EE1166",
"LN" = "#EE1166",
"MMJ" = "#EE1166",
"VBS" = "#EE1166",
"WXS" = "#EE1166",
"25JI" = "#EE1166",
"OTHER"= "#EE1166")) +
theme(legend.position = "none")
note_count_box_expert <- song_data %>% ggplot(aes(x = factor(1), y = note_count_expert)) +
geom_boxplot(fill = "white") +
geom_jitter(aes(color = unit)) +
theme_bw() +
labs(x = "Expert", y = NULL) +
scale_color_manual(values =c("VS" = "#FF9900",
"LN" = "#FF9900",
"MMJ" = "#FF9900",
"VBS" = "#FF9900",
"WXS" = "#FF9900",
"25JI" = "#FF9900",
"OTHER"= "#FF9900")) +
theme(legend.position = "none")
note_count_box_master <- song_data %>% ggplot(aes(x = factor(1), y = note_count_master)) +
geom_boxplot(fill = "white") +
geom_jitter(aes(color = unit)) +
theme_bw() +
labs(x = "Master", y = NULL) +
scale_color_manual(values =c("VS" = "#884499",
"LN" = "#884499",
"MMJ" = "#884499",
"VBS" = "#884499",
"WXS" = "#884499",
"25JI" = "#884499",
"OTHER"= "#884499")) +
theme(legend.position = "none")
note_count_box_easy + note_count_box_normal + note_count_box_hard + note_count_box_expert + note_count_box_master
Note: Colors are for aesthetic purposes only.
From these graphs, we can see that the note counts for Easy and Master are fairly condensed, with the majority of songs falling within a narrow range of the box plot. For Normal, Hard, and Expert, however, there is more variance in note count. Again, this ties into the idea that the majority of players play within the middle three skill ranges, so those charts have the most variety. Additionally, there are only so many (or so few) notes you can put in a song before it becomes too cluttered or too slow; because Easy and Master sit at the ends of the difficulty spectrum, they are more likely to be bound to smaller ranges.
Next, since we’ll be predicting the note count on Master, let’s check the relationship between note_count_master and bpm:
song_data %>% ggplot(aes(x = bpm, y = note_count_master)) +
geom_point(col = "#00cdba") +
scale_color_manual() +
geom_smooth(method = "lm", se = FALSE, col = "#0077DD") +
theme_bw() +
labs(x = "BPM", y = "Note Count on Master")This graph has an apparent positive correlation, meaning that as the BPM of a song increases, that song is more likely to have a higher number of notes on Master. Notably, the note counts skew higher as BPM increases. This can be due to the fact that higher BPM songs are often also higher difficulty, meaning that they will have more notes as an intended challenge.
Now let’s visualize the connection between length and note count on master:
song_data %>% ggplot(aes(x = length, y = note_count_master)) +
geom_point(col = "#00cdba") +
scale_color_manual() +
geom_smooth(method = "lm", se = FALSE, col = "#0077DD") +
theme_bw() +
labs(x = "Length (s)", y = "Note Count on Master")Similarly to the Note Count on Master vs BPM plot, the number of notes on Master increases as the length of the song increases. This should be a fairly obvious correlation. Having more time means being able to fit more notes within a chart. To address some outliers:
“Ichi Ni no San de” on Master
“Yaminabe!!!!” on Master. It’s a lot to take in.
Finally, let’s take a look at a predictor we haven’t discussed yet: unit. How much does the unit affect the number of notes in a chart? In other words, does the general style of music affect the note count?
song_data %>% ggplot(aes(x = unit, y = note_count_master, group = unit)) +
geom_boxplot() +
geom_jitter(aes(color = unit)) +
theme_bw() +
scale_color_manual(values = c("VS" = "#00cdba",
"LN" = "#4455DD",
"MMJ" = "#6CCB20",
"VBS" = "#EE1166",
"WXS" = "#FF9900",
"25JI" = "#884499",
"OTHER" = "black")) +
theme(legend.position = "none")
This graph shows a couple of interesting patterns. Let’s look at the Virtual Singer (VS) and OTHER box plots first. They have the largest boxes, indicating more variation in note counts on Master. This is because they are both miscellaneous “units”: they have no set theme of music, so they encompass everything not already covered by the five main units. They also have the highest number of Master charts over 1500 notes. Looking at the OTHER category, the highest three points are “Endmark ni Kibo Namida wo Soete”, “Don’t Fight the Music”, and “The EmpErroR”. These are special collaboration songs imported from another rhythm game, and they do not feature any singers; they were also added with the intention of being “challenge” songs. Similarly, in the VS section, there are 15 songs with 1500 or more notes. These songs have a combined average Master difficulty of 33, while the average Master difficulty overall is 29.
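If you want to double-check those averages yourself, here is a quick sketch (added for illustration) using the columns already in song_data:
# Sanity check of the quoted averages (sketch)
song_data %>%
  summarise(
    avg_master_difficulty_overall = mean(difficulty_master),
    avg_master_difficulty_vs_1500_plus = mean(difficulty_master[unit == "VS" & note_count_master >= 1500])
  )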
Next let’s look at the five main units. 25JI is the most visibly spread out, with the highest number of Master charts under 750 notes. This fits their melancholic, gloomy theme; they are usually the group that takes on slower songs. Conversely, WXS has the highest average number of notes per Master chart, and also the highest minimum. Their theme is circus-y, so they end up with a lot of the energetic, fast-paced songs.
Great! Now that we have a good understanding of the data we’re working with, it’s time to prepare for the main event: the models.
First, we’re going to split our data into two sets: training and
testing. This allows us to first train our models on some data, and
then test them on new data that the models haven’t seen before. Before we
split the data, we have to set a seed. This ensures that when we run the
code again, the data is split in the exact same way so that we can
reproduce our results. We will also stratify the data on our outcome
variable, note_count_master. This ensures that the training
and testing sets both get an even range of values. We wouldn’t want all
of the high count charts sitting in the testing set, because when it
comes to testing, the models will be less prepared to deal with
them.
set.seed(321) # 321 is the first line of the song "Float Planner"
song_data_split <- initial_split(song_data, prop = 0.75, strata = note_count_master)
song_data_train <- training(song_data_split)
song_data_test <- testing(song_data_split)
Now that our data is split, we can create the recipe for our models!
The recipe is the backbone of the project, because all of the models use
the same recipe (except for polynomial regression, which has its own
recipe). As predictors, we use all of the difficulty variables, all of
the note count variables (minus note_count_master), length,
BPM, and unit. The remaining parameters have no influence on the data
and therefore are not included in the recipe.
song_data_recipe <- recipe(note_count_master ~ difficulty_easy + difficulty_normal + difficulty_hard + difficulty_expert + difficulty_master + length + note_count_easy + note_count_normal + note_count_hard + note_count_expert + bpm + unit, data = song_data_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_scale(all_predictors()) %>%
step_center(all_predictors()) # standardizing predictors
Before we start building our models, it is very important that we first set up k-fold cross-validation. Essentially, k-fold cross-validation is the process of splitting our training data set into “folds”. Think of it like slicing a pie, where our training set is the whole pie and the slices are our folds. Each slice is independent of the others and there is no overlap, but when put together they form the whole pie again. The purpose of doing this is to further improve the way our models train. Because we are tuning multiple parameters for several of our models, splitting the training set into 10 folds gives more opportunities to test those values. It also allows us to evaluate our models at different points in their parameter tuning, so that we can pick the best one. Having multiple samples to choose from is generally better than just one!
song_data_folds <- vfold_cv(song_data_train, v = 10, strata = note_count_master)
Now it’s time to actually build our models! For this project, we’ll be fitting six models, each with its own advantages and disadvantages in prediction:
Linear regression: In this project we use multiple linear regression, which relates several input variables to our outcome variable. It assumes that each relationship is linear, and trains itself under that assumption. You may be wondering: how can a quantitative variable like note_count_master be compared to a qualitative variable like unit? This is where the step_dummy function in our recipe comes in. To use a qualitative variable in linear regression, step_dummy creates a set of binary dummy variables (0s and 1s) for the categories of that variable. For example, the unit variable has seven possible categories that an observation can fall under. Say we’re looking at three of them: LN, MMJ, and VBS. step_dummy would create two additional columns of 0s and 1s, indicating true or false for two of the three possibilities. Take the song “Flyer”, by VBS: if the extra columns are given to LN and MMJ, it would show 0 and 0, since by elimination it must be VBS. (A short sketch of this encoding appears right after this list of models.)
K-nearest neighbors: Think of this model as a set of graphs comparing variables with each other in a scatterplot, somewhat like linear regression. However, instead of relating the input and response variables through a line, we compare data points with each other. First, we set our parameter value k (or tune it over a range of values). When a new data point is tested, it is compared to its k nearest neighbors and assigned a value based on the values of those neighbors. For example, say we had a scatterplot of bpm vs length and set k = 5. When we insert a brand new data point, the model selects the five nearest points, looks at their values of note_count_master, and assigns the new point the average of those values.
Polynomial regression: This model is a bit different, because it uses its own recipe while still using the linear_reg() function. Instead of fitting a strictly first-degree (straight-line) model, it can fit any degree of polynomial, and that degree is usually tuned. This can be a significant improvement over a strictly linear model, because if the data does not follow a straight line, the polynomial fit can bend to match it. However, this also brings a risk of overfitting to the data, especially with higher-degree polynomials.
Elastic net: The elastic net model is a combination of the lasso and ridge models, which are in turn extensions of linear regression. In short, they aim to reduce the complexity of the model by shrinking coefficients. Two parameters are tuned: mixture, which controls the blend between ridge (mixture = 0) and lasso (mixture = 1), and penalty, which controls how much regularization is applied.
Random forest: The random forest model uses a setup called
decision trees to predict a variable. In a decision tree, a series of
yes/no decisions are made based on criteria for each variable. For
example, say that first a tree evaluated difficulty_master.
If it is over a certain value, the tree moves down one path. If it is
under, it moves down another path. This process continues with other
variables until a prediction is reached. Many trees (hundreds, in our case) are built per model, and their results are averaged to give a final value.
Boosted trees: This model is similar to the random forest model, but instead of building each tree independently, each new tree uses information from the previous trees to correct their errors.
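Before we build the real models, here is the promised minimal sketch of the dummy encoding described under linear regression. It uses a tiny made-up tibble (toy values, not real songs) purely to show what step_dummy() produces:
# Dummy-encoding illustration (toy data, not part of the modeling pipeline)
dummy_demo <- tibble(unit = factor(c("LN", "MMJ", "VBS")),
                     note_count_master = c(900, 1100, 1300))
recipe(note_count_master ~ unit, data = dummy_demo) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
# With three levels, two indicator columns (unit_MMJ, unit_VBS) are created;
# an LN song is the baseline, shown as 0 in both.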
Our first step is to create the initial models. In the case of polynomial regression, we first have to create its own recipe, then fit it to a linear regression model.
# Linear regression
song_data_lm <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
# K-nearest neighbors
song_data_knn <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
# Polynomial regression
song_data_recipe_poly <- song_data_recipe %>%
step_poly(length, bpm, degree = tune())
song_data_poly <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
# Elastic net
song_data_en <- linear_reg(mixture = tune(),
penalty = tune()) %>%
set_engine("glmnet") %>%
set_mode("regression")
# Random forest
song_data_rf <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# Boosted trees
song_data_bt <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")Workflows essentially put together all the necessary information required to fit and predict each model.
# Linear regression
song_data_lm_wkflow <- workflow() %>%
add_model(song_data_lm) %>%
add_recipe(song_data_recipe)
# K-nearest neighbors
song_data_knn_wkflow <- workflow() %>%
add_model(song_data_knn) %>%
add_recipe(song_data_recipe)
# Polynomial regression
song_data_poly_wkflow <- workflow() %>%
add_model(song_data_poly) %>%
add_recipe(song_data_recipe_poly)
# Elastic net
song_data_en_wkflow <- workflow() %>%
add_model(song_data_en) %>%
add_recipe(song_data_recipe)
# Random forest
song_data_rf_wkflow <- workflow() %>%
add_model(song_data_rf) %>%
add_recipe(song_data_recipe)
# Boosted trees
song_data_bt_wkflow <- workflow() %>%
add_model(song_data_bt) %>%
add_recipe(song_data_recipe)
Next, we create our tuning grids. These allow us to test a range of values for each parameter that we would like to tune. Because the linear regression model does not have any parameters to tune, we can go straight to fitting it. The tuned parameters are:
- neighbors (k) - the number of neighbors each new point is compared to
- degree - the degree of polynomial we will evaluate
- mixture - determines the type of regularized regression (0 for ridge, 1 for lasso, 0 < mixture < 1 for a mix of both)
- penalty - determines the amount of regularization applied
- mtry - the number of predictors evaluated at each split
- trees - the number of trees evaluated per model
- min_n (random forest only) - if a node has fewer data points than min_n, it will not be split
- learn_rate (boosted trees only) - a value that determines how fast the model learns
# Linear regression has no tuning parameters, so we do not need to make a grid for it
song_data_lm_fit <- fit_resamples(object = song_data_lm_wkflow, resamples = song_data_folds)
# K-nearest neighbors
song_data_knn_grid <- grid_regular(neighbors(range = c(1, 10)),
levels = 10)
# Polynomial regression
song_data_poly_grid <- grid_regular(degree(range = c(1, 10)),
levels = 10)
# Elastic net
song_data_en_grid <- grid_regular(mixture(range = c(0, 1)),
penalty(range = c(0, 1),
trans = identity_trans()),
levels = 10)
# Random forest
song_data_rf_grid <- grid_regular(mtry(range = c(1, 13)),
trees(range = c(200, 600)),
min_n(range = c(10, 20)),
levels = 5)
# Boosted trees
song_data_bt_grid <- grid_regular(mtry(range = c(1, 13)),
trees(range = c(200, 600)),
learn_rate(range = c(-10, -1)),
levels = 5)
Now that the grids are created, we can tune the models. This is by far the longest step to run because of the sheer number of models involved. Between 6 model types, 10 folds each, and 1-3 tuned parameters per model, we are fitting several thousand models in one sitting. This is why, in the next few steps, we save and load our tuned results; it means we do not have to wait ~30 minutes each time we want to run the code.
# Linear regression has nothing to tune
# K-nearest neighbors
song_data_knn_res <- tune_grid(object = song_data_knn_wkflow,
resamples = song_data_folds,
grid = song_data_knn_grid)
# Polynomial regression
song_data_poly_res <- tune_grid(object = song_data_poly_wkflow,
resamples = song_data_folds,
grid = song_data_poly_grid)
# Elastic net
song_data_en_res <- tune_grid(object = song_data_en_wkflow,
resamples = song_data_folds,
grid = song_data_en_grid)
# Random forest (125 models per fold)
song_data_rf_res <- tune_grid(object = song_data_rf_wkflow,
resamples = song_data_folds,
grid = song_data_rf_grid,
control = control_grid(verbose = TRUE))
# Boosted trees
song_data_bt_res <- tune_grid(object = song_data_bt_wkflow,
resamples = song_data_folds,
grid = song_data_bt_grid,
control = control_grid(verbose = TRUE))
Saving results:
# Linear regression has nothing to save
# K-nearest neighbors
write_rds(song_data_knn_res, file = "data\\tuned_models\\song_data_knn_res.rds")
# Polynomial regression
write_rds(song_data_poly_res, file = "data\\tuned_models\\song_data_poly_res.rds")
# Elastic net
write_rds(song_data_en_res, file = "data\\tuned_models\\song_data_en_res.rds")
# Random forest
write_rds(song_data_rf_res, file = "data\\tuned_models\\song_data_rf_res.rds")
# Boosted trees
write_rds(song_data_bt_res, file = "data\\tuned_models\\song_data_bt_res.rds")
Loading results:
# Linear regression has nothing to load
# K-nearest neighbors
song_data_knn_tuned <- read_rds("data\\tuned_models\\song_data_knn_res.rds")
# Polynomial regression
song_data_poly_tuned <- read_rds("data\\tuned_models\\song_data_poly_res.rds")
# Elastic net
song_data_en_tuned <- read_rds("data\\tuned_models\\song_data_en_res.rds")
# Random forest
song_data_rf_tuned <- read_rds("data\\tuned_models\\song_data_rf_res.rds")
# Boosted trees
song_data_bt_tuned <- read_rds("data\\tuned_models\\song_data_bt_res.rds")
Now that our models are done being tuned, we can finally see how they did!
song_data_lm_best <- select_best(song_data_lm_fit, metric = "rmse")
song_data_lm_rmse <- show_best(song_data_lm_fit, metric = "rmse")$mean
song_data_knn_best <- select_best(song_data_knn_tuned, metric = "rmse")
song_data_knn_rmse <- show_best(song_data_knn_tuned, n = 1, metric = "rmse")$mean
song_data_poly_best <- select_best(song_data_poly_tuned, metric = "rmse")
song_data_poly_rmse <- show_best(song_data_poly_tuned, n = 1, metric = "rmse")$mean
song_data_en_best <- select_best(song_data_en_tuned, metric = "rmse")
song_data_en_rmse <- show_best(song_data_en_tuned, n = 1, metric = "rmse")$mean
song_data_rf_best <- select_best(song_data_rf_tuned, metric = "rmse")
song_data_rf_rmse <- show_best(song_data_rf_tuned, n = 1, metric = "rmse")$mean
song_data_bt_best <- select_best(song_data_bt_tuned, metric = "rmse")
song_data_bt_rmse <- show_best(song_data_bt_tuned, n = 1, metric = "rmse")$mean
metrics_tibble <- tibble(Model = c("Linear Regression", "K-Nearest Neighbors", "Polynomial Regression", "Elastic Net", "Random Forest", "Boosted Trees"), RMSE = c(song_data_lm_rmse, song_data_knn_rmse, song_data_poly_rmse, song_data_en_rmse, song_data_rf_rmse, song_data_bt_rmse))
metrics_tibble %>%
arrange(RMSE) %>%
kable() %>%
kable_styling(full_width = TRUE) %>%
scroll_box(width = "100%", height = "100%")
| Model | RMSE |
|---|---|
| Elastic Net | 63.22820 |
| Polynomial Regression | 63.61429 |
| Linear Regression | 64.53431 |
| Boosted Trees | 71.76425 |
| Random Forest | 74.58363 |
| K-Nearest Neighbors | 99.06299 |
Now that our models are completed, let’s analyze them to see how
each one did individually. We use the autoplot function to
graph summaries of our models’ performance. We will be judging the
models based on their RMSE (root mean square error). It essentially
shows us the average difference between the predicted value and the
actual value. The lower the RMSE, the better. Take our elastic net model
as an example. An RMSE of 63.22820 means that on average, the model’s
prediction was about 63 notes off from the true value. Not bad,
considering the number of notes often goes into the thousands! Since the
linear regression model had no tuned parameters, there is no selection
that we need to make for it, as it only has one model.
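As a refresher on what RMSE actually computes, here is a tiny hand-rolled example on made-up numbers (a sketch, separate from the model results), compared against yardstick’s rmse_vec():
# RMSE by hand vs. yardstick (toy values, illustration only)
truth <- c(1000, 1200, 1500)
preds <- c(950, 1260, 1440)
sqrt(mean((truth - preds)^2))  # square the errors, average them, take the root
rmse_vec(truth, preds)         # same calculation via yardstick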
show_best(song_data_knn_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 rmse standard 99.1 10 8.03 Preprocessor1_Model04
autoplot(song_data_knn_tuned, metric = "rmse") + theme_bw()
In the k-nearest neighbors model, our tuned parameter was the number of neighbors. We can see that our best model had 4 neighbors. There was a significant decrease in RMSE between 1 and 2 neighbors, as well as between 2 and 3. Past 4 neighbors, the RMSE slowly began to increase. With too few neighbors, the model overfits: each prediction depends on only one or two nearby points, so it is very sensitive to noise in the training set. With too many neighbors, the opposite happens: the model averages over points that are no longer very similar to the one being predicted, smoothing away real structure. Having 4-5 neighbors appears to strike that balance for this dataset.
show_best(song_data_poly_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 7
## degree .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 rmse standard 63.6 10 3.78 Preprocessor02_Model1
autoplot(song_data_poly_tuned, metric = "rmse") + theme_bw()
For the polynomial regression model, we tuned the degree of the polynomial from 1 to 10. Our best model had degree 2, meaning our quantitative data was mostly linear, perhaps slightly quadratic, in nature. However, the degree did not seem to affect performance much until around degree 7. Even counting the linear regression RMSE (which was only about 1 above the degree-2 result), the graph stayed relatively flat until it made a large jump from degree 8 to degree 9. Past a certain point, the model becomes too complicated for the data; the best degree for our dataset is low because the underlying relationships are fairly simple.
show_best(song_data_en_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 1 rmse standard 63.2 10 3.91 Preprocessor1_Model100
autoplot(song_data_en_tuned, metric = "rmse") + theme_bw()
The elastic net model had two tuned parameters: mixture (“Proportion of Lasso Penalty”) and penalty (“Amount of Regularization”). In this case, the highest values of both performed best. In other words, the most heavily regularized model (a full lasso with the largest penalty we tried) was the most successful, suggesting that our data is well described by a relatively simple model with some coefficients shrunk toward zero.
show_best(song_data_rf_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 7 500 10 rmse standard 74.6 10 7.69 Preprocessor1_Model0…
autoplot(song_data_rf_tuned, metric = "rmse") + theme_bw()
The random forest model used three tuned parameters:
mtry, trees, and min_n. Looking at mtry first, we can see that with fewer than 4 predictors evaluated per split, the RMSE is noticeably higher; from 7 predictors onward, the results flatline. The number of trees did not
seem to affect the results, as all of the lines stayed very close to
each other. Though the difference is small, min_n did
affect the models. Having a smaller minimal node size did better than
having a larger one. In other words, the model performed better when it
required fewer data points in a node to split a tree.
show_best(song_data_bt_tuned, metric = "rmse", n = 1)
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 13 500 0.1 rmse standard 71.8 10 7.11 Preprocessor1_M…
autoplot(song_data_bt_tuned, metric = "rmse") + theme_bw()
The boosted trees model tuned mtry, trees,
and learn_rate. It appears that for this model,
mtry did not affect the results at all. This is
interesting, because it did have an effect on the random forest model.
It’s likely due to the fact that many of the predictors are similar to
each other; we have five difficulty categories and four note count
categories, so there is bound to be some correlation. The graphs may seem a little strange, because three of the five panels appear to show only a single line for the number of trees. This is most likely because the lines overlap almost exactly at those learning rates. In the fourth panel there is some visible separation between the tree counts, meaning the number of trees only starts to matter once the learning rate is high enough. Lastly, learning rate was by far the most impactful parameter. The fastest learning rate we tried being the most successful suggests that the slower rates simply did not learn enough within the allotted trees, which may also be a consequence of the (relatively) small data set.
Congratulations to elastic net #100! With an RMSE of 63.22820, it barely edged out the polynomial regression and linear regression models. Let’s analyze the most important variables for this model.
song_data_en_final <- finalize_workflow(song_data_en_wkflow, song_data_en_best)
song_data_en_final <- fit(song_data_en_final, song_data_train)
song_data_en_final %>% extract_fit_parsnip() %>%
vip(aesthetics = c(fill = "#00cdba")) +
theme_bw()
We can see that by far, note_count_expert was the most
important variable. This makes sense, as expert is the next highest
difficulty before master, meaning that they would have the strongest
correlation. The next two variables are difficulty_master
and length. Again, of the difficulty variables, difficulty_master is the most correlated with the response, so naturally it is also high in importance. Of the three variables that
were not difficulty or note count (length, BPM, and unit), length was
the most important. This makes sense logically, as having more time in a
song means more time to squeeze in notes.
The lowest importance variables were difficulty and note count on
easy and normal, as well as unit. This is understandable,
because there is less correlation as the gaps in difficulty increase.
Having more notes on easy may correlate to having more notes on master,
but when the difficulty is that far away, the assumption is less
reliable. The unit variable also did not show as much
importance. Although we could make out a couple patterns from the graph,
ultimately the units (aside from VS and OTHER) were close enough
together in range and average that it did not affect the models
significantly.
So we have our best model, now what? We can finally take it out for a
spin on our testing data! Here we use the augment()
function, which allows us to apply the model to another set of data.
We’ll create two tables, for the most and least accurate predictions.
We’ll also use the round() function to round off the values
to whole numbers. Finally, we’ll calculate the RMSE of our best model
applied to the testing set. Here goes!
song_data_en_final %>% augment(song_data_test) %>%
select(name, note_count_master, .pred) %>%
arrange(abs(note_count_master - .pred)) %>%
mutate(.pred = round(.pred)) %>%
head() %>%
kable() %>%
kable_styling(full_width = TRUE) %>%
scroll_box(width = "100%", height = "100%")
| name | note_count_master | .pred |
|---|---|---|
| the WALL | 1019 | 1020 |
| Kashika | 1186 | 1184 |
| Fixer | 1118 | 1120 |
| Unhappy Refrain | 1353 | 1349 |
| Cantarella | 1063 | 1067 |
| Stella | 1190 | 1194 |
song_data_en_final %>% augment(song_data_test) %>%
select(name, note_count_master, .pred) %>%
arrange(-abs(note_count_master - .pred)) %>%
mutate(.pred = round(.pred)) %>%
head() %>%
kable() %>%
kable_styling(full_width = TRUE) %>%
scroll_box(width = "100%", height = "100%")
| name | note_count_master | .pred |
|---|---|---|
| Yaminabe!!!! | 2015 | 1779 |
| Alter Ego | 1557 | 1411 |
| Endmark ni Kibo Namida wo Soete | 1948 | 1806 |
| Kimiiro Marine Snow | 1415 | 1547 |
| Hatsune Creation Myth | 2021 | 1900 |
| Aishite Aishite Aishite | 1173 | 1052 |
song_data_en_final %>% augment(song_data_test) %>%
rmse(truth = note_count_master, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 64.0
All in all, our model performed very well! Our final RMSE was 64.0, almost identical to our training result. Looking at our set of best-predicted songs, they were only off by a couple of notes. On the other hand, the worst-predicted songs were all off by over 100 notes. In the case of “Yaminabe!!!!”, the prediction was off by 236 notes, roughly a 12% difference. Notably, three of the six worst-predicted songs (“Yaminabe!!!!”, “Endmark ni Kibo Namida wo Soete”, and “Hatsune Creation Myth”) are outlier songs, intentionally charted to be difficult. It’s understandable that the model would be less accurate on these, as they are harder to predict.
We can also view the testing results as a scatterplot:
song_data_en_final %>% augment(song_data_test) %>%
ggplot(aes(x = note_count_master, y = .pred)) +
geom_point(col = "#00cdba") +
geom_smooth(method = "lm", se = FALSE, col = "#0077DD") +
theme_bw() +
labs(x = "Note Count on Master", y = "Predicted Value")The graph looks pretty good too! It follows the linear progression very well, with a couple clear outliers. The data points above the line mean that the model over predicted, choosing a number that was higher than the actual value. A point below the line means that the model under predicted. Overall, the graph tells us that our predictive model did a great job!
The aim of this project was to build a model that could accurately predict the number of notes in a Master difficulty chart given parameters such as difficulty rating, song length, and BPM. We tested six models: linear regression, k-nearest neighbors, polynomial regression, elastic net, random forest, and boosted trees. After tuning and comparing the models, we concluded that the elastic net model was the most successful, with an RMSE of 63.22820. Our weakest model was k-nearest neighbors, with an RMSE of 99.06299. Despite having a roughly 57% higher RMSE, the k-nearest neighbors model did not perform all that poorly. In fact, all of our models did very well! Considering that most charts have over 1000 notes on Master, getting an RMSE under 100 for every model should be considered a huge success.
New songs are added to the game each month, either commissioned from popular Vocaloid producers or adapted from older songs. There is plenty of room for our models to grow, since one of the limiting factors here was the relatively small number of songs in the dataset. If we were to continue building these models, a larger dataset would leave them better equipped to predict new values. Although we did not cover them in this class, other machine learning models such as neural networks could also be applied and might see better results.
All in all, I had a blast working on this project! Being able to apply machine learning skills to something I’m passionate about has been an amazing experience, and it was a lot of fun to actually see my models predict the data and perform well. I was able to learn a lot about machine learning, and I hope you were too!
Thank you for reading!
All data was scraped manually from the following websites by me: