MIDI loaded into a digital audio workstation. Note the piano rolls.
Videogame music is awesome background music for working. Think about it. It’s created specifically for when you’re doing hand-eye coordinated work on a computer. It’s entertaining enough to listen to and yet subtle enough to let you focus on something else. It’s a perfect match!
But sometimes the music is so good that it’s stuck in your head long after you’ve listened to it. “Oh no,” you say. “It’s happening again. An earworm has burrowed into my brain… blarrrrgh!”
In those cases, what helps is to grab the song’s MIDI file, load it into a Digital Audio Workstation and mess around with it to see how the song works down at the level of individual notes. You might think this is really nerdy and niche, and it is, but it works! And I’m not the only one who says so.
Finding that midi file is usually straightforward. Most of the time, the midi files for a videogame track live on a different website from the audio files. That means you’ll have to browse at least two websites for your audio+midi needs. Wouldn’t it be cool if there were a website where the midi files and the audio files were laid out side by side? For thousands and thousands of songs?
The rest of this post will assume that you answered “yes” to that question. It’s great that you did, because here’s an idea for how to make it happen. It’s so simple: take all the midi files from one website and all the audio files from another website, and match up the midi files with the audio files. Bingo bango. A database of audio files and their associated midis.
I don’t know how it will solve the problem of earworms. We’ll figure that out later. Once that database exists, we could do so much other cool stuff! For example, we could automatically generate score-following YouTube videos, or we could train machine-learning algorithms to transcribe mp3s into midis. The possibilities are really enticing. Let’s take a stab at making it a reality.
Matching up a midi with its corresponding audio file (in our case, an mp3) is easy if you only need to do it once. But doing it at scale is a tough problem. Finding each match by hand would take years. Developing an algorithm to automate the whole procedure is a better use of our time. The question is, which algorithm do we use?
Fortunately, this question has been addressed by others in the field of Music Information Retrieval, a discipline sitting at the crossroads of music and data science. There have been some recent advances in matching audio files with midis, notably the construction of the Lakh MIDI Dataset.
Inspired by this project, I familiarized myself with the code base and reproduced the methods on my own smaller dataset to assess whether I could leverage the code base for my website project. My goal was to use the author’s pretty_midi Python package to find matches in my midi-mp3 dataset using just the audio information and ignoring metadata like filenames.
And for the songs in our small dataset, we’re going with one of the all-time greatest videogame soundtracks: Final Fantasy 7.
FFVII’s Tifa reimagined by Nikolas Draper-Ivey
To assemble the small dataset, we are web-scraping 85 mp3s and 60 midis for Final Fantasy 7 from two different websites. That’s a small enough dataset that we can iterate on ideas frequently. It’s also small enough that we can hand-label some data to validate ideas. Yet, considering that matching up every midi requires making 85 * 60 = 5100 comparisons, the dataset is also large enough to reveal any major inefficiencies in our matching algorithm.
To measure how well an mp3 matches up with a midi, we’ll use the align_midi module from the pretty_midi package to generate a distance score for each midi-mp3 pair. (The align_midi module was tricky to implement with my particular laptop and Python configuration, but I found some workarounds that I will share with you all in a future post. In that same future post I will explain in more detail what the align_midi module does.)
For those of you wondering how the align_midi script works, here’s the gist of it:
The whole process is a bit like taking the piano roll representation of a midi and an audio file and trying to stretch and squish the piano rolls until they overlap as close to perfect as you can get.
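To make the stretching-and-squishing idea a bit more concrete, here is a minimal sketch of the dynamic-time-warping idea in R. This is not the actual align_midi implementation; the dtw_distance function and its inputs are hypothetical, and it only illustrates the accumulated-cost idea behind the alignment.

# A toy dynamic-time-warping distance (illustrative only; not the align_midi code).
# a and b are matrices whose rows are time frames and whose columns are features
# (think of them as two piano-roll / spectrogram chunks).
dtw_distance <- function(a, b) {
  n <- nrow(a); m <- nrow(b)
  # cost of pairing any frame of a with any frame of b (euclidean distance)
  cost <- as.matrix(dist(rbind(a, b)))[1:n, (n + 1):(n + m), drop = FALSE]
  # accumulated cost of the cheapest warping path ending at each (i, j)
  acc <- matrix(Inf, n, m)
  acc[1, 1] <- cost[1, 1]
  for (i in 1:n) {
    for (j in 1:m) {
      if (i == 1 && j == 1) next
      best_prev <- min(if (i > 1) acc[i - 1, j] else Inf,              # let b wait on a
                       if (j > 1) acc[i, j - 1] else Inf,              # let a wait on b
                       if (i > 1 && j > 1) acc[i - 1, j - 1] else Inf) # advance both
      acc[i, j] <- cost[i, j] + best_prev
    }
  }
  acc[n, m]  # total alignment cost; lower means the two sequences line up more closely
}

The actual align_midi routine works on spectrogram-style features and is considerably more sophisticated, but the distance score we use below plays the same role as this accumulated cost.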
The distance score is one of two key pieces of data we’ll use for evaluating this whole procedure. Since each midi-mp3 pair will have its own distance score, we’ll have 5100 of them by the end of the script. Then, predicting a match for each midi will simply be a process of finding the mp3 with the lowest score.
To evaluate our predictions, the second key piece of data in this experiment will be a hand-labeled dataset of matches. Being able to hand-label the dataset is one advantage of working with smaller datasets. This dataset was painstakingly created in Excel and then saved as a CSV. It serves as our “gold standard.”
Okay, now for some data cleaning and munging. Load the tidyverse; we will want its pipes and ggplots.
library(tidyverse)
And then load the data. First, load the scores dataset, which is the matrix of distance scores for all the pairs, and then load the matches dataset, which is a boolean matrix of which pairs are actual matches.
# we have a matrix of dtw scores, each score from comparing a midi with an mp3.
# we also have a boolean matrix of matches between the midis and mp3s
scores <- read.csv(file = 'scores.csv', header = TRUE, row.names = 1)
matches <- read.csv(file = 'validation_names.csv', header = TRUE, row.names = 1)
Each dataset has some dirty data that needs to be removed. In the matches dataset, there are midis for which no matching mp3 exists, so they will always produce a false positive. Let’s remove those midis from consideration. In the scores dataset, some of the midis could not be synthesized into audio, which yields nonsense values for every score in that column, so let’s remove those midis from consideration as well.
# clean the data by identifying and removing midis that didn't have
# a corresponding mp3 or midis that did not load for some reason
no_matches <- which((apply(matches, 2, sum) <= 0))
load_errors <- which((apply(scores, 2, sum) <= 0))
#remove those midis from consideration
clean_scores <- scores[,-c(no_matches, load_errors)]
clean_matches <- matches[,-c(no_matches, load_errors)]
Now that we have clean data, let’s consider whether we need to transform it in some way. Recall that the scores are the result of multiplying a chunk of a midi’s spectrogram with a corresponding chunk of an mp3’s spectrogram. Since chunks represent the same amount of time, longer songs will have more chunks. And since each multiplication of chunks produces a positive number, the scores increase monotonically with the number of chunks (i.e. the length) of each audio file.
To visualize this relationship, let’s plot each pair’s distance score against its pair size: the product of the mp3 and midi file sizes, a rough stand-in for the durations involved. Because of the hypothesized duration dependence, we should see a strong, positive relationship between pair size and distance score.
midi_sizes <- list.files("~/Python/pretty-midi/FF7",
                         full.names = TRUE,
                         pattern = "mid$") %>% file.size()
mp3_sizes <- list.files("~/Python/pretty-midi/FF7_mp3s",
                        full.names = TRUE,
                        pattern = "mp3$") %>% file.size()
size_vec <- mp3_sizes %o% midi_sizes %>%  # get square file sizes
  .[, -c(no_matches, load_errors)] %>%    # clean matrix
  as.vector()                             # convert to 1-d vector for plotting
score_vec <- clean_scores %>% as.matrix %>% as.vector
qplot(x = size_vec,
      y = score_vec,
      xlab = "Pair Size (bytes^2)",
      ylab = "Distance Score") +
  geom_smooth(method = 'lm', formula = y ~ x - 1)  # start the line from zero
And there it is. A positive relationship between the size of the two files and their distance score.
Because the algorithm always predicts the lowest scoring midi-mp3 pair, file size will confound our ability to detect good matches between midi and mp3. Shorter audio files will always be preferred even when they are not a match. That would give us terrible results. One easy fix for this confound is to simply z-score normalize the data by row and column.
#by column and row, center and scale the cleaned up scores matrix
#so that all scores are in the same range.
#the more negative the score, the better. The average score is 0.
scaled_scores <- scale(clean_scores) #column normalize
scaled_scores <- t(scale(t(scaled_scores))) #row normalize
This normalization procedure will get rid of the file size dependence so that the score reflects the closeness between two files relative to their file size. We can verify that the file size dependence is gone with the following plot:
qplot(x = size_vec,
      y = scaled_scores %>% as.vector,
      xlab = "Pair Size (bytes^2)",
      ylab = "Scaled Distance Score") +
  geom_smooth(method = 'lm', formula = y ~ x - 1)
So now with that major confounder accounted for, let’s see how our matching algorithm did.
To recap, our matching algorithm predicts, for each midi, the mp3 with the lowest scaled distance score. So let’s quickly implement this algorithm and then proceed to carefully evaluate its results to see whether it’s going to scale well. Two lines is all we need to assign our matrix of predicted matches to a predictions variable.
min_scores <- apply(scaled_scores, 2, which.min) # for each midi (column), the row index of the lowest-scoring mp3
predictions <- outer(1:nrow(scaled_scores), min_scores, "==") # TRUE where an mp3 is the predicted match for a midi
To gauge the difficulty of the matching problem we are facing, the following density plot is useful. It gives the density of scores for the matches in blue and non-matches in red.
#find the scores given to the matches
#transform to long form and store in a data frame for convenient ggplotting
scores_df <- data.frame(scores = scaled_scores %>% as.vector,
                        match = clean_matches %>% as.matrix %>% as.vector,
                        prediction = predictions %>% as.vector)

ggplot(scores_df, aes(x = scores, fill = as.logical(match))) +
  theme_minimal() +
  geom_density(alpha = 0.4) +
  labs(fill = "Is Match", x = "Scaled Distance Score")
Our task is essentially to use the scaled distance score to distinguish the blue group from the red group, i.e. midi-mp3 pairs that are correct matches from those that are non-matches. Fortunately, it appears that matches tend to be negative whereas non-matches tend to sit around zero. This trend is statistically significant when we use a t-test to check whether the mean of matched scores is less than zero at the 99% confidence level.
matched_scores <- scaled_scores[clean_matches == 1]
unmatched_scores <- scaled_scores[clean_matches == 0]
#check statistical significance
t.test(matched_scores, mu = 0, alternative = "less", conf.level = 0.99)
##
## One Sample t-test
##
## data: matched_scores
## t = -12.487, df = 55, p-value < 2.2e-16
## alternative hypothesis: true mean is less than 0
## 99 percent confidence interval:
## -Inf -1.725096
## sample estimates:
## mean of x
## -2.134712
Furthermore, if we test whether the mean of unmatched scores is different from zero at the 99% confidence level, we fail to reject. Additionally, the 99% confidence interval covers a range that is hardly different from zero.
t.test(unmatched_scores, mu = 0, alternative = "two.sided", conf.level = 0.99)
##
## One Sample t-test
##
## data: unmatched_scores
## t = 1.8172, df = 4703, p-value = 0.06925
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## -0.01062417 0.06145064
## sample estimates:
## mean of x
## 0.02541324
Taken together, the results are consistent with the notion that the scaled distance score is an accurate measure of audio file similarity. Matches tend to be negative and non-matches tend to be close to zero.
We can already see good separation between matches (blue) and non-matches (red) along the scaled distance score axis. It’s reasonable to predict that our other classifier metrics will have decent performance as well.
A confusion matrix is usually one of the first things to check when evaluating a classifier. It gives the counts of true-positives, false-positives, true-negatives, and false-negatives.
confusion_mat <- table(predictions %>% as.vector(),
                       clean_matches %>% as.matrix %>% as.vector(),
                       dnn = c("predicted", "actual"))
confusion_mat
## actual
## predicted 0 1
## FALSE 4687 17
## TRUE 17 39
According to the confusion matrix, only 56 of the 4760 pairs we considered (5100 - 4*85, after dropping the four problem midis) are actual matches, meaning that only a small fraction of pairs are true matches. We can boil these numbers down into a few common metrics for binary classifiers:
The accuracy, (TN + TP) / total, is 0.9928571.
The precision, TP / (TP + FP), is 0.6964286.
The recall, TP / (TP + FN), is 0.6964286.
The F1 score, (2 * precision * recall) / (precision + recall), is 0.6964286.
, is 0.6964286In our specific application, we want to be very sure that the midi matches up with the mp3. So we would be willing incorrectly reject actual matches if it meant we could reduce the chances of accepting a non-match. Its not clear how this trade-off would happen with our particular algorithm, but future implimentations could use a cost function that punishes false-positives more than it does false-negatives.
Something reassuring about our algorithm is that it tends to predict different matches for different midis; i.e., it does not collapse to a single prediction. Roughly 80% of our predictions are a unique midi-mp3 pair. Had we left the distance scores unscaled, our predictions would’ve collapsed onto the shortest audio file, making the same prediction for every midi file.
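As a quick sanity check on that uniqueness claim, something along these lines should work with the min_scores vector computed earlier (a one-off check, not part of the original pipeline):

# fraction of predicted mp3s that are claimed by exactly one midi
mean(table(min_scores) == 1)
# or, the fraction of distinct mp3s among all the predictions
length(unique(min_scores)) / length(min_scores)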
One of the most common methods for evaluating a classifier is by plotting its Receiver Operator Characteristic (ROC) curve and calculating the Area Underneath the Curve (AUC). The AUC is a nice measure since it tells us the prediction performance of our classifier without forcing us to choose an arbitrary threshold.
library(pROC)
scores_roc <- roc(response = scores_df$match, predictor = scores_df$scores)
plot.roc(scores_roc,
         print.auc = TRUE,
         print.thres = "best")
Here we can see that the AUC for our classifier is 0.9140321, and the “best” threshold (i.e. the threshold with the highest sum of sensitivity and specificity) is also plotted. According to our ROC curve, the best threshold is a scaled score of -1.214, which would give 4329 positive predictions. We can’t use this “best” threshold because it would devastate our classifier’s performance; instead, we leverage the knowledge that there will be a single, correct match for each midi. But it is still a useful measure of how well the scaled score can predict matches.
Now that we’ve evaluated this matching method, we should consider how well it would perform in our target conditions. That is, are there enough midi and mp3 files out there on the web to build this database at full scale?
Looking around the internet, I believe the answer is yes. For the midis, we can scrape from The Videogame Music Archive, and for the mp3s we can scrape from Retrotracks, assuming we have the proper permissions. There might even be enough metadata in the filenames and directory names to construct a gold standard to train our classifier. In other words, we could construct a dataset of matches based on file names to train our dtw algorithm to find matches based on audio.
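As a rough sketch of what that filename-based gold standard could look like, one option is to normalize the titles and pair each midi with the mp3 whose title is closest by edit distance. The directory names and the clean_title helper below are hypothetical stand-ins, not code from the project:

library(tools)  # for file_path_sans_ext

# strip extensions, lowercase, and collapse punctuation so titles are comparable
clean_title <- function(paths) {
  paths %>%
    basename() %>%
    file_path_sans_ext() %>%
    tolower() %>%
    gsub("[^a-z0-9]+", " ", .) %>%
    trimws()
}

midi_titles <- list.files("vgmusic_midis", pattern = "mid$", full.names = TRUE) %>% clean_title()
mp3_titles  <- list.files("retrotracks_mp3s", pattern = "mp3$", full.names = TRUE) %>% clean_title()

# edit distance between every midi title and every mp3 title;
# the lowest-distance mp3 becomes the provisional filename-based match
title_dist <- adist(midi_titles, mp3_titles)
filename_matches <- apply(title_dist, 1, which.min)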
Leaving aside the issue of the specific matching algorithm we use, the fact that these data exist bodes well for the general feasibility of the project.
I am confident that the dtw could produce a distance score for millions of pairs, but the rate at which it does so would be prohibitive. I estimate that finding the optimal warping for each midi-mp3 pair takes 5 seconds. For a dataset of 10,000 songs, that means evaluating 10^8 pairs, which would take about 15.85 years. Far, far too long.
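The arithmetic behind that estimate is straightforward (the 5-second figure is my rough per-pair estimate, not a careful benchmark):

pairs   <- 10000 * 10000         # every midi compared against every mp3
seconds <- pairs * 5             # ~5 seconds per dtw alignment
seconds / (60 * 60 * 24 * 365)   # roughly 15.85 years of compute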
So the distance calculations will somehow have to be optimized for speed.
As of now our classifier identifies 70% of actual matches correctly. That is a good start, but there is a hidden assumption that may have inflated the precision. When we said we would take the closest mp3 for every midi file, we assumed that every midi would have an mp3 that matches. If that weren’t true, we would increase our false positives.
In this specific case we made sure this was true – and that’s a problem. Building algorithms that have assumptions about the data informed by our knowledge of the test dataset is part of a more general problem in machine learning called data leakage.
We won’t have the luxury of hand-labeling the correct classifications for a dataset of thousands of songs. But we might be able to get close by matching song titles and then training an algorithm to use audio to get us the rest of the way.
This set of methods is a good start that could benefit from faster dtw calculations and a better classifier. Future experiments should compare classification accuracy when using metadata like song titles versus just using audio data. Thanks for reading along!