The motivation can be found in Part 1. Credit where credit is due: dantaki was my main inspiration for this project.
In part one I created a model for episode 1. We knew that we didn’t have the full set for this episode (because of the shorter season 1) and that it doesn’t really make sense to predict the outcome after the first episode, because everyone can have a bad (or good) day once in a while. Nevertheless, the final model we got out of the episode 1 data was surprisingly accurate in predicting which baker would win and who would reach the finals.
We’re going to go through the same procedure as in part one, but now for every following episode. I won’t describe every step in as much detail as in part 1. For those interested, you can either read up on part 1 or find the code in my GitHub repo. Afterwards, we’re going to compare the episode models.
We’ll train all the episode models with a random forest classifier and use repeated cross-validation to cope with the relatively small dataset. Season 3 will be set aside for validation, and once our model is validated, we’ll retrain it on the whole set, including season 3, and prepare it for the “real life” application.
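For the curious, here is a minimal sketch of what this training loop could look like with scikit-learn. The file name bakeoff.csv and the column names (season, episode, baker, winner) are illustrative assumptions, not the repo’s exact schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

df = pd.read_csv("bakeoff.csv")  # hypothetical file name
features = [c for c in df.columns
            if c not in ("season", "episode", "baker", "winner")]

models, cv_scores = {}, {}
for ep in sorted(df["episode"].unique()):
    # one independent model per episode; season 3 is held out for validation
    train = df[(df["episode"] == ep) & (df["season"] != 3)]
    X, y = train[features], train["winner"]

    # repeated CV gives a more stable accuracy estimate on a small dataset
    cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=42)
    clf = RandomForestClassifier(n_estimators=500, random_state=42)
    cv_scores[ep] = cross_val_score(clf, X, y, cv=cv).mean()
    models[ep] = clf.fit(X, y)
```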
First, let’s check how the training accuracy behaved.
Although this looks quite good, our model is probably overfitted, since an accuracy of one is normally not a good sign. The downward jump at episode 3 probably comes from the fact that this is where we started to include season 1 in the data, and what we call episode 3 in season 1 has the same data structure as an episode 1, where we don’t have a running average over the features.
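As an aside, such running averages can be built with an expanding mean per baker. This is only a sketch under the assumed schema from above; the performance columns are hypothetical stand-ins for the real features:

```python
# Running-average features: for each baker, average the performance
# features over all episodes seen so far. In a baker's first recorded
# episode there is no history yet, so the expanding mean is just that
# episode's own values.
perf_cols = ["technical_rank", "star_baker"]  # hypothetical feature names
running = (df.sort_values("episode")
             .groupby(["season", "baker"])[perf_cols]
             .expanding()
             .mean()
             .reset_index())
```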
Let’s take a look at the accuracy when we fit the initial model to season 3, our test data.
Let’s be honest: that looks horrible. There are several reasons for this behaviour. First and foremost, as I feared, the model is heavily overfitted to the training data. Secondly, since the models are independent of each other, there is no reason to expect that they’re going to get better with every episode. They even tend to get worse, because the further we go into the season, the less data (bakers) we have. But there’s a ray of hope: we don’t care about the overall accuracy. You read correctly, we don’t care about the accuracy of our model in general. What we are interested in is predicting the most likely winner of the season and maybe also which bakers are going to make it to the finale.
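To make that concrete, here is a sketch of the metric I actually care about, continuing the assumed schema from above: per episode, does the baker with the highest predicted winning probability match the real winner?

```python
# continues the sketch above (df, models and features are defined there)
season3 = df[df["season"] == 3]

def winner_hit(model, test):
    proba = model.predict_proba(test[features])[:, 1]   # P(baker wins)
    predicted = test["baker"].iloc[proba.argmax()]      # most likely winner
    actual = test.loc[test["winner"] == 1, "baker"].iloc[0]
    return predicted == actual

for ep, model in sorted(models.items()):
    test = season3[season3["episode"] == ep]
    print(f"episode {ep}: winner predicted correctly: {winner_hit(model, test)}")
```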
So let’s take a look at how the winning probabilities of some bakers of season 3 (our test data) evolved. We’re going to look at Olivia (the winner), Antonio (runner-up), James (top tier) and Diana (bottom tier).
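One possible way to produce those probability curves from the per-episode models, continuing the sketches above:

```python
# plots one winning-probability curve per baker across the episode models
import matplotlib.pyplot as plt

for name in ["Olivia", "Antonio", "James", "Diana"]:
    probs = []
    for ep, model in sorted(models.items()):
        row = season3[(season3["episode"] == ep) & (season3["baker"] == name)]
        if row.empty:  # baker was already eliminated
            break
        probs.append(model.predict_proba(row[features])[0, 1])
    plt.plot(range(1, len(probs) + 1), probs, label=name)

plt.xlabel("episode")
plt.ylabel("predicted winning probability")
plt.legend()
plt.show()
```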
That actually looks nice. We see that Olivia and Antonio are practically constantly at the top. The drops in episodes 2 and 3 are probably due to the same reason as before: we included season 1’s data only starting from episode 3, which probably made for significant changes in the model. In these models, all of the bakers got extremely low winning probabilities.
Although our models are not that accurate on the test data, we can predict the winner quite reliably. To further improve the models, and hopefully get even better predictions for the upcoming season 5, I included the data from season 3 in the validated model. Unfortunately, these models also reach an accuracy of 1 after episode 4. But we’ll see how it works out.
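Sketched under the same assumptions as above, this retraining is the same loop as before, just without holding season 3 out:

```python
# retrain each episode model on the full dataset, season 3 included
final_models = {}
for ep in sorted(df["episode"].unique()):
    full = df[df["episode"] == ep]
    final_models[ep] = RandomForestClassifier(
        n_estimators=500, random_state=42
    ).fit(full[features], full["winner"])
```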
This is the end of part two. Starting October 3rd, the 5th season of the Great Australian Bake Off will air. I will then post a weekly update which includes the latest predictions for the winner of this year’s bake off.
All the raw data and the code can be found in this GitHub repo.