Predicting the winner of the Great Australian Bake Off using standard Data Science tools

Part ONE - getting to a model for episode 1

The motivation

Inspired by this GitHub repo by dantaki, which was sent to me by my boyfriend, who worries that my obsession with baking and everything concerning the Bake Off is getting out of hand, I decided to redo his analysis. But since there’s not much fun in redoing exactly the same thing, I decided to go with the Great Australian Bake Off instead of the Great British Bake Off.

On the pro side, this will probably give some different results, and it could be interesting to compare them afterwards. There’s also a new season starting in two weeks’ time, so it won’t be long until the algorithm can be tested in a “real life setting”. On the con side, the Australian counterpart of the show has only run for 4 seasons so far. So, setting one season aside for testing, we’ll have just the data from 3 seasons to find our algorithm.

Same as dantaki, I’ll get my data from the Wikipedia page of the Great Australian Bake Off. Up to now there have been 4 seasons. Season 1 consisted of 10 bakers and 8 episodes, while seasons 2, 3 and 4 featured 12 bakers and 10 episodes each. Since the show has evolved somewhat over time (different judges and hosts, for instance), we’ll use season 3 as the test set. The main reason for this is that I believe season 4 should be used to tune the model, since it is the “closest” to the upcoming season 5. I don’t want to use season 1 or season 2 for testing either, since season 1 in particular was a bit different from the other ones.

The approach

As mentioned, I’ll get the data from the Wikipedia page. I’ll then use the caret package to try to find a working classification algorithm that predicts the ranking of a contestant based on two things:

  • the performance in the most recent episode, and
  • the average performance across all previous episodes

So in the end, after each show, we can feed the algorithm the latest scores and hopefully get a decent prediction as to who is going to win this year’s Bake Off, and additionally who will make it to the finale.

We’ll train the model on the data from seasons 1, 2 and 4 and use season 3 as the test set. Once we’ve found the right model, we’ll retrain it using all 4 seasons, in the hope of having an even better prediction model for the upcoming season 5.

Loading and preparing the data

So, after my third batch of macarons finally succeeded today, let’s start with the data stuff; maybe THAT will work out quicker. Let’s load the data and take a look at how the tables present themselves.
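
Here’s a minimal sketch of the loading step, assuming the hand-typed table sits in a CSV file called `gabo_data.csv` (the file name is my assumption):

```r
# Read the hand-typed results table (the file name is an assumption)
bakeoff <- read.csv("gabo_data.csv", stringsAsFactors = TRUE)

# Peek at the observations for one baker
head(bakeoff[bakeoff$baker == "Nancy", ])
```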

| baker | episode | rank_in_technical | mean_rank_in_technical | star | mean_star | top | mean_top | flop | mean_flop | final_rank | final_group | X |
|-------|---------|-------------------|------------------------|------|-----------|-----|----------|------|-----------|------------|-------------|---|
| Nancy | 3 | 1 | 1.000000 | 0 | 0.0000000 | 0 | 0.0000000 | 0 | 0.0000000 | 1 | 1 | winner |
| Nancy | 4 | 7 | 4.000000 | 0 | 0.0000000 | 1 | 0.5000000 | 0 | 0.0000000 | 1 | 1 | winner |
| Nancy | 5 | 4 | 4.000000 | 0 | 0.0000000 | 0 | 0.3333333 | 0 | 0.0000000 | 1 | 1 | winner |
| Nancy | 6 | 3 | 3.750000 | 0 | 0.0000000 | 0 | 0.2500000 | 1 | 0.2500000 | 1 | 1 | winner |
| Nancy | 7 | 3 | 3.600000 | 1 | 0.2000000 | 0 | 0.2000000 | 0 | 0.2000000 | 1 | 1 | winner |
| Nancy | 8 | 2 | 3.333333 | 0 | 0.1666667 | 0 | 0.1666667 | 1 | 0.3333333 | 1 | 1 | winner |

Basically we have one observation per baker for each episode. Since I typed in the data myself, I won’t bother too much with cleaning it. Let’s look at the variables we’ve got here.

| key | value |
|-----|-------|
| baker | name of the baker |
| episode | the episode we’re looking at |
| rank_in_technical | rank the baker achieved in the technical bake |
| mean_rank_in_technical | average rank in the technical over all episodes up to and including the current one |
| star | indicates whether the baker was star baker in the current episode |
| mean_star | average of how many times the baker was star baker, up to and including the current episode |
| top | indicates whether the baker made one of the top bakes in the current episode |
| mean_top | average of how many times the baker made one of the episode’s top bakes |
| flop | indicates whether the baker made one of the least favourite bakes in the current episode |
| mean_flop | average of how many times the baker made one of the least favourite bakes |
| final_rank | the final rank the baker achieved |
| final_group | the group the baker falls into; this will be our class label |
| X | the name of the group/class |

Since we want to predict based on the most recent episode, we’ll have to create data sets per episode rather than per season. Remember that the first season had 2 bakers fewer than the other ones, and thus also 2 episodes fewer? We’ll solve this by shifting the episode numbers in that season, so that its first episode becomes episode 3.
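
Here’s a sketch of that shift and of the per-episode split, assuming the season number sits in a `season` column (the column name is mine):

```r
library(dplyr)

# Season 1 has two episodes fewer, so shift its episode numbers by 2;
# its first episode then lines up with episode 3 of the longer seasons
bakeoff <- bakeoff %>%
  mutate(episode = ifelse(season == 1, episode + 2, episode))

# One data set per episode, e.g. episode 3
episode3 <- bakeoff %>% filter(episode == 3)
```

Let’s look at the head of episode 3: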

| baker | episode | rank_in_technical | mean_rank_in_technical | star | mean_star | top | mean_top | flop | mean_flop | final_rank | final_group | X |
|-------|---------|-------------------|------------------------|------|-----------|-----|----------|------|-----------|------------|-------------|---|
| Nancy | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | winner |
| Jonathan | 3 | 6 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | runner_up |
| Maria | 3 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 2 | runner_up |
| Monique | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 3 | top_tier |
| Brendan | 3 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 3 | top_tier |
| Julie | 3 | 7 | 7 | 0 | 0 | 1 | 1 | 0 | 0 | 6 | 4 | bottom_tier |

Training the model for episode 1

Remember that we didn’t include the episodes from season 3 in our data sets. We’ll use them to test our model in the end. But since we don’t have that many data points, we’ll have to use some kind of cross-validation to tune our model before we use it on the test set. We’re going to use repeated cross-validation.

We fit a random forest model to our data from episode 1 and get the following confusion matrix.
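
Here’s a minimal sketch of the caret workflow, assuming the episode-1 rows of seasons 1, 2 and 4 sit in a data frame called `ep1_train` (my naming) and using illustrative tuning settings:

```r
library(caret)

set.seed(42)

# Repeated 10-fold cross-validation; classProbs = TRUE so that we can
# rank the bakers by their probability of winning later on
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     classProbs = TRUE)

# Random forest on the per-episode performance features,
# with the group label X as the outcome
rf_ep1 <- train(X ~ rank_in_technical + mean_rank_in_technical +
                  star + mean_star + top + mean_top + flop + mean_flop,
                data = ep1_train,
                method = "rf",
                trControl = ctrl)

# Confusion matrix of the model's predictions on the training rows
confusionMatrix(predict(rf_ep1, newdata = ep1_train), ep1_train$X)
```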

##              Reference
## Prediction    bottom_tier runner_up top_tier winner
##   bottom_tier          12         1        1      2
##   runner_up             0         2        0      0
##   top_tier              0         1        5      0
##   winner                0         0        0      0

So obviously the model is a bit conservative when it comes to predicting a winner. But in my opinion this makes sense, since we’re looking at episode 1 right now. After the first episode our data is very homogeneous: all the running-mean variables are identical to the current week’s values, and everyone, including the bakers in the Bake Off, can have a good or a bad day once in a while. Plus, don’t forget that we have one third less data for episodes 1 and 2, since they’re not present in our renumbering of season 1. Overall our model has an accuracy of 79%. That’s not perfect, obviously, but it is decent enough considering the points mentioned above, and it is way better than randomly predicting classes.

Predicting the winner

But hold on, we still want to predict a winner, right? Of course we do. Even if the model doesn’t predict one right away, we can look at the class probabilities for each of the bakers and check who has the highest probability of winning. And, spoiler alert, it turns out the model is quite good, even after just one episode.
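
With caret this just means predicting with `type = "prob"`; the names are the ones from the sketch above:

```r
# Class probabilities instead of hard class assignments
probs <- predict(rf_ep1, newdata = ep1_train, type = "prob")

# Attach the baker names and sort by the probability of winning
probs$baker <- ep1_train$baker
probs[order(-probs$winner), ]
```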

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome |
|-------|-------------|-----------|----------|--------|--------------|
| Claudia | 0.472 | 0.102 | 0.104 | 0.322 | winner |
| Sian | 0.620 | 0.028 | 0.086 | 0.266 | winner |
| Jessica | 0.620 | 0.028 | 0.086 | 0.266 | bottom_tier |
| Brendan | 0.836 | 0.044 | 0.028 | 0.092 | bottom_tier |
| Dave | 0.316 | 0.388 | 0.228 | 0.068 | runner_up |
| Peter | 0.838 | 0.014 | 0.086 | 0.062 | bottom_tier |
| Alex | 0.838 | 0.014 | 0.086 | 0.062 | bottom_tier |
| Chris | 0.338 | 0.244 | 0.374 | 0.044 | top_tier |
| Nathan | 0.676 | 0.012 | 0.274 | 0.038 | top_tier |
| Mariana | 0.808 | 0.012 | 0.144 | 0.036 | bottom_tier |
| James | 0.342 | 0.274 | 0.352 | 0.032 | top_tier |
| Barb | 0.342 | 0.274 | 0.352 | 0.032 | runner_up |
| Emma | 0.584 | 0.148 | 0.248 | 0.020 | bottom_tier |
| Marcus | 0.796 | 0.152 | 0.034 | 0.018 | bottom_tier |
| Suzy | 0.376 | 0.564 | 0.044 | 0.016 | runner_up |
| Ben | 0.620 | 0.090 | 0.276 | 0.014 | bottom_tier |
| Raeesa | 0.336 | 0.076 | 0.580 | 0.008 | top_tier |
| Robert | 0.198 | 0.042 | 0.752 | 0.008 | top_tier |
| Meg | 0.762 | 0.098 | 0.136 | 0.004 | bottom_tier |
| Max | 0.912 | 0.002 | 0.082 | 0.004 | bottom_tier |
| Jasmin | 0.546 | 0.412 | 0.040 | 0.002 | runner_up |
| Angela | 0.094 | 0.018 | 0.886 | 0.002 | top_tier |
| Janice | 0.644 | 0.320 | 0.034 | 0.002 | bottom_tier |
| Michelle | 0.688 | 0.254 | 0.058 | 0.000 | bottom_tier |

Not too bad, right? But what if we just want to predict which bakers are going to make it to the final? For that we add together the probabilities for the classes “winner” and “runner_up”. Let’s create a new column with that variable.
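
Continuing the sketch from above, that’s one extra column and a re-sort:

```r
# A finalist is either the winner or a runner-up,
# so sum those two class probabilities
probs$gets_to_final <- probs$winner + probs$runner_up
probs[order(-probs$gets_to_final), ]
```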

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome | gets_to_final |
|-------|-------------|-----------|----------|--------|--------------|---------------|
| Suzy | 0.376 | 0.564 | 0.044 | 0.016 | runner_up | 0.580 |
| Dave | 0.316 | 0.388 | 0.228 | 0.068 | runner_up | 0.456 |
| Claudia | 0.472 | 0.102 | 0.104 | 0.322 | winner | 0.424 |
| Jasmin | 0.546 | 0.412 | 0.040 | 0.002 | runner_up | 0.414 |
| Janice | 0.644 | 0.320 | 0.034 | 0.002 | bottom_tier | 0.322 |
| James | 0.342 | 0.274 | 0.352 | 0.032 | top_tier | 0.306 |
| Barb | 0.342 | 0.274 | 0.352 | 0.032 | runner_up | 0.306 |
| Sian | 0.620 | 0.028 | 0.086 | 0.266 | winner | 0.294 |
| Jessica | 0.620 | 0.028 | 0.086 | 0.266 | bottom_tier | 0.294 |
| Chris | 0.338 | 0.244 | 0.374 | 0.044 | top_tier | 0.288 |
| Michelle | 0.688 | 0.254 | 0.058 | 0.000 | bottom_tier | 0.254 |
| Marcus | 0.796 | 0.152 | 0.034 | 0.018 | bottom_tier | 0.170 |
| Emma | 0.584 | 0.148 | 0.248 | 0.020 | bottom_tier | 0.168 |
| Brendan | 0.836 | 0.044 | 0.028 | 0.092 | bottom_tier | 0.136 |
| Ben | 0.620 | 0.090 | 0.276 | 0.014 | bottom_tier | 0.104 |
| Meg | 0.762 | 0.098 | 0.136 | 0.004 | bottom_tier | 0.102 |
| Raeesa | 0.336 | 0.076 | 0.580 | 0.008 | top_tier | 0.084 |
| Peter | 0.838 | 0.014 | 0.086 | 0.062 | bottom_tier | 0.076 |
| Alex | 0.838 | 0.014 | 0.086 | 0.062 | bottom_tier | 0.076 |
| Nathan | 0.676 | 0.012 | 0.274 | 0.038 | top_tier | 0.050 |
| Robert | 0.198 | 0.042 | 0.752 | 0.008 | top_tier | 0.050 |
| Mariana | 0.808 | 0.012 | 0.144 | 0.036 | bottom_tier | 0.048 |
| Angela | 0.094 | 0.018 | 0.886 | 0.002 | top_tier | 0.020 |
| Max | 0.912 | 0.002 | 0.082 | 0.004 | bottom_tier | 0.006 |

Tadaa! The model predicts 4 of the 6 finalists correctly after just the first episode.

Applying the model to the test data

So let’s look at how our model performs on the test data from season 3.
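
Continuing the sketch, assume the held-out season 3 episode-1 rows sit in `ep1_test`:

```r
# Hard class predictions on the held-out season 3 rows
test_pred <- predict(rf_ep1, newdata = ep1_test)
confusionMatrix(test_pred, ep1_test$X)
```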

##              Reference
## Prediction    bottom_tier runner_up top_tier winner
##   bottom_tier           4         1        3      1
##   runner_up             0         0        0      0
##   top_tier              2         1        0      0
##   winner                0         0        0      0

Accuracy is down to 33%. That’s unfortunate. But still, let’s do the same exercise as before and predict the winner by looking at who has the highest probability of winning.

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome |
|-------|-------------|-----------|----------|--------|--------------|
| Olivia | 0.472 | 0.102 | 0.104 | 0.322 | winner |
| Diana | 0.620 | 0.028 | 0.086 | 0.266 | bottom_tier |
| Monica | 0.136 | 0.228 | 0.596 | 0.040 | runner_up |
| Antonio | 0.676 | 0.012 | 0.274 | 0.038 | runner_up |
| Bojan | 0.808 | 0.012 | 0.144 | 0.036 | bottom_tier |
| Janette | 0.342 | 0.274 | 0.352 | 0.032 | bottom_tier |
| Fiona | 0.792 | 0.170 | 0.010 | 0.028 | top_tier |
| Liesel | 0.584 | 0.148 | 0.248 | 0.020 | top_tier |
| James | 0.552 | 0.152 | 0.280 | 0.016 | top_tier |
| Jeremy | 0.336 | 0.076 | 0.580 | 0.008 | bottom_tier |
| Cheryl | 0.762 | 0.098 | 0.136 | 0.004 | bottom_tier |
| Noel | 0.688 | 0.254 | 0.058 | 0.000 | bottom_tier |

That looks way better. Even if the model doesn’t get the final classifications exactly right, it gives Olivia, the real winner of this season, the highest probability of winning.

So let’s also look here at the bakers with the highest probability of making it to the finale.

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome | gets_to_final |
|-------|-------------|-----------|----------|--------|--------------|---------------|
| Olivia | 0.472 | 0.102 | 0.104 | 0.322 | winner | 0.424 |
| Janette | 0.342 | 0.274 | 0.352 | 0.032 | bottom_tier | 0.306 |
| Diana | 0.620 | 0.028 | 0.086 | 0.266 | bottom_tier | 0.294 |
| Monica | 0.136 | 0.228 | 0.596 | 0.040 | runner_up | 0.268 |
| Noel | 0.688 | 0.254 | 0.058 | 0.000 | bottom_tier | 0.254 |
| Fiona | 0.792 | 0.170 | 0.010 | 0.028 | top_tier | 0.198 |
| Liesel | 0.584 | 0.148 | 0.248 | 0.020 | top_tier | 0.168 |
| James | 0.552 | 0.152 | 0.280 | 0.016 | top_tier | 0.168 |
| Cheryl | 0.762 | 0.098 | 0.136 | 0.004 | bottom_tier | 0.102 |
| Jeremy | 0.336 | 0.076 | 0.580 | 0.008 | bottom_tier | 0.084 |
| Antonio | 0.676 | 0.012 | 0.274 | 0.038 | runner_up | 0.050 |
| Bojan | 0.808 | 0.012 | 0.144 | 0.036 | bottom_tier | 0.048 |

That could obviously be better: only 1 of the 3 finalists is predicted correctly. But we’ll take a look at how the predictions change over the course of the season.

Retraining the model including Season 3

As mentioned before, we’ll retrain the model, now also using the third season. This will (hopefully) give us an even better model for the upcoming fifth season.
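
A sketch of the retraining step, reusing the names and control settings from the sketches above:

```r
# Stack the former training and test rows and refit with the same settings
ep1_all <- rbind(ep1_train, ep1_test)

set.seed(42)
rf_ep1_full <- train(X ~ rank_in_technical + mean_rank_in_technical +
                       star + mean_star + top + mean_top + flop + mean_flop,
                     data = ep1_all,
                     method = "rf",
                     trControl = ctrl)

confusionMatrix(predict(rf_ep1_full, newdata = ep1_all), ep1_all$X)
```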

##              Reference
## Prediction    bottom_tier runner_up top_tier winner
##   bottom_tier          18         1        3      1
##   runner_up             0         5        1      0
##   top_tier              0         0        5      0
##   winner                0         0        0      2

As hoped, accuracy has gone up; we are now at 83%. Let’s take a quick look at the predicted winners and finalists before we let go of episode 1 and turn to the next ones.

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome |
|-------|-------------|-----------|----------|--------|--------------|
| Claudia | 0.046 | 0.064 | 0.004 | 0.886 | winner |
| Olivia | 0.046 | 0.064 | 0.004 | 0.886 | winner |
| Sian | 0.698 | 0.000 | 0.000 | 0.302 | winner |
| Jessica | 0.698 | 0.000 | 0.000 | 0.302 | bottom_tier |
| Diana | 0.698 | 0.000 | 0.000 | 0.302 | bottom_tier |
| Brendan | 0.762 | 0.016 | 0.000 | 0.222 | bottom_tier |
| Dave | 0.022 | 0.812 | 0.122 | 0.044 | runner_up |
| Chris | 0.094 | 0.184 | 0.702 | 0.020 | top_tier |
| Monica | 0.006 | 0.810 | 0.164 | 0.020 | runner_up |
| Suzy | 0.056 | 0.860 | 0.068 | 0.016 | runner_up |
| Peter | 0.940 | 0.016 | 0.036 | 0.008 | bottom_tier |
| Alex | 0.940 | 0.016 | 0.036 | 0.008 | bottom_tier |
| Fiona | 0.332 | 0.018 | 0.644 | 0.006 | top_tier |
| Jasmin | 0.250 | 0.634 | 0.116 | 0.000 | runner_up |
| Angela | 0.152 | 0.020 | 0.828 | 0.000 | top_tier |
| James | 0.384 | 0.314 | 0.302 | 0.000 | top_tier |
| Nathan | 0.154 | 0.434 | 0.412 | 0.000 | top_tier |
| Ben | 0.850 | 0.000 | 0.150 | 0.000 | bottom_tier |
| Janice | 0.730 | 0.220 | 0.050 | 0.000 | bottom_tier |
| Meg | 0.906 | 0.018 | 0.076 | 0.000 | bottom_tier |
| Mariana | 0.908 | 0.054 | 0.038 | 0.000 | bottom_tier |
| Barb | 0.384 | 0.314 | 0.302 | 0.000 | runner_up |
| Raeesa | 0.570 | 0.000 | 0.430 | 0.000 | top_tier |
| Robert | 0.036 | 0.060 | 0.904 | 0.000 | top_tier |
| Marcus | 0.794 | 0.028 | 0.178 | 0.000 | bottom_tier |
| Michelle | 0.938 | 0.056 | 0.006 | 0.000 | bottom_tier |
| Emma | 0.512 | 0.004 | 0.484 | 0.000 | bottom_tier |
| Max | 0.794 | 0.094 | 0.112 | 0.000 | bottom_tier |
| Antonio | 0.154 | 0.434 | 0.412 | 0.000 | runner_up |
| Liesel | 0.512 | 0.004 | 0.484 | 0.000 | top_tier |
| James | 0.158 | 0.084 | 0.758 | 0.000 | top_tier |
| Noel | 0.938 | 0.056 | 0.006 | 0.000 | bottom_tier |
| Jeremy | 0.570 | 0.000 | 0.430 | 0.000 | bottom_tier |
| Bojan | 0.908 | 0.054 | 0.038 | 0.000 | bottom_tier |
| Cheryl | 0.906 | 0.018 | 0.076 | 0.000 | bottom_tier |
| Janette | 0.384 | 0.314 | 0.302 | 0.000 | bottom_tier |

Interestingly, the winners are predicted very accurately, but they are followed by some bakers who ended up in the bottom tier, rather than by the runner-ups. We can also see, though, that the bakers who finished as runners-up seem to have a high probability of ending up in that class, so let’s look at the potential finalists:

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome | gets_to_final |
|-------|-------------|-----------|----------|--------|--------------|---------------|
| Claudia | 0.046 | 0.064 | 0.004 | 0.886 | winner | 0.950 |
| Olivia | 0.046 | 0.064 | 0.004 | 0.886 | winner | 0.950 |
| Suzy | 0.056 | 0.860 | 0.068 | 0.016 | runner_up | 0.876 |
| Dave | 0.022 | 0.812 | 0.122 | 0.044 | runner_up | 0.856 |
| Monica | 0.006 | 0.810 | 0.164 | 0.020 | runner_up | 0.830 |
| Jasmin | 0.250 | 0.634 | 0.116 | 0.000 | runner_up | 0.634 |
| Nathan | 0.154 | 0.434 | 0.412 | 0.000 | top_tier | 0.434 |
| Antonio | 0.154 | 0.434 | 0.412 | 0.000 | runner_up | 0.434 |
| James | 0.384 | 0.314 | 0.302 | 0.000 | top_tier | 0.314 |

7 out of 10 is not too bad in my opinion. We’ll check how this number improves over the episodes, once we have models for all of them.

Coming up

This is the end of part one, since otherwise this post would be waaay too long. But stay tuned for the upcoming posts about the next episodes and, of course, the validation of the models in real life, once this year’s season is running. Part 2 can be found here

Code

All the raw data and the code can be found in this GitHub repo