Predicting the winner of the Great Australian Bake Off using standard Data Science tools

Part ONE - getting to a model for episode 1

The motivation

Inspired by this GitHub repo by dantaki, which was sent to me by my boyfriend, who worries that my obsession with baking and everything concerning the Bake Off is getting out of hand, I decided to redo his analysis. But since there’s not much fun in redoing exactly the same thing, I decided to go with the Great Australian Bake Off instead of the Great British Bake Off.

On the pro side, this will probably give some different results, and it could be interesting to compare them afterwards. There’s also a new season starting in two weeks’ time, so it won’t be long until the algorithm can be tested in a “real life setting”. On the con side, the Australian counterpart of the show has only run for 4 seasons so far. So, setting one season aside for testing, we’ll have just the data from 3 seasons to find our algorithm.

Same as dantaki, I’ll get my data from the Wikipedia page of the Great Australian Bake Off. Up to now there have been 4 seasons. Season 1 consisted of 10 bakers and 8 episodes, while seasons 2, 3 and 4 featured 12 bakers and 10 episodes each. Since the show has evolved somewhat over time (different judges and hosts, for instance), we’ll use season 3 as the test set. The main reason for this is that I believe season 4 should be used to tune the model, since it is the “closest” to the upcoming season 5. I don’t want to use season 1 or season 2 for testing either, since season 1 in particular was a bit different from the other ones.

The approach

As mentioned, I’ll get the data from the Wikipedia page. I’ll then use the caret package to try to find a working classification algorithm that predicts the ranking of a contestant based on two things:

  • the performance in the most recent episode, and
  • the average performance across all previous episodes

So in the end, after each show, we can feed the algorithm the latest scores and hopefully get a decent prediction as to who is going to win this year’s Bake Off, and additionally who will make it to the finale.

We’ll train the model on the data from seasons 1, 2 and 4 and use season 3 as the test set. Once we’ve found the right model, we’ll retrain it using all 4 seasons, in the hope of having an even better prediction model for the upcoming season 5.

Loading and preparing the data

So, after my third batch of macarons finally succeeded today, let’s start with the data stuff; maybe THAT will work out quicker. Let’s load the data and take a look at how the tables present themselves.
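
Here’s a minimal sketch of the loading step, assuming the hand-typed table sits in a CSV file called `gabo_data.csv` (the file name is my assumption):

```r
# Read the hand-typed results table (the file name is an assumption)
bakeoff <- read.csv("gabo_data.csv", stringsAsFactors = TRUE)

# Peek at the observations for one baker
head(bakeoff[bakeoff$baker == "Nancy", ])
```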

| baker | episode | rank_in_technical | mean_rank_in_technical | star | mean_star | top | mean_top | flop | mean_flop | final_rank | final_group | X |
|-------|---------|-------------------|------------------------|------|-----------|-----|----------|------|-----------|------------|-------------|---|
| Nancy | 3 | 1 | 1.000000 | 0 | 0.0000000 | 0 | 0.0000000 | 0 | 0.0000000 | 1 | 1 | winner |
| Nancy | 4 | 7 | 4.000000 | 0 | 0.0000000 | 1 | 0.5000000 | 0 | 0.0000000 | 1 | 1 | winner |
| Nancy | 5 | 4 | 4.000000 | 0 | 0.0000000 | 0 | 0.3333333 | 0 | 0.0000000 | 1 | 1 | winner |
| Nancy | 6 | 3 | 3.750000 | 0 | 0.0000000 | 0 | 0.2500000 | 1 | 0.2500000 | 1 | 1 | winner |
| Nancy | 7 | 3 | 3.600000 | 1 | 0.2000000 | 0 | 0.2000000 | 0 | 0.2000000 | 1 | 1 | winner |
| Nancy | 8 | 2 | 3.333333 | 0 | 0.1666667 | 0 | 0.1666667 | 1 | 0.3333333 | 1 | 1 | winner |

Basically we have one observation per baker for each episode. Since I typed in the data myself, I won’t bother too much with cleaning it. Let’s look at the variables we’ve got here.

| key | value |
|-----|-------|
| baker | name of the baker |
| episode | the episode we’re looking at |
| rank_in_technical | rank the baker achieved in the technical bake |
| mean_rank_in_technical | average rank in the technical over all episodes up to and including the current one |
| star | indicates whether the baker was star baker in the current episode |
| mean_star | average of how many times the baker was star baker, up to and including the current episode |
| top | indicates whether the baker made one of the top bakes in the current episode |
| mean_top | average of how many times the baker made one of the episode’s top bakes |
| flop | indicates whether the baker made one of the least favourite bakes in the current episode |
| mean_flop | average of how many times the baker made one of the least favourite bakes |
| final_rank | the final rank the baker achieved |
| final_group | the group the baker falls into; this will be our class label |
| X | the name of the group/class |

Since we want to predict based on the most recent episode, we’ll have to create data sets per episode rather than per season. Remember that the first season had 2 bakers fewer than the other ones, and thus also 2 episodes fewer? We’ll solve this by shifting the episode numbers in that season, so that its first episode becomes episode 3.
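
Here’s a sketch of that shift and of the per-episode split, assuming the season number sits in a `season` column (the column name is mine):

```r
library(dplyr)

# Season 1 has two episodes fewer, so shift its episode numbers by 2;
# its first episode then lines up with episode 3 of the longer seasons
bakeoff <- bakeoff %>%
  mutate(episode = ifelse(season == 1, episode + 2, episode))

# One data set per episode, e.g. episode 3
episode3 <- bakeoff %>% filter(episode == 3)
```

Let’s look at the head of episode 3: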

| baker | episode | rank_in_technical | mean_rank_in_technical | star | mean_star | top | mean_top | flop | mean_flop | final_rank | final_group | X |
|-------|---------|-------------------|------------------------|------|-----------|-----|----------|------|-----------|------------|-------------|---|
| Nancy | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | winner |
| Jonathan | 3 | 6 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | runner_up |
| Maria | 3 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 2 | runner_up |
| Monique | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 3 | top_tier |
| Brendan | 3 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 3 | top_tier |
| Julie | 3 | 7 | 7 | 0 | 0 | 1 | 1 | 0 | 0 | 6 | 4 | bottom_tier |

Training the model for episode 1

Remember that we didn’t include the episodes from season 3 in our data sets. We’ll use them to test our model in the end. But since we don’t have that many data points, we’ll have to use some kind of cross-validation to tune our model before we use it on the test set. We’re going to use repeated cross-validation.

We fit a random forest model to our data from episode 1 and get the following confusion matrix.
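
Here’s a minimal sketch of the caret workflow, assuming the episode-1 rows of seasons 1, 2 and 4 sit in a data frame called `ep1_train` (my naming) and using illustrative tuning settings:

```r
library(caret)

set.seed(42)

# Repeated 10-fold cross-validation; classProbs = TRUE so that we can
# rank the bakers by their probability of winning later on
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     classProbs = TRUE)

# Random forest on the per-episode performance features,
# with the group label X as the outcome
rf_ep1 <- train(X ~ rank_in_technical + mean_rank_in_technical +
                  star + mean_star + top + mean_top + flop + mean_flop,
                data = ep1_train,
                method = "rf",
                trControl = ctrl)

# Confusion matrix of the model's predictions on the training rows
confusionMatrix(predict(rf_ep1, newdata = ep1_train), ep1_train$X)
```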

##              Reference
## Prediction    bottom_tier runner_up top_tier winner
##   bottom_tier          12         1        1      2
##   runner_up             0         2        0      0
##   top_tier              0         1        5      0
##   winner                0         0        0      0

So obviously the model is a bit conservative when it comes to predicting a winner. But in my opinion this makes sense, since we’re looking at episode 1 right now. After the first episode our data is very homogeneous: all the running-mean variables are identical to the current week’s values, and everyone, including the bakers in the Bake Off, can have a good or a bad day once in a while. Plus, don’t forget that we have one third less data for episodes 1 and 2, since they’re not present in our renumbering of season 1. Overall our model has an accuracy of 79%. That’s not perfect, obviously, but it is decent enough considering the points mentioned above, and it is way better than randomly predicting classes.

Predicting the winner

But hold on, we still want to predict a winner, right? Of course we do. Even if the model doesn’t predict one right away, we can look at the class probabilities for each of the bakers and check who has the highest probability of winning. And, spoiler alert, it turns out the model is quite good, even after just one episode.
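
With caret this just means predicting with `type = "prob"`; the names are the ones from the sketch above:

```r
# Class probabilities instead of hard class assignments
probs <- predict(rf_ep1, newdata = ep1_train, type = "prob")

# Attach the baker names and sort by the probability of winning
probs$baker <- ep1_train$baker
probs[order(-probs$winner), ]
```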

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome |
|-------|-------------|-----------|----------|--------|--------------|
| Claudia | 0.472 | 0.102 | 0.104 | 0.322 | winner |
| Sian | 0.620 | 0.028 | 0.086 | 0.266 | winner |
| Jessica | 0.620 | 0.028 | 0.086 | 0.266 | bottom_tier |
| Brendan | 0.836 | 0.044 | 0.028 | 0.092 | bottom_tier |
| Dave | 0.316 | 0.388 | 0.228 | 0.068 | runner_up |
| Peter | 0.838 | 0.014 | 0.086 | 0.062 | bottom_tier |
| Alex | 0.838 | 0.014 | 0.086 | 0.062 | bottom_tier |
| Chris | 0.338 | 0.244 | 0.374 | 0.044 | top_tier |
| Nathan | 0.676 | 0.012 | 0.274 | 0.038 | top_tier |
| Mariana | 0.808 | 0.012 | 0.144 | 0.036 | bottom_tier |
| James | 0.342 | 0.274 | 0.352 | 0.032 | top_tier |
| Barb | 0.342 | 0.274 | 0.352 | 0.032 | runner_up |
| Emma | 0.584 | 0.148 | 0.248 | 0.020 | bottom_tier |
| Marcus | 0.796 | 0.152 | 0.034 | 0.018 | bottom_tier |
| Suzy | 0.376 | 0.564 | 0.044 | 0.016 | runner_up |
| Ben | 0.620 | 0.090 | 0.276 | 0.014 | bottom_tier |
| Raeesa | 0.336 | 0.076 | 0.580 | 0.008 | top_tier |
| Robert | 0.198 | 0.042 | 0.752 | 0.008 | top_tier |
| Meg | 0.762 | 0.098 | 0.136 | 0.004 | bottom_tier |
| Max | 0.912 | 0.002 | 0.082 | 0.004 | bottom_tier |
| Jasmin | 0.546 | 0.412 | 0.040 | 0.002 | runner_up |
| Angela | 0.094 | 0.018 | 0.886 | 0.002 | top_tier |
| Janice | 0.644 | 0.320 | 0.034 | 0.002 | bottom_tier |
| Michelle | 0.688 | 0.254 | 0.058 | 0.000 | bottom_tier |

Not too bad, right? But what if we just want to predict which bakers are going to make it to the final? For that we add together the probabilities for the classes “winner” and “runner_up”. Let’s create a new column with that variable.
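
Continuing the sketch from above, that’s one extra column and a re-sort:

```r
# A finalist is either the winner or a runner-up,
# so sum those two class probabilities
probs$gets_to_final <- probs$winner + probs$runner_up
probs[order(-probs$gets_to_final), ]
```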

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome | gets_to_final |
|-------|-------------|-----------|----------|--------|--------------|---------------|
| Suzy | 0.376 | 0.564 | 0.044 | 0.016 | runner_up | 0.580 |
| Dave | 0.316 | 0.388 | 0.228 | 0.068 | runner_up | 0.456 |
| Claudia | 0.472 | 0.102 | 0.104 | 0.322 | winner | 0.424 |
| Jasmin | 0.546 | 0.412 | 0.040 | 0.002 | runner_up | 0.414 |
| Janice | 0.644 | 0.320 | 0.034 | 0.002 | bottom_tier | 0.322 |
| James | 0.342 | 0.274 | 0.352 | 0.032 | top_tier | 0.306 |
| Barb | 0.342 | 0.274 | 0.352 | 0.032 | runner_up | 0.306 |
| Sian | 0.620 | 0.028 | 0.086 | 0.266 | winner | 0.294 |
| Jessica | 0.620 | 0.028 | 0.086 | 0.266 | bottom_tier | 0.294 |
| Chris | 0.338 | 0.244 | 0.374 | 0.044 | top_tier | 0.288 |
| Michelle | 0.688 | 0.254 | 0.058 | 0.000 | bottom_tier | 0.254 |
| Marcus | 0.796 | 0.152 | 0.034 | 0.018 | bottom_tier | 0.170 |
| Emma | 0.584 | 0.148 | 0.248 | 0.020 | bottom_tier | 0.168 |
| Brendan | 0.836 | 0.044 | 0.028 | 0.092 | bottom_tier | 0.136 |
| Ben | 0.620 | 0.090 | 0.276 | 0.014 | bottom_tier | 0.104 |
| Meg | 0.762 | 0.098 | 0.136 | 0.004 | bottom_tier | 0.102 |
| Raeesa | 0.336 | 0.076 | 0.580 | 0.008 | top_tier | 0.084 |
| Peter | 0.838 | 0.014 | 0.086 | 0.062 | bottom_tier | 0.076 |
| Alex | 0.838 | 0.014 | 0.086 | 0.062 | bottom_tier | 0.076 |
| Nathan | 0.676 | 0.012 | 0.274 | 0.038 | top_tier | 0.050 |
| Robert | 0.198 | 0.042 | 0.752 | 0.008 | top_tier | 0.050 |
| Mariana | 0.808 | 0.012 | 0.144 | 0.036 | bottom_tier | 0.048 |
| Angela | 0.094 | 0.018 | 0.886 | 0.002 | top_tier | 0.020 |
| Max | 0.912 | 0.002 | 0.082 | 0.004 | bottom_tier | 0.006 |

Tadaa! The model predicts 4 of the 6 finalists correctly after just the first episode.

Applying the model to the test data

So let’s look at how our model performs on the test data from season 3.
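
Continuing the sketch, assume the held-out season 3 episode-1 rows sit in `ep1_test`:

```r
# Hard class predictions on the held-out season 3 rows
test_pred <- predict(rf_ep1, newdata = ep1_test)
confusionMatrix(test_pred, ep1_test$X)
```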

##              Reference
## Prediction    bottom_tier runner_up top_tier winner
##   bottom_tier           4         1        3      1
##   runner_up             0         0        0      0
##   top_tier              2         1        0      0
##   winner                0         0        0      0

Accuracy is down to 33%. That’s unfortunate. But still, let’s do the same exercise as before and predict the winner by looking at who has the highest probability of winning.

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome |
|-------|-------------|-----------|----------|--------|--------------|
| Olivia | 0.472 | 0.102 | 0.104 | 0.322 | winner |
| Diana | 0.620 | 0.028 | 0.086 | 0.266 | bottom_tier |
| Monica | 0.136 | 0.228 | 0.596 | 0.040 | runner_up |
| Antonio | 0.676 | 0.012 | 0.274 | 0.038 | runner_up |
| Bojan | 0.808 | 0.012 | 0.144 | 0.036 | bottom_tier |
| Janette | 0.342 | 0.274 | 0.352 | 0.032 | bottom_tier |
| Fiona | 0.792 | 0.170 | 0.010 | 0.028 | top_tier |
| Liesel | 0.584 | 0.148 | 0.248 | 0.020 | top_tier |
| James | 0.552 | 0.152 | 0.280 | 0.016 | top_tier |
| Jeremy | 0.336 | 0.076 | 0.580 | 0.008 | bottom_tier |
| Cheryl | 0.762 | 0.098 | 0.136 | 0.004 | bottom_tier |
| Noel | 0.688 | 0.254 | 0.058 | 0.000 | bottom_tier |

That looks way better. Even if the model doesn’t get the final classifications exactly right, it gives Olivia, the real winner of this season, the highest probability of winning.

So let’s also look here at the bakers with the highest probability of making it to the finale.

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome | gets_to_final |
|-------|-------------|-----------|----------|--------|--------------|---------------|
| Olivia | 0.472 | 0.102 | 0.104 | 0.322 | winner | 0.424 |
| Janette | 0.342 | 0.274 | 0.352 | 0.032 | bottom_tier | 0.306 |
| Diana | 0.620 | 0.028 | 0.086 | 0.266 | bottom_tier | 0.294 |
| Monica | 0.136 | 0.228 | 0.596 | 0.040 | runner_up | 0.268 |
| Noel | 0.688 | 0.254 | 0.058 | 0.000 | bottom_tier | 0.254 |
| Fiona | 0.792 | 0.170 | 0.010 | 0.028 | top_tier | 0.198 |
| Liesel | 0.584 | 0.148 | 0.248 | 0.020 | top_tier | 0.168 |
| James | 0.552 | 0.152 | 0.280 | 0.016 | top_tier | 0.168 |
| Cheryl | 0.762 | 0.098 | 0.136 | 0.004 | bottom_tier | 0.102 |
| Jeremy | 0.336 | 0.076 | 0.580 | 0.008 | bottom_tier | 0.084 |
| Antonio | 0.676 | 0.012 | 0.274 | 0.038 | runner_up | 0.050 |
| Bojan | 0.808 | 0.012 | 0.144 | 0.036 | bottom_tier | 0.048 |

That could obviously be better: only 1 of the 3 finalists is predicted correctly. But we’ll take a look at how the predictions change over the course of the season.

Retraining the model including Season 3

As mentioned before, we’ll retrain the model, now also using the third season. This will (hopefully) give us an even better model for the upcoming fifth season.
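
A sketch of the retraining step, reusing the names and control settings from the sketches above:

```r
# Stack the former training and test rows and refit with the same settings
ep1_all <- rbind(ep1_train, ep1_test)

set.seed(42)
rf_ep1_full <- train(X ~ rank_in_technical + mean_rank_in_technical +
                       star + mean_star + top + mean_top + flop + mean_flop,
                     data = ep1_all,
                     method = "rf",
                     trControl = ctrl)

confusionMatrix(predict(rf_ep1_full, newdata = ep1_all), ep1_all$X)
```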

##              Reference
## Prediction    bottom_tier runner_up top_tier winner
##   bottom_tier          18         1        3      1
##   runner_up             0         5        1      0
##   top_tier              0         0        5      0
##   winner                0         0        0      2

As hoped, accuracy has gone up; we are now at 83%. Let’s take a quick look at the predicted winners and finalists before we let go of episode 1 and turn to the next ones.

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome |
|-------|-------------|-----------|----------|--------|--------------|
| Claudia | 0.046 | 0.064 | 0.004 | 0.886 | winner |
| Olivia | 0.046 | 0.064 | 0.004 | 0.886 | winner |
| Sian | 0.698 | 0.000 | 0.000 | 0.302 | winner |
| Jessica | 0.698 | 0.000 | 0.000 | 0.302 | bottom_tier |
| Diana | 0.698 | 0.000 | 0.000 | 0.302 | bottom_tier |
| Brendan | 0.762 | 0.016 | 0.000 | 0.222 | bottom_tier |
| Dave | 0.022 | 0.812 | 0.122 | 0.044 | runner_up |
| Chris | 0.094 | 0.184 | 0.702 | 0.020 | top_tier |
| Monica | 0.006 | 0.810 | 0.164 | 0.020 | runner_up |
| Suzy | 0.056 | 0.860 | 0.068 | 0.016 | runner_up |
| Peter | 0.940 | 0.016 | 0.036 | 0.008 | bottom_tier |
| Alex | 0.940 | 0.016 | 0.036 | 0.008 | bottom_tier |
| Fiona | 0.332 | 0.018 | 0.644 | 0.006 | top_tier |
| Jasmin | 0.250 | 0.634 | 0.116 | 0.000 | runner_up |
| Angela | 0.152 | 0.020 | 0.828 | 0.000 | top_tier |
| James | 0.384 | 0.314 | 0.302 | 0.000 | top_tier |
| Nathan | 0.154 | 0.434 | 0.412 | 0.000 | top_tier |
| Ben | 0.850 | 0.000 | 0.150 | 0.000 | bottom_tier |
| Janice | 0.730 | 0.220 | 0.050 | 0.000 | bottom_tier |
| Meg | 0.906 | 0.018 | 0.076 | 0.000 | bottom_tier |
| Mariana | 0.908 | 0.054 | 0.038 | 0.000 | bottom_tier |
| Barb | 0.384 | 0.314 | 0.302 | 0.000 | runner_up |
| Raeesa | 0.570 | 0.000 | 0.430 | 0.000 | top_tier |
| Robert | 0.036 | 0.060 | 0.904 | 0.000 | top_tier |
| Marcus | 0.794 | 0.028 | 0.178 | 0.000 | bottom_tier |
| Michelle | 0.938 | 0.056 | 0.006 | 0.000 | bottom_tier |
| Emma | 0.512 | 0.004 | 0.484 | 0.000 | bottom_tier |
| Max | 0.794 | 0.094 | 0.112 | 0.000 | bottom_tier |
| Antonio | 0.154 | 0.434 | 0.412 | 0.000 | runner_up |
| Liesel | 0.512 | 0.004 | 0.484 | 0.000 | top_tier |
| James | 0.158 | 0.084 | 0.758 | 0.000 | top_tier |
| Noel | 0.938 | 0.056 | 0.006 | 0.000 | bottom_tier |
| Jeremy | 0.570 | 0.000 | 0.430 | 0.000 | bottom_tier |
| Bojan | 0.908 | 0.054 | 0.038 | 0.000 | bottom_tier |
| Cheryl | 0.906 | 0.018 | 0.076 | 0.000 | bottom_tier |
| Janette | 0.384 | 0.314 | 0.302 | 0.000 | bottom_tier |

Interestingly, the winners are predicted very accurately, but they are followed by some bakers who ended up in the bottom tier, rather than by the runner-ups. We can also see, though, that the bakers who finished as runners-up seem to have a high probability of ending up in that class, so let’s look at the potential finalists:

| baker | bottom_tier | runner_up | top_tier | winner | real_outcome | gets_to_final |
|-------|-------------|-----------|----------|--------|--------------|---------------|
| Claudia | 0.046 | 0.064 | 0.004 | 0.886 | winner | 0.950 |
| Olivia | 0.046 | 0.064 | 0.004 | 0.886 | winner | 0.950 |
| Suzy | 0.056 | 0.860 | 0.068 | 0.016 | runner_up | 0.876 |
| Dave | 0.022 | 0.812 | 0.122 | 0.044 | runner_up | 0.856 |
| Monica | 0.006 | 0.810 | 0.164 | 0.020 | runner_up | 0.830 |
| Jasmin | 0.250 | 0.634 | 0.116 | 0.000 | runner_up | 0.634 |
| Nathan | 0.154 | 0.434 | 0.412 | 0.000 | top_tier | 0.434 |
| Antonio | 0.154 | 0.434 | 0.412 | 0.000 | runner_up | 0.434 |
| James | 0.384 | 0.314 | 0.302 | 0.000 | top_tier | 0.314 |

7 out of 10 is not too bad in my opinion. We’ll check how this number improves over the episodes, once we have models for all of them.

Coming up

This is the end of part one, since otherwise this post would be waaay too long. But stay tuned for the upcoming posts about the next episodes and, of course, the validation of the models in real life, once this year’s season is running. Part 2 can be found here

Code

All the raw data and the code can be found in this GitHub repo