In our last lab, we created a data.frame named final.elos with ELO ratings for every team at the end of the 2016 season. In this lab, we're going to test the quality of those ratings.
We’ve talked about root mean square error (RMSE) before. We calculate it as:
\[ RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (prediction_i - actual_i)^2 }\]
Or, in other words, first you calculate the errors, then you square them, then you average them and finally you take the square root. Let’s write a function for this in R:
RMSE <- function(predictions, actuals){
  sqrt(mean((predictions - actuals)^2))
}
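As a quick sanity check, we can call it on a few made-up predictions (these numbers are just for illustration):

# made-up predictions for three games and the actual outcomes (1 = win, 0 = loss)
RMSE(c(0.9, 0.6, 0.3), c(1, 1, 0))  # about 0.29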
Log Loss is another metric, not that different from root mean square error, that people use to estimate the size of errors. It happens to be the metric that Kaggle uses for the tournament. We’ll talk more about the particulars of log loss (after we talk about logarithms) but for now, here’s the formula:
\[LogLoss = - \frac{1}{n} \sum\limits_{i=1}^n [y_i \cdot log_e(\hat{y_i}) + (1-y_i) \cdot log_e(1-\hat{y_i}) ]\]
and here’s a function to calculate it in R. Notice that it takes the same two inputs as our RMSE function:
LogLoss <- function(predictions, actuals){
  (-1/length(predictions)) * sum(actuals * log(predictions) + (1-actuals) * log(1-predictions))
}
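As a rough illustration (made-up numbers again), log loss barely penalizes a confident prediction that turns out right but punishes a confident prediction that turns out wrong very heavily:

LogLoss(0.9, 1)  # confident and correct: about 0.105
LogLoss(0.9, 0)  # the same prediction, when wrong: about 2.303
# note: a prediction of exactly 0 or 1 that turns out wrong gives log(0) = -Inf,
# so extreme probabilities are risky under log loss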
First, check to make sure that you have the ELO ratings from the last lab handy. If not, you’ll need to go back and recreate them.
View(final.elos)
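If View() comes up empty, one quick way (just a suggestion) to check whether the object is still around and what its columns look like is:

exists("final.elos")  # should be TRUE
head(final.elos)      # peek at the team and rating columns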
Next, we’ll need to read in the results of the 2016 tournament which we’ll use to evaluate our ratings.
tourney <- read.csv('/home/rstudioshared/shared_files/data/TourneyCompactResults.csv')
View(tourney)
## Season Daynum Wteam Wscore Lteam Lscore Wloc Numot
## 1 1985 136 1116 63 1234 54 N 0
## 2 1985 136 1120 59 1345 58 N 0
## 3 1985 136 1207 68 1250 43 N 0
## 4 1985 136 1229 58 1425 55 N 0
## 5 1985 136 1242 49 1325 38 N 0
## 6 1985 136 1246 66 1449 58 N 0
This file dates all the way back to 1985, but we just need the 2016 results. While filtering the results, we can also add a column entitled “win” that is simply a column of 1’s since, from the perspective of Wteam, every game was a win.
library(dplyr)
tourney2016 <- tourney %>% filter(Season==2016) %>% mutate(win=1)
If you take a look at our new data.frame, you’ll see that there are now 67 games, just enough for all but one of the 68 teams to get eliminated.
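We can confirm that count directly:

nrow(tourney2016)  # should be 67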
Now, it’s time to join these tournament games with our ELO ratings:
tourney2016 <- left_join(tourney2016, final.elos, by=c("Wteam"="team"))
tourney2016 <- left_join(tourney2016, final.elos, by=c("Lteam"="team"))
Notice that we needed to perform two joins: one to attach the ELO rating of the winning team and one to attach the rating of the losing team. It’s probably a good idea to take another look at our data.frame.
View(tourney2016)
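One thing worth checking after a left join is whether any games failed to match a rating. Assuming the rating column from the last lab is named elo.end (so the two joins produce elo.end.x and elo.end.y), a check might look like this; zero rows means every team matched:

tourney2016 %>% filter(is.na(elo.end.x) | is.na(elo.end.y))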
Now, we’ll need the function we created in the last lab that predicts winning %’s using ELO ratings:
Ewins <- function(rating, opp.rating){
  1/(1 + 10^((opp.rating-rating)/400))
}
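As a quick reminder of how it behaves, a team rated 200 points above its opponent is expected to win about 76% of the time:

Ewins(1700, 1500)  # about 0.76
Ewins(1500, 1700)  # about 0.24 -- the two probabilities always sum to 1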
We can use it to make predictions for each tournament game:
tourney2016 <- tourney2016 %>% mutate(prediction = Ewins(elo.end.x, elo.end.y))
View(tourney2016)
According to our ELO ratings, what game was the surest bet in the tournament? Do these game probabilities seem correct? If not, what’s wrong?
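One way to look for that surest bet (a sketch using the columns we just created) is to sort the games by our prediction:

tourney2016 %>% arrange(desc(prediction)) %>% head()  # games where we were most confident the eventual winner would win
tourney2016 %>% arrange(prediction) %>% head()        # games where the eventual winner was a big underdog (upsets)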
Now, let’s calculate RMSE and Log Loss for our ELO tournament predictions:
tourney2016 %>% summarize(rmse = RMSE(prediction, win), logloss = LogLoss(prediction,win))
## rmse logloss
## 1 0.4768786 0.6455538
How did we do? Actually, it’s a bit hard to tell. If we predicted that every team had a 50% chance of winning every game, our RMSE would be 0.500 and our Log Loss would be 0.693. So, at least we did better than that!
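You can check that baseline yourself by feeding a constant prediction of 0.5 for all 67 games into the same two functions:

RMSE(rep(0.5, nrow(tourney2016)), tourney2016$win)     # 0.500
LogLoss(rep(0.5, nrow(tourney2016)), tourney2016$win)  # log(2), about 0.693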
But how did we stack up against serious competition? Looking at the 2016 Kaggle Leaderboard, we see that we would have finished 415th out of 598 data scientists. On the upside, we’re already besting 31% of the competition; on the other hand, we still have a lot of work to do!
Here are some ideas:
Using multiple years’ worth of data. It takes a while for ELO to figure out who the best teams are, particularly since teams from the best and worst divisions may not play each other that often. What if, instead of starting every team with an ELO rating of 1500 at the start of the 2016 season, we ran ELO on the 2015 season and used the final 2015 ELO ratings as the starting values for 2016?
Taking into account home games and back-to-back games. We now know that teams are more likely to win at home and when they have not played the day before. Could we adjust our expected wins formula to account for these factors? (One possible tweak is sketched just after this list.)
Using scoring margins. A win is a win… but not when you’re trying to make predictions. A 20-point win is much more compelling than a nail-biter and tells us more about who the better team is. Perhaps we could take margin of victory into account.
A different K value. We used a K value of 25 to calculate our ELO ratings but what if we’d used either a larger or smaller K?
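As a taste of the home-court idea above, here is one possible sketch. The 65-point home bonus is purely an assumption for illustration (and tournament games are mostly on neutral courts), but the same trick could be applied to regular-season games when rebuilding the ratings:

Ewins.home <- function(rating, opp.rating, home = 0, home.adv = 65){
  # home is 1 if the first team is at home, -1 if its opponent is, 0 for a neutral court
  # home.adv is a made-up 65-point bonus -- an assumption, not a fitted value
  1/(1 + 10^((opp.rating - rating - home*home.adv)/400))
}
Ewins.home(1600, 1600, home=1)  # an evenly rated home team wins about 59% of the time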
Can you think of other ideas that we could use to improve our predictions?