In our last lab, we created a data.frame named final.elos with ELO ratings for every team at the end of the 2016 season. In this lab, we're going to test the quality of those ratings.
We’ve talked about root mean square error (RMSE) before. We calculate it as:
\[ RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (prediction_i - actual_i)^2 }\]
Or, in other words, first you calculate the errors, then you square them, then you average them and finally you take the square root. Let’s write a function for this in R:
RMSE <- function(predictions, actuals){
  sqrt(mean((predictions - actuals)^2))
}
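As a quick sanity check, we can call it on a few made-up predictions (these numbers are just for illustration):

# made-up predictions for three games and the actual outcomes (1 = win, 0 = loss)
RMSE(c(0.9, 0.6, 0.3), c(1, 1, 0))  # about 0.29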
Log Loss is another metric, not that different from root mean square error, that people use to estimate the size of errors. It happens to be the metric that Kaggle uses for the tournament. We’ll talk more about the particulars of log loss (after we talk about logarithms) but for now, here’s the formula:
\[LogLoss = - \frac{1}{n} \sum\limits_{i=1}^n [y_i \cdot log_e(\hat{y_i}) + (1-y_i) \cdot log_e(1-\hat{y_i}) ]\]
and here’s a function to calculate it in R. Notice that it takes the same two inputs as our RMSE function:
LogLoss <- function(predictions, actuals){
  (-1/length(predictions)) * sum(actuals * log(predictions) + (1-actuals) * log(1-predictions))
}
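As a rough illustration (made-up numbers again), log loss barely penalizes a confident prediction that turns out right but punishes a confident prediction that turns out wrong very heavily:

LogLoss(0.9, 1)  # confident and correct: about 0.105
LogLoss(0.9, 0)  # the same prediction, when wrong: about 2.303
# note: a prediction of exactly 0 or 1 that turns out wrong gives log(0) = -Inf,
# so extreme probabilities are risky under log loss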
First, check to make sure that you have the ELO ratings from the last lab handy. If not, you’ll need to go back and recreate them.
View(final.elos)
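If View() comes up empty, one quick way (just a suggestion) to check whether the object is still around and what its columns look like is:

exists("final.elos")  # should be TRUE
head(final.elos)      # peek at the team and rating columns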
Next, we’ll need to read in the results of the 2016 tournament which we’ll use to evaluate our ratings.
tourney <- read.csv('/home/rstudioshared/shared_files/data/TourneyCompactResults.csv')
View(tourney)
## Season Daynum Wteam Wscore Lteam Lscore Wloc Numot
## 1 1985 136 1116 63 1234 54 N 0
## 2 1985 136 1120 59 1345 58 N 0
## 3 1985 136 1207 68 1250 43 N 0
## 4 1985 136 1229 58 1425 55 N 0
## 5 1985 136 1242 49 1325 38 N 0
## 6 1985 136 1246 66 1449 58 N 0
This file dates all the way back to 1985, but we just need the 2016 results. While filtering the results, we can also add a column entitled “win” that is simply a column of 1’s since, from the perspective of Wteam, every game was a win.
library(dplyr)
tourney2016 <- tourney %>% filter(Season==2016) %>% mutate(win=1)
If you take a look at our new data.frame, you’ll see that there are now 67 games, just enough for all but one of the 68 teams to get eliminated.
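We can confirm that count directly:

nrow(tourney2016)  # should be 67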
Now, it’s time to join these tournament games with our ELO ratings:
tourney2016 <- left_join(tourney2016, final.elos, by=c("Wteam"="team"))
tourney2016 <- left_join(tourney2016, final.elos, by=c("Lteam"="team"))
Notice that we needed to perform two joins: one to attach the ELO rating of the winning team and one to attach the rating of the losing team. It’s probably a good idea to take another look at our data.frame.
View(tourney2016)
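One thing worth checking after a left join is whether any games failed to match a rating. Assuming the rating column from the last lab is named elo.end (so the two joins produce elo.end.x and elo.end.y), a check might look like this; zero rows means every team matched:

tourney2016 %>% filter(is.na(elo.end.x) | is.na(elo.end.y))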
Now, we’ll need the function we created in the last lab that predicts winning %’s using ELO ratings:
Ewins <- function(rating, opp.rating){
  1/(1 + 10^((opp.rating-rating)/400))
}
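As a quick reminder of how it behaves, a team rated 200 points above its opponent is expected to win about 76% of the time:

Ewins(1700, 1500)  # about 0.76
Ewins(1500, 1700)  # about 0.24 -- the two probabilities always sum to 1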
We can use it to make predictions for each tournament game:
tourney2016 <- tourney2016 %>% mutate(prediction = Ewins(elo.end.x, elo.end.y))
View(tourney2016)
According to our ELO ratings, what game was the surest bet in the tournament? Do these game probabilities seem correct? If not, what’s wrong?
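One way to look for that surest bet (a sketch using the columns we just created) is to sort the games by our prediction:

tourney2016 %>% arrange(desc(prediction)) %>% head()  # games where we were most confident the eventual winner would win
tourney2016 %>% arrange(prediction) %>% head()        # games where the eventual winner was a big underdog (upsets)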
Now, let’s calculate RMSE and Log Loss for our ELO tournament predictions:
tourney2016 %>% summarize(rmse = RMSE(prediction, win), logloss = LogLoss(prediction,win))
## rmse logloss
## 1 0.4768786 0.6455538
How did we do? Actually, it’s a bit hard to tell. If we predicted that every team had a 50% chance of winning every game, our RMSE would be 0.500 and our Log Loss would be 0.693. So, at least we did better than that!
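You can check that baseline yourself by feeding a constant prediction of 0.5 for all 67 games into the same two functions:

RMSE(rep(0.5, nrow(tourney2016)), tourney2016$win)     # 0.500
LogLoss(rep(0.5, nrow(tourney2016)), tourney2016$win)  # log(2), about 0.693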
But how did we stack up against serious competition? Looking at the 2016 Kaggle Leaderboard, we see that we would have finished 415th out of 598 data scientists. On the upside, we’re already besting 31% of the competition; on the other hand, we still have a lot of work to do!
Here are some ideas:
Using multiple years’ worth of data. It takes a while for ELO to figure out who the best teams are, particularly since teams from the best and worst divisions may not play each other that often. What if, instead of starting every team with an ELO rating of 1500 at the start of the 2016 season, we ran ELO on the 2015 season and used the final 2015 ELO ratings as the starting values for 2016?
Taking into account home games and back-to-back games. We now know that teams are more likely to win at home and when they have not played the day before. Could we adjust our expected wins formula to account for these factors? (One possible tweak is sketched just after this list.)
Using scoring margins. A win is a win… but not when you’re trying to make predictions. A 20-point win is much more compelling than a nail-biter and tells us more about who the better team is. Perhaps we could take margin of victory into account.
A different K value. We used a K value of 25 to calculate our ELO ratings but what if we’d used either a larger or smaller K?
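As a taste of the home-court idea above, here is one possible sketch. The 65-point home bonus is purely an assumption for illustration (and tournament games are mostly on neutral courts), but the same trick could be applied to regular-season games when rebuilding the ratings:

Ewins.home <- function(rating, opp.rating, home = 0, home.adv = 65){
  # home is 1 if the first team is at home, -1 if its opponent is, 0 for a neutral court
  # home.adv is a made-up 65-point bonus -- an assumption, not a fitted value
  1/(1 + 10^((opp.rating - rating - home*home.adv)/400))
}
Ewins.home(1600, 1600, home=1)  # an evenly rated home team wins about 59% of the time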
Can you think of other ideas that we could use to improve our predictions?