As you may well recall…

In our last lab, we created a data.frame entitled final.elos with ELO ratings for every team at the end of the 2016 season. In this lab we’re going to test the quality of those ELO ratings.

Root Mean Square Error

We’ve talked about root mean square error (RMSE) before. We calculate it as:

\[ RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (prediction_i - actual_i)^2 }\]

Or, in other words, first you calculate the errors, then you square them, then you average them and finally you take the square root. Let’s write a function for this in R:

# square the errors, average them, then take the square root
RMSE <- function(predictions, actuals){
  sqrt(mean((predictions - actuals)^2))
}
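As a quick sanity check, here’s RMSE on a few made-up predictions and outcomes (the numbers below are just for illustration):

RMSE(c(0.9, 0.6, 0.5), c(1, 0, 1))  # roughly 0.45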

Log Loss

Log Loss is another metric, not that different from root mean square error, that people use to measure the size of errors. It happens to be the metric that Kaggle uses for the tournament. We’ll talk more about the particulars of log loss (after we talk about logarithms) but for now, here’s the formula:

\[LogLoss = - \frac{1}{n} \sum\limits_{i=1}^n [y_i \cdot \log_e(\hat{y}_i) + (1-y_i) \cdot \log_e(1-\hat{y}_i) ]\]

and here’s a function to calculate it in R. Notice that it takes the same two inputs as our RMSE function:

# average of -log(probability assigned to the actual outcome)
LogLoss <- function(predictions, actuals){
  (-1/length(predictions)) * sum(actuals * log(predictions) + (1-actuals) * log(1-predictions))
}
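It’s worth seeing one property of log loss right away (again with made-up numbers): it punishes confident wrong predictions much more harshly than it rewards confident right ones.

LogLoss(0.9, 1)  # confident and right: -log(0.9), about 0.105
LogLoss(0.9, 0)  # confident and wrong: -log(0.1), about 2.303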

Evaluating 2016 ELO Ratings

First, check to make sure that you have the ELO ratings from the last lab handy. If not, you’ll need to go back to our last lab to recreate them.

View(final.elos)

Next, we’ll need to read in the results of the 2016 tournament, which we’ll use to evaluate our ratings.

tourney <- read.csv('/home/rstudioshared/shared_files/data/TourneyCompactResults.csv')
View(tourney)
##   Season Daynum Wteam Wscore Lteam Lscore Wloc Numot
## 1   1985    136  1116     63  1234     54    N     0
## 2   1985    136  1120     59  1345     58    N     0
## 3   1985    136  1207     68  1250     43    N     0
## 4   1985    136  1229     58  1425     55    N     0
## 5   1985    136  1242     49  1325     38    N     0
## 6   1985    136  1246     66  1449     58    N     0

This file dates all the way back to 1985; we just need the 2016 results. While filtering the results, we can also add a column entitled “win” that is simply a column of 1’s since, from the perspective of Wteam, every game was a win.

library(dplyr)
tourney2016 <- tourney %>% filter(Season==2016) %>% mutate(win=1)

If you take a look at our new data.frame, you’ll see that there are now 67 games, just enough for all but one of the 68 teams to be eliminated.
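You can confirm the game count directly:

nrow(tourney2016)  # should be 67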

Now, it’s time to join these tournament games with our ELO ratings:

tourney2016 <- left_join(tourney2016, final.elos, by=c("Wteam"="team"))
tourney2016 <- left_join(tourney2016, final.elos, by=c("Lteam"="team"))

Notice that we needed to perform two joins: one to attach our ELO rating for the winning team and one to attach our ELO rating for the losing team. Since both joins bring in columns with the same names, dplyr adds the suffixes .x and .y, so the winner’s rating ends up in elo.end.x and the loser’s in elo.end.y. It’s probably a good idea to take another look at our data.frame.

View(tourney2016)
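If any tournament team were missing from final.elos, the joins would leave NA ratings behind and our predictions would come out as NA. A quick check, using the suffixed column names described above:

tourney2016 %>% summarize(missing.ratings = sum(is.na(elo.end.x) | is.na(elo.end.y)))  # should be 0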

Now, we’ll need the function we created in the last lab that predicts winning percentages using ELO ratings:

Ewins <- function(rating, opp.rating){
  1/(1 + 10^((opp.rating - rating)/400))
}
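Two quick checks on this function: evenly matched teams should come out at exactly 50%, and a team with a 400-point edge should be expected to win about 10 times for every loss.

Ewins(1500, 1500)  # 0.5
Ewins(1900, 1500)  # about 0.909, i.e. 10-to-1 odds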

We can use it to make a prediction for each tournament game:

tourney2016 <- tourney2016 %>% mutate(prediction = Ewins(elo.end.x, elo.end.y))
View(tourney2016)

According to our ELO ratings, what game was the surest bet in the tournament? Do these game probabilities seem correct? If not, what’s wrong?
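One way to answer the first question is to sort the games by predicted win probability:

tourney2016 %>% arrange(desc(prediction)) %>% select(Wteam, Lteam, prediction) %>% head()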

Now, let’s calculate RMSE and Log Loss for our ELO tournament predictions:

tourney2016 %>% summarize(rmse = RMSE(prediction, win), logloss = LogLoss(prediction,win))
##        rmse   logloss
## 1 0.4768786 0.6455538

How did we do? Actually, it’s a bit hard to tell. If we predicted that every team had a 50% chance of winning every game our RMSE would be 0.500 and our Log Loss would be 0.693. So, at least we did better than that!
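Those baseline numbers are easy to verify: with a constant 50% prediction, every error is exactly 0.5, and the log loss is -log(0.5), which is about 0.693.

coin.flips <- rep(0.5, nrow(tourney2016))  # predict a 50% chance in every game
RMSE(coin.flips, tourney2016$win)          # 0.5
LogLoss(coin.flips, tourney2016$win)       # log(2), about 0.693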

But how did we stack up against serious competition? Looking at the 2016 Kaggle Leaderboard, we see that we would have finished 415th out of 598 data scientists. On the upside, we’re already besting 31% of the competition; on the other hand, we still have a lot of work to do!

How can we do better?

Here are some ideas:

Can you think of other ideas that we could use to improve our predictions?