Results Dashboard

Web App

Overview

This quick analysis is a simple look at how well the AFL prediction models listed on Squiggle perform relative to the aggregated betting odds provided by Odds Portal.

I’ve always been rather skeptical about the utility of using Elo models to predict the head-to-head outcome of AFL matches, simply because it’s extremely difficult (if not impossible) to consistently beat an efficient market. The reason for this is the Wisdom of the Crowds principle: the aggregate prediction of a large, independent and intellectually diverse population will usually be more accurate than that of any single individual, because the noise and many of the biases in individual predictions cancel each other out. It is therefore unlikely that a ratings system, using the same information available to the public, could consistently outperform the consensus opinion.

Elo Models

We can use Squiggle’s publicly available API to load the historical tipping data from all of the models listed on the site.
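A minimal sketch of how that might look, assuming the tips endpoint of the API (the query format and the get_tips helper below are illustrative rather than the exact code used):

# Pull the historical tips from the Squiggle API, one season at a time
# (the "tips" element name and query format are assumptions based on the public API docs)
library(jsonlite)

get_tips = function(year) {
  url = paste0("https://api.squiggle.com.au/?q=tips;year=", year)
  fromJSON(url)$tips
}

predictions = do.call(rbind, lapply(2017:2020, get_tips))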

Since we are going to compare the predictive performance of these models against the betting data, the sample sizes need to be the same. Hence, we should only analyse models that have predicted every game within a specific time span. The number of predictions made by each model across each season can be seen using a pivot table.
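For example, a simple cross-tabulation of model against season (assuming the loaded data sits in a predictions data frame with source and year columns) gives the counts below.

# Number of tips made by each model in each season
table(predictions$source, predictions$year)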

##                        
##                         2017 2018 2019 2020
##   AFL Gains                0    0  207    0
##   AFL Lab                  0    0  207    9
##   AFL_GO                   0    0    0    9
##   AFLalytics               0    0  207    9
##   Aggregate              207  207  207    9
##   Fat Stats                0    0  207    9
##   Footy Maths Institute  207  207  207    9
##   Graft                  207  207  207    9
##   HPN                      0  207  207    0
##   Live Ladders             0  207  207    9
##   Massey Ratings           0  207  207    9
##   Matter of Stats        207  207  207    9
##   PlusSixOne             207  207  198    9
##   Punters                  0    0  207    9
##   Squiggle               207  207  207    9
##   Stattraction             0  207  207    9
##   Swinburne                0  207  207    9
##   The Arc                207  207  207    9
##   The Flag                 0    0    0    9

From the above table it is clear that most of the models were making predictions by 2018, so we’ll filter the data to only include models that have made continuous predictions from 2018-2020 (a sketch of this step is shown below). However, any combination of years can be analysed through the Shiny app that I made to summarise the results in a more in-depth way.
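As a rough sketch, that filtering step might look something like this (base R, with assumed column names):

# Keep the 2018-2020 seasons, then drop any model that missed a game in that span
predictions = predictions[predictions$year >= 2018, ]
tips.per.model = table(predictions$source)
complete.models = names(tips.per.model)[tips.per.model == max(tips.per.model)]
predictions = predictions[predictions$source %in% complete.models, ]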

In order to determine the accuracy of a prediction, Squiggle uses a measure called bits, which rewards those who were more confident in a tip that was correct and punishes those who were more confident in a tip that ended up being incorrect. I also decided to calculate the Brier score for each prediction, since this is a far more common way of measuring predictive accuracy (although the two measures are very highly correlated, with an R-squared of ~0.98). Since Brier scores are a measure of error (specifically mean squared error), a lower score is better.
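As a sketch, for a single tip with forecast probability p for the tipped team, the two measures could be computed as below (the bits formula follows Squiggle’s description, draws are ignored for simplicity, and the function names are my own):

# p:       forecast probability for the tipped team (between 0.5 and 1)
# correct: 1 if the tipped team won, 0 otherwise
bits_score  = function(p, correct) ifelse(correct == 1, 1 + log2(p), 1 + log2(1 - p))
brier_score = function(p, correct) (p - correct)^2

bits_score(0.75, 1)   # confident and correct: ~0.58 bits gained
bits_score(0.75, 0)   # confident and wrong:   -1 bit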

The same process is then applied to the Odds Portal betting data, sourced from aussportsbetting.com.
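In sketch form, using the hypothetical helpers above and the tip.odds / tip.result columns that appear in the code further down:

# Score the bookmakers' favourite in the same way as the models
# (tip.odds is the implied win probability of the favourite, tip.result is 1/0)
betting.odds$bits  = bits_score(betting.odds$tip.odds, betting.odds$tip.result)
betting.odds$brier = brier_score(betting.odds$tip.odds, betting.odds$tip.result)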

Model Performance

The Brier and bits scores, along with overall tipping accuracy, provide a basic summary of each model’s performance.
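A sketch of how that summary could be produced, assuming per-game brier, bits and correct columns, and that the betting data has been bound on as an additional source:

library(dplyr)

# Average Brier, bits and tipping accuracy for each source, best Brier first
summary.table = group_by(predictions, source) %>%
  summarise(Brier = round(mean(brier), 3),
            Bits = round(mean(bits), 3),
            Accuracy = round(100 * mean(correct), 1))
summary.table[order(summary.table$Brier), ]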

##                   source Brier  Bits Accuracy
## 1           Betting Odds 0.201 0.153     69.0
## 2              Aggregate 0.202 0.149     68.1
## 5           Live Ladders 0.202 0.153     68.6
## 9               Squiggle 0.202 0.151     68.3
## 11             Swinburne 0.203 0.148     67.4
## 7        Matter of Stats 0.204 0.146     66.4
## 12               The Arc 0.205 0.139     67.8
## 3  Footy Maths Institute 0.206 0.136     68.6
## 4                  Graft 0.206 0.136     67.6
## 6         Massey Ratings 0.209 0.125     69.3
## 8             PlusSixOne 0.212 0.116     68.4
## 10          Stattraction 0.212 0.117     66.9

A deeper analysis can also be performed by assessing model performance across the three components that make up the Brier score: reliability (how closely the binned forecast probabilities match the observed win frequencies; lower is better), resolution (how much the observed frequencies in each forecast bin differ from the overall base rate; higher is better) and uncertainty (the inherent variance of the outcomes, which doesn’t depend on the forecaster).
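In symbols, with the forecasts grouped into K bins (n_k games in bin k, average forecast f_k and observed win frequency o_k in that bin) and overall base rate o, the standard decomposition that the code below implements is:

$$\text{Brier} = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k (f_k - \bar{o}_k)^2}_{\text{Reliability}} - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2}_{\text{Resolution}} + \underbrace{\bar{o}(1 - \bar{o})}_{\text{Uncertainty}}$$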

library(dplyr)

# Create bins for the forecasted probabilities
predictions$conf.bin = cut(predictions$confidence,breaks = seq(50,100,1),right=FALSE)
betting.odds$conf.bin = cut(100*betting.odds$tip.odds,breaks = seq(50,100,1),right=FALSE)
        
# Calculate the average correct outcomes, forecasted odds & total number of observations
# across each bin
data =  as.data.frame(group_by(predictions,conf.bin,source)%>%
          summarise(Result = mean(correct),Forecast = mean(confidence)/100,Count = n()))
         
data.betting =  as.data.frame(group_by(betting.odds,conf.bin)%>%
          summarise(Result = mean(tip.result),Forecast = mean(tip.odds),Count = n()))
        
# Elo model base rate
Base.Rate = group_by(predictions,source)%>%
                summarise(Base.Rate = mean(correct))
         
# Add the overall base rate value to binned data 
data = left_join(data,Base.Rate,by="source")
         
# Betting Odds base rate 
data.betting$Base.Rate =  mean(betting.odds$tip.result)
         
## Reliability, Resolution & Uncertainty Calculations
# Elo models
data$reliability = data$Count*(data$Forecast - data$Result)^2
data$resolution = data$Count*(data$Result - data$Base.Rate)^2
data$uncertainty = data$Base.Rate*(1-data$Base.Rate)
        
table = group_by(data,source)%>% summarise(Reliability = sum(reliability)/sum(Count), 
                                           Resolution = sum(resolution)/sum(Count),
                                           Uncertainty = mean(uncertainty),
                                           Brier = Reliability-Resolution+Uncertainty)
    
# Betting Odds 
data.betting$reliability = data.betting$Count*(data.betting$Forecast - data.betting$Result)^2
data.betting$resolution = data.betting$Count*(data.betting$Result - data.betting$Base.Rate)^2
data.betting$uncertainty = data.betting$Base.Rate*(1-data.betting$Base.Rate)

table.betting = data.betting %>% summarise(source = "Betting Odds", 
                                           Reliability = sum(reliability)/sum(Count), 
                                           Resolution = sum(resolution)/sum(Count),
                                           Uncertainty = mean(uncertainty),
                                           Brier = Reliability-Resolution+Uncertainty)
      
table.Brier = rbind(table, table.betting)
table.Brier[order(table.Brier$Brier), ]
## # A tibble: 12 x 5
##    source                Reliability Resolution Uncertainty Brier
##    <chr>                       <dbl>      <dbl>       <dbl> <dbl>
##  1 Betting Odds               0.0157     0.0288       0.214 0.201
##  2 Squiggle                   0.0193     0.0339       0.216 0.202
##  3 Live Ladders               0.0159     0.0296       0.216 0.202
##  4 Aggregate                  0.0135     0.0285       0.217 0.202
##  5 Swinburne                  0.0179     0.0346       0.220 0.203
##  6 Matter of Stats            0.0143     0.0326       0.223 0.205
##  7 The Arc                    0.0192     0.0325       0.218 0.205
##  8 Footy Maths Institute      0.0122     0.0219       0.216 0.206
##  9 Graft                      0.0144     0.0270       0.219 0.206
## 10 Massey Ratings             0.0243     0.0280       0.213 0.209
## 11 PlusSixOne                 0.0209     0.0254       0.216 0.212
## 12 Stattraction               0.0216     0.0311       0.221 0.212

From the above table, we can see that the Betting Odds model earns its low Brier score through moderate-to-good performance across all three components (the results are easier to see in the web app), whereas the other models only perform well in one or two of them.
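One way such a comparison might be written (a sketch that simply filters the decomposition table above for models beating the betting odds on both reliability and resolution):

# Models with lower reliability (better) and higher resolution (better) than the betting odds
odds = filter(table.Brier, source == "Betting Odds")
table.Brier %>%
  filter(source != "Betting Odds",
         Reliability < odds$Reliability,
         Resolution > odds$Resolution) %>%
  select(source)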

## # A tibble: 1 x 1
##   source         
##   <chr>          
## 1 Matter of Stats

The Matter of Stats model was the only one that outperformed the betting odds on both reliability and resolution over the 2018-2020 period, which suggests that its relatively poor overall accuracy may come from performing slightly worse than expected in the marginal 50-50 games.

Summary

It is not uncommon for various models to beat the market over any one- or two-year period (as can be seen in the aforementioned web app), but it is clear that no one has been able to significantly beat the betting market over the last three seasons (Massey Ratings were only more accurate by 1 tip out of 423 games). However, this doesn’t necessarily detract from the quality of the models listed on Max Barry’s Squiggle site; it is more a demonstration of the relative efficiency of the head-to-head betting market. The NFL Elo model produced by FiveThirtyEight performs similarly when compared to betting data.

Elo ratings are useful when it’s necessary to have an objective ranking system, as in tennis or chess, but if all you want to do is determine the chances of your team winning on the weekend, you might as well look at the bookmaker odds. This is also why footy tipping is largely a game of luck: the most likely results don’t always come to fruition, so there will always be people who perform better (and worse) than expected. Hence, the only winning strategy is to have a fair amount of fortune.