SPE5AMS Portfolio 2: Team standing probability using sport rating system

Rating systems are used in wide range of areas. Use in sport is one of them. It has a big market space in the betting industry to predict the winner of a sport

The final aim of this is to find the standing position of the top 8 teams as if the in the AFL 2021 season final has not happened yet.

Sport Rating System

First, we want to obtain a rating ranking of AFL team up when the regular season of 2021 have finished.

Load necessary packages

library(dplyr)
library(PlayerRatings) #for rating system models
library(ggplot2)
library(caret)

1.0 Download data set

We are going to use the fitzRoy package

#Load the fitzRoy package 
library(fitzRoy)

#Use "fetch_player_stats_afltables" function to obtain data from 2021
afl21 <- fetch_player_stats_afltables(season = 2021)

#Check the rounds in each dataset 
unique(afl21$Round)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "QF" "EF" "SF" "PF" "GF"

This dataset includes the final tournament.

Because we only want the data to include the regular season, lets delete all the final data of 2021 season

#drop final data 
afl21 <- afl21[!(afl21$Round == 'QF'|afl21$Round == 'EF'|afl21$Round == 'SF'| afl21$Round == 'PF'| afl21$Round == 'GF'),]

#Change round to week 
afl21 <- afl21 %>%
  mutate(Week = ifelse(Round == 1, 1, NA), 
         Week = ifelse(Round == 2, 2, Week), 
         Week = ifelse(Round == 3, 3, Week),
         Week = ifelse(Round == 4, 4, Week),
         Week = ifelse(Round == 5, 5, Week),
         Week = ifelse(Round == 6, 6, Week),
         Week = ifelse(Round == 7, 7, Week),
         Week = ifelse(Round == 8, 8, Week),
         Week = ifelse(Round == 9, 9, Week),
         Week = ifelse(Round == 10, 10, Week),
         Week = ifelse(Round == 11, 11, Week),
         Week = ifelse(Round == 12, 12, Week),
         Week = ifelse(Round == 13, 13, Week),
         Week = ifelse(Round == 14, 14, Week),
         Week = ifelse(Round == 15, 15, Week),
         Week = ifelse(Round == 16, 16, Week),
         Week = ifelse(Round == 17, 17, Week),
         Week = ifelse(Round == 18, 18, Week),
         Week = ifelse(Round == 19, 19, Week),
         Week = ifelse(Round == 20, 20, Week),
         Week = ifelse(Round == 21, 21, Week),
         Week = ifelse(Round == 22, 22, Week),
         Week = ifelse(Round == 23, 23, Week),
  )

Now there are some columns in this dataset that we wont be using. Lets delete the unneccesary columns for this study.

We also need to create a new column called “score” that specifies the winner of the match in binary variable.

Home.team score > Away.team score : 1 (win)
Home.team score < Away.team score : 0 (loss)
Home.team score = Away.team score : 0.5 (draw)

#select needed columns
afl21 <- select(afl21, "Date", "Week", "Venue", "Home.team", "Away.team", 
                  "Home.score", "Away.score")

# Create new column of score(s)
afl21 <- afl21 %>%
  mutate(score = ifelse(Home.score > Away.score, 1, 0),
         score = ifelse(Home.score == Away.score, 0.5, score))

head(afl21)

## # A tibble: 6 x 8
##   Date        Week Venue  Home.team Away.team Home.score Away.score score
##   <date>     <dbl> <chr>  <chr>     <chr>          <int>      <int> <dbl>
## 1 2021-03-18     1 M.C.G. Richmond  Carlton          105         80     1
## 2 2021-03-18     1 M.C.G. Richmond  Carlton          105         80     1
## 3 2021-03-18     1 M.C.G. Richmond  Carlton          105         80     1
## 4 2021-03-18     1 M.C.G. Richmond  Carlton          105         80     1
## 5 2021-03-18     1 M.C.G. Richmond  Carlton          105         80     1
## 6 2021-03-18     1 M.C.G. Richmond  Carlton          105         80     1

As you can see above, we have multiple same rows of data set in each column. This is because this data also consisted of each player statistics. Lets change the table so that a match only appears once.

#Make table game unique by "date", "venue" and "week"
afl21 <- unique(afl21, by = "Date"&"Venue"&"Week")

The following code splits the dataset into train and tests

#training /testing----
#set up the sample configuration 
smp_size <- floor(0.70 * nrow(afl21))
set.seed(2)

#sample dataset
predRandom <- afl21[sample(nrow(afl21)),]

#alocate 70% as training and 30% as testing dataset
afl_train <- predRandom[1:smp_size,]
afl_test <- predRandom[(smp_size + 1): nrow(predRandom),]

Now we are ready to construct a ranking rating system! But before that, lets calculate the home game advantage score.

In AFL, teams who play as there home game has an advantage of winning. Lets make sure we take that into consideration when constructing a ranking rating system

# Calculate the home game advantage ratio
sum(afl21$Home.score) / sum(afl21$Away.score)

## [1] 1.044907

2.0 Construct an ELO model

For obtaining a ranking rating system we are going to use a Elo model. This model implements the elo rating system for estimating the relative skill level of players of two players or teams.

In Elo model, it requires three variables

init: initial value
kfac: the k factor parameter
gamma: will be the home game advantage for this study

To construct an accurate rating system we would need to find the optimal values for above variables (excluding the gamma). Lets do some hyper parameter tuning to find this:

# Build a function for Elo model that takes init and kfac as arguments 
# Should return a confusion matrix for predicted tips and final Elo rating model
elo_score <- function(init, kfac, data, status = NULL) {
  predictweeks <- unique(data$Week)
  model_vars <- c('Week', 'Home.team', 'Away.team', 'score')
  
  if(is.null(status)){
    
    predictweeks <- predictweeks[-1]
    
    elomodel <- elo(data[data$Week == 1, model_vars], 
                    init = init, 
                    kfac = kfac, 
                    gamma = 1.23,
                    history = T)
  } else {
    
    elomodel <- status
    
  }
  
  data[c('p_elo')] <- NA
  
  for(w in predictweeks){
    pred_week <- subset(data, Week == w)
    
    pred_elo <- predict(elomodel, 
                        newdata = pred_week[model_vars],
                        gamma = 1.23,
                        tng = 1, trat = 1500)
    
    data$p_elo[data$Week == w] <- pred_elo
    
    elomodel <- elo(pred_week[model_vars],
                    status = elomodel$ratings,
                    init = init,
                    gamma = 1.23, 
                    kfac = kfac, 
                    history = T)
  } 
  
  data$tip_elo <- ifelse(data$p_elo > 0.5, 1, 0)
  
  cm_elo <- confusionMatrix(data = factor(data$tip_elo, levels = c(0,0.5,1)), 
                            reference = factor(data$score, levels = c(0,0.5,1)))
  return(
    list(elomodel,
         cm_elo)
  )
}

Create a grid parameter values to test

params <- expand.grid(init = seq(1400, 3000, by = 100), 
                      kfac = seq(20,80, by = 5))

For each parameter in “params”, calculate the accuracy on the afl_train data

params$accuracy <- apply(params, 1, function(x)
  elo_score(x[1], x[2], afl_train)[[2]]$overall['Accuracy'])

# What was the best combination of variables?
subset(params, accuracy == max(params$accuracy))

##   init kfac  accuracy
## 1 1400   20 0.6153846
## 2 1500   20 0.6153846

Using the afl_test data, use the best combination parameter variables to get an estimate of expected accuracy

test_score_elo <- elo_score(init = 1400, 
                            kfac = 20, 
                            data = afl_test,
                            status = elo_score(1400,20, afl_train)[[1]])

This is the output

test_score_elo

## [[1]]
## 
## Elo Ratings For 18 Players Playing 196 Games
## 
##                    Player Rating Games Win Draw Loss Lag
## 1               Melbourne   1507    22  18    1    3   4
## 2           Port Adelaide   1482    21  16    0    5   1
## 3                 Geelong   1473    21  15    0    6   6
## 4                  Sydney   1468    22  16    0    6   3
## 5        Western Bulldogs   1460    21  14    0    7   5
## 6          Brisbane Lions   1451    23  15    0    8   4
## 7  Greater Western Sydney   1400    22  10    1   11   8
## 8                St Kilda   1400    22  11    0   11   3
## 9              West Coast   1398    22  11    0   11   0
## 10               Essendon   1394    22  10    0   12   0
## 11               Richmond   1380    23  10    1   12   0
## 12               Hawthorn   1374    22   8    2   12   1
## 13              Fremantle   1368    22   9    0   13   5
## 14                Carlton   1347    22   7    0   15   4
## 15            Collingwood   1338    21   6    0   15   2
## 16               Adelaide   1332    21   6    0   15   0
## 17             Gold Coast   1326    22   7    0   15   8
## 18        North Melbourne   1303    21   4    1   16   2
## 
## 
## [[2]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0 0.5  1
##        0   17   0  8
##        0.5  0   0  0
##        1   12   0 23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.5331, 0.7831)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.01343         
##                                           
##                   Kappa : 0.3296          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 0.5 Class: 1
## Sensitivity            0.5862         NA   0.7419
## Specificity            0.7419          1   0.5862
## Pos Pred Value         0.6800         NA   0.6571
## Neg Pred Value         0.6571         NA   0.6800
## Prevalence             0.4833          0   0.5167
## Detection Rate         0.2833          0   0.3833
## Detection Prevalence   0.4167          0   0.5833
## Balanced Accuracy      0.6641         NA   0.6641

Elo rating system is done! Now lets find the accuracy of the model

#Rating system test

#list all weeks in data set
testpredictweek <- unique(afl_test$Week)
#remove first week (use it for initial model)
testpredictweek <- testpredictweek[-1]

#Create vector containing columns passed the rating model function 
model_vars <- c('Week', 'Home.team', 'Away.team', 'score')

# BUILD 2 models(elo,steph) using first week ---- 
#Elo ----
testelomodel <- elo(afl_test[afl_test$Week == 1, model_vars], 
                     init = 1400, 
                     kfac = 20, 
                     gamma = 1.04, 
                     history = T)
print(testelomodel)

## 
## Elo Ratings For 6 Players Playing 3 Games
## 
##             Player Rating Games Win Draw Loss Lag
## 1    Port Adelaide   1410     1   1    0    0   0
## 2 Western Bulldogs   1410     1   1    0    0   0
## 3         Adelaide   1410     1   1    0    0   0
## 4          Geelong   1390     1   0    0    1   0
## 5      Collingwood   1390     1   0    0    1   0
## 6  North Melbourne   1390     1   0    0    1   0

#initialise new column in afl_tbl for prob estimates
afl_test[c('p_elo')] <- NA

#Loop through weeks where we want to generate predictions ----
for (w in testpredictweek) {
  
  #identify the round of games we are trying to predict
  testpred_week <- subset(afl_test, week = w)
  
  #Make predictions (set tng = 1 and trat = 1600, deviation 300)
  testpred_elo <- predict(testelomodel, 
                           newdata = testpred_week[model_vars], 
                           gamma = 1.07, 
                           tng = 1, trat = 1500)
  
  #Add the predicted probabilities to the test_data dataframe
  afl_test$p_elo[afl_test$Week == w] <- testpred_elo

  #Update the model using status option 
  testelomodel <- elo(testpred_week[model_vars],
                       status = testelomodel$ratings,
                       init = 1400, 
                       gamma = 1.04,
                       kfac = 20,
                       history = T)
}

#tipping the winner of matches
afl_test$tip_elo <- ifelse(afl_test$p_elo > 0.5, 1, 0)

#Build the confusion matrix ----
testcm_elo <- confusionMatrix(data = factor(afl_test$tip_elo, levels = c(0,0.5,1)),
                               reference = factor(afl_test$score, levels = c(0,0.5,1)))
testcm_elo

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0 0.5  1
##        0   17   0 19
##        0.5  0   0  0
##        1   11   0  8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4545          
##                  95% CI : (0.3197, 0.5945)
##     No Information Rate : 0.5091          
##     P-Value [Acc > NIR] : 0.8274          
##                                           
##                   Kappa : -0.0971         
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 0.5 Class: 1
## Sensitivity            0.6071         NA   0.2963
## Specificity            0.2963          1   0.6071
## Pos Pred Value         0.4722         NA   0.4211
## Neg Pred Value         0.4211         NA   0.4722
## Prevalence             0.5091          0   0.4909
## Detection Rate         0.3091          0   0.1455
## Detection Prevalence   0.6545          0   0.3455
## Balanced Accuracy      0.4517         NA   0.4517

##compare model accuracies 
table1 <- data.frame(testcm_elo$overall) %>%
  round(3) %>%
  print()

##                testcm_elo.overall
## Accuracy                    0.455
## Kappa                      -0.097
## AccuracyLower               0.320
## AccuracyUpper               0.594
## AccuracyNull                0.509
## AccuracyPValue              0.827
## McnemarPValue                 NaN

The accuracy of this model was 0.45.

Now moving on to simulations…..

Simulation

Lets predict the standing position of finals using the actual top 8 teams

1.0 Data preparation

We first have to prepare the data set from the p_elo column of afl_test data set

#ELO MODEL 
afl_test$p_elo_away <- (1 - afl_test$p_elo)

#assign unique letter to each top 8 teams 
A <- "Melbourne"
B <- "Brisbane Lions"
C <- "Western Bulldogs"
D <- "Essendon"
E <- "Sydney"
G <- "Greater Western Sydney"
H <- "Port Adelaide"
I <- "Geelong"

#Team A - Melbourne ----
A.df <- afl_test %>%
  filter(Home.team == A | Away.team == A) %>%
  arrange(Date)

A.df$p_elo <- ifelse(A.df$Home.team == A, A.df$p_elo, NA)
A.df$p_elo_away <- ifelse(A.df$Away.team == A, A.df$p_elo_away, NA)
A.df$p_elo <- ifelse(A.df$p_elo %in% NA, A.df$p_elo_away, A.df$p_elo)
A.df$Team <- A

A.df <- A.df %>%
  select(Date, Week, Team, p_elo) %>%
  arrange(Date) %>%
  filter(Week != "1")

#SD mean
A_team_sd_elo <- sd(A.df$p_elo)
A_team_mean_elo <- mean(A.df$p_elo)

# Team B- Brisbane Lions ----
B.df <- afl_test %>%
  filter(Home.team == B | Away.team == B) %>%
  arrange(Date)

B.df$p_elo <- ifelse(B.df$Home.team == B, B.df$p_elo, NA)
B.df$p_elo_away <- ifelse(B.df$Away.team == B, B.df$p_elo_away, NA)
B.df$p_elo <- ifelse(B.df$p_elo %in% NA, B.df$p_elo_away, B.df$p_elo)
B.df$Team <- B

B.df <- B.df %>%
  select(Date, Week, Team, p_elo) %>%
  arrange(Date) %>%
  filter(Week != "1")

#SD mean
B_team_sd_elo <- sd(B.df$p_elo)
B_team_mean_elo <- mean(B.df$p_elo)

# Team C- Western Bulldogs ----
C.df <- afl_test %>%
  filter(Home.team == C | Away.team == C) %>%
  arrange(Date)

C.df$p_elo <- ifelse(C.df$Home.team == C, C.df$p_elo, NA)
C.df$p_elo_away <- ifelse(C.df$Away.team == C, C.df$p_elo_away, NA)
C.df$p_elo <- ifelse(C.df$p_elo %in% NA, C.df$p_elo_away, C.df$p_elo)
C.df$Team <- C

C.df <- C.df %>%
  select(Date, Week, Team, p_elo) %>%
  arrange(Date) %>%
  filter(Week != "1")

#SD mean
C_team_sd_elo <- sd(C.df$p_elo)
C_team_mean_elo <- mean(C.df$p_elo)

# Team D- Essendon ----
D.df <- afl_test %>%
  filter(Home.team == D | Away.team == D) %>%
  arrange(Date)

D.df$p_elo <- ifelse(D.df$Home.team == D, D.df$p_elo, NA)
D.df$p_elo_away <- ifelse(D.df$Away.team == D, D.df$p_elo_away, NA)
D.df$p_elo <- ifelse(D.df$p_elo %in% NA, D.df$p_elo_away, D.df$p_elo)
D.df$Team <- D

D.df <- D.df %>%
  select(Date, Week, Team, p_elo) %>%
  arrange(Date) %>%
  filter(Week != "1")

#SD mean
D_team_sd_elo <- sd(D.df$p_elo)
D_team_mean_elo <- mean(D.df$p_elo)

# Team E- Sydney ----
E.df <- afl_test %>%
  filter(Home.team == E | Away.team == E) %>%
  arrange(Date)

E.df$p_elo <- ifelse(E.df$Home.team == E, E.df$p_elo, NA)
E.df$p_elo_away <- ifelse(E.df$Away.team == E, E.df$p_elo_away, NA)
E.df$p_elo <- ifelse(E.df$p_elo %in% NA, E.df$p_elo_away, E.df$p_elo)
E.df$Team <- E

E.df <- E.df %>%
  select(Date, Week, Team, p_elo) %>%
  arrange(Date) %>%
  filter(Week != "1")

#sd mean
E_team_sd_elo <- sd(E.df$p_elo)
E_team_mean_elo <- mean(E.df$p_elo)

# Team G- GWS ----
G.df <- afl_test %>%
  filter(Home.team == G | Away.team == G) %>%
  arrange(Date)

G.df$p_elo <- ifelse(G.df$Home.team == G, G.df$p_elo, NA)
G.df$p_elo_away <- ifelse(G.df$Away.team == G, G.df$p_elo_away, NA)
G.df$p_elo <- ifelse(G.df$p_elo %in% NA, G.df$p_elo_away, G.df$p_elo)
G.df$Team <- G

G.df <- G.df %>%
  select(Date, Week, Team, p_elo) %>%
  arrange(Date) %>%
  filter(Week != "1")

#SD
G_team_sd_elo <- sd(G.df$p_elo)
G_team_mean_elo <- mean(G.df$p_elo)

# Team H- Port Adelaide ----
H.df <- afl_test %>%
  filter(Home.team == H | Away.team == H) %>%
  arrange(Date)

H.df$p_elo <- ifelse(H.df$Home.team == H, H.df$p_elo, NA)
H.df$p_elo_away <- ifelse(H.df$Away.team == H, H.df$p_elo_away, NA)
H.df$p_elo <- ifelse(H.df$p_elo %in% NA, H.df$p_elo_away, H.df$p_elo)
H.df$Team <- H

H.df <- H.df %>%
  select(Date, Week, Team, p_elo) %>%
  arrange(Date) %>%
  filter(Week != "1")

#SD
H_team_sd_elo <- sd(H.df$p_elo)
H_team_mean_elo <- mean(H.df$p_elo)

# Team I- Geelong ----
I.df <- afl_test %>%
  filter(Home.team == I | Away.team == I) %>%
  arrange(Date)

I.df$p_elo <- ifelse(I.df$Home.team == I, I.df$p_elo, NA)
I.df$p_elo_away <- ifelse(I.df$Away.team == I, I.df$p_elo_away, NA)
I.df$p_elo <- ifelse(I.df$p_elo %in% NA, I.df$p_elo_away, I.df$p_elo)
I.df$Team <- I

I.df <- I.df %>%
  select(Date, Week, Team, p_elo) %>%
  arrange(Date) %>%
  filter(Week != "1")

#SD
I_team_sd_elo <- sd(I.df$p_elo)
I_team_mean_elo <- mean(I.df$p_elo)

2.0 Team simulation

The following code calculates the simulation.

This time, the simulation will be run 1000 times. Feel free to change the simulation number!

#Simulation of all individual team 
A_team_sim_elo <- rnorm(1000, A_team_mean_elo, A_team_sd_elo)
B_team_sim_elo <- rnorm(1000, B_team_mean_elo, B_team_sd_elo)
C_team_sim_elo <- rnorm(1000, C_team_mean_elo, C_team_sd_elo)
D_team_sim_elo <- rnorm(1000, D_team_mean_elo, D_team_sd_elo)
E_team_sim_elo <- rnorm(1000, E_team_mean_elo, E_team_sd_elo)
G_team_sim_elo <- rnorm(1000, G_team_mean_elo, G_team_sd_elo)
H_team_sim_elo <- rnorm(1000, H_team_mean_elo, H_team_sd_elo)
I_team_sim_elo <- rnorm(1000, I_team_mean_elo, I_team_sd_elo)

# New data frame with all simulation 
simulation_df_elo <- data.frame(A_team_sim_elo, B_team_sim_elo, C_team_sim_elo, 
                                  D_team_sim_elo, E_team_sim_elo, G_team_sim_elo, 
                                  H_team_sim_elo, I_team_sim_elo)

#Change column names back to team names 
simulation_df_elo <- setNames(simulation_df_elo, 
                                c('Melbourne', "Brisbane Lions", "Western Bulldogs",
                                  "Essendon", "Sydney", "Greater Western Sydney",
                                  "Port Adelaide", "Geelong"))

Using the created dataset, our aim is to create a confusion matrix plot which shows the percentage of each standings

But first the data frame needs to be fixed and ranking column has to be added. Run the following code:

#remember names 
simulation_df_elo$sim_round <- c(1:1000)
n <- simulation_df_elo$sim_round
  
#transpose all but the last column 
simulation_df_elo <- as.data.frame(t(simulation_df_elo[,-9]))
colnames(simulation_df_elo) <- n
  
#unlist the column 
ranking_df_elo <- data.frame(prob = unlist(simulation_df_elo))
ranking_df_elo$team <- c('Melbourne', "Brisbane Lions", "Western Bulldogs",
                         "Essendon", "Sydney", "Greater Western Sydney", "Port
                         Adelaide", "Geelong")
  
#relocate
ranking_df_elo <- ranking_df_elo %>%
  relocate(team, .before = prob)
  
#round no. 
x = 1:1000
ranking_df_elo$round <- rep(x, each = 8)
  
#relocate
ranking_df_elo <- ranking_df_elo %>%
  relocate(round, .after = team)
  
#add ranking column - rank by each simulation round 
ranking_df_elo <- ranking_df_elo %>%
  group_by(round) %>%
  mutate(ranking = rank(prob))
  
#convert to the ranking column to percentage 
percent_elo <- round(table(ranking_df_elo$team, ranking_df_elo$ranking)/ 1000 *100, 1)
percent_elo <- as.data.frame(percent_elo)
percent_elo <- setNames(percent_elo, c('team','standings', 'probability'))

3.0 Plotting confusion matrix

Plot the confusion matrix

#create confusion matrix
ggplot(data = percent_elo, mapping = aes(x = standings, y = team)) +
  geom_tile(aes(fill = probability), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", probability)), vjust = 1) + 
  scale_fill_gradient(low = "white", high = "Red") + 
  ggtitle("Elo model standings")