Rating systems are used in wide range of areas. Use in sport is one of them. It has a big market space in the betting industry to predict the winner of a sport
The final aim of this is to find the standing position of the top 8 teams as if the in the AFL 2021 season final has not happened yet.
First, we want to obtain a rating ranking of AFL team up when the regular season of 2021 have finished.
Load necessary packages
library(dplyr)
library(PlayerRatings) #for rating system models
library(ggplot2)
library(caret)
We are going to use the fitzRoy package
#Load the fitzRoy package
library(fitzRoy)
#Use "fetch_player_stats_afltables" function to obtain data from 2021
afl21 <- fetch_player_stats_afltables(season = 2021)
#Check the rounds in each dataset
unique(afl21$Round)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "QF" "EF" "SF" "PF" "GF"
This dataset includes the final tournament.
Because we only want the data to include the regular season, lets delete all the final data of 2021 season
#drop final data
afl21 <- afl21[!(afl21$Round == 'QF'|afl21$Round == 'EF'|afl21$Round == 'SF'| afl21$Round == 'PF'| afl21$Round == 'GF'),]
#Change round to week
afl21 <- afl21 %>%
mutate(Week = ifelse(Round == 1, 1, NA),
Week = ifelse(Round == 2, 2, Week),
Week = ifelse(Round == 3, 3, Week),
Week = ifelse(Round == 4, 4, Week),
Week = ifelse(Round == 5, 5, Week),
Week = ifelse(Round == 6, 6, Week),
Week = ifelse(Round == 7, 7, Week),
Week = ifelse(Round == 8, 8, Week),
Week = ifelse(Round == 9, 9, Week),
Week = ifelse(Round == 10, 10, Week),
Week = ifelse(Round == 11, 11, Week),
Week = ifelse(Round == 12, 12, Week),
Week = ifelse(Round == 13, 13, Week),
Week = ifelse(Round == 14, 14, Week),
Week = ifelse(Round == 15, 15, Week),
Week = ifelse(Round == 16, 16, Week),
Week = ifelse(Round == 17, 17, Week),
Week = ifelse(Round == 18, 18, Week),
Week = ifelse(Round == 19, 19, Week),
Week = ifelse(Round == 20, 20, Week),
Week = ifelse(Round == 21, 21, Week),
Week = ifelse(Round == 22, 22, Week),
Week = ifelse(Round == 23, 23, Week),
)
Now there are some columns in this dataset that we wont be using. Lets delete the unneccesary columns for this study.
We also need to create a new column called “score” that specifies the winner of the match in binary variable.
#select needed columns
afl21 <- select(afl21, "Date", "Week", "Venue", "Home.team", "Away.team",
"Home.score", "Away.score")
# Create new column of score(s)
afl21 <- afl21 %>%
mutate(score = ifelse(Home.score > Away.score, 1, 0),
score = ifelse(Home.score == Away.score, 0.5, score))
head(afl21)
## # A tibble: 6 x 8
## Date Week Venue Home.team Away.team Home.score Away.score score
## <date> <dbl> <chr> <chr> <chr> <int> <int> <dbl>
## 1 2021-03-18 1 M.C.G. Richmond Carlton 105 80 1
## 2 2021-03-18 1 M.C.G. Richmond Carlton 105 80 1
## 3 2021-03-18 1 M.C.G. Richmond Carlton 105 80 1
## 4 2021-03-18 1 M.C.G. Richmond Carlton 105 80 1
## 5 2021-03-18 1 M.C.G. Richmond Carlton 105 80 1
## 6 2021-03-18 1 M.C.G. Richmond Carlton 105 80 1
As you can see above, we have multiple same rows of data set in each column. This is because this data also consisted of each player statistics. Lets change the table so that a match only appears once.
#Make table game unique by "date", "venue" and "week"
afl21 <- unique(afl21, by = "Date"&"Venue"&"Week")
The following code splits the dataset into train and tests
#training /testing----
#set up the sample configuration
smp_size <- floor(0.70 * nrow(afl21))
set.seed(2)
#sample dataset
predRandom <- afl21[sample(nrow(afl21)),]
#alocate 70% as training and 30% as testing dataset
afl_train <- predRandom[1:smp_size,]
afl_test <- predRandom[(smp_size + 1): nrow(predRandom),]
Now we are ready to construct a ranking rating system! But before that, lets calculate the home game advantage score.
In AFL, teams who play as there home game has an advantage of winning. Lets make sure we take that into consideration when constructing a ranking rating system
# Calculate the home game advantage ratio
sum(afl21$Home.score) / sum(afl21$Away.score)
## [1] 1.044907
For obtaining a ranking rating system we are going to use a Elo model. This model implements the elo rating system for estimating the relative skill level of players of two players or teams.
In Elo model, it requires three variables
To construct an accurate rating system we would need to find the optimal values for above variables (excluding the gamma). Lets do some hyper parameter tuning to find this:
# Build a function for Elo model that takes init and kfac as arguments
# Should return a confusion matrix for predicted tips and final Elo rating model
elo_score <- function(init, kfac, data, status = NULL) {
predictweeks <- unique(data$Week)
model_vars <- c('Week', 'Home.team', 'Away.team', 'score')
if(is.null(status)){
predictweeks <- predictweeks[-1]
elomodel <- elo(data[data$Week == 1, model_vars],
init = init,
kfac = kfac,
gamma = 1.23,
history = T)
} else {
elomodel <- status
}
data[c('p_elo')] <- NA
for(w in predictweeks){
pred_week <- subset(data, Week == w)
pred_elo <- predict(elomodel,
newdata = pred_week[model_vars],
gamma = 1.23,
tng = 1, trat = 1500)
data$p_elo[data$Week == w] <- pred_elo
elomodel <- elo(pred_week[model_vars],
status = elomodel$ratings,
init = init,
gamma = 1.23,
kfac = kfac,
history = T)
}
data$tip_elo <- ifelse(data$p_elo > 0.5, 1, 0)
cm_elo <- confusionMatrix(data = factor(data$tip_elo, levels = c(0,0.5,1)),
reference = factor(data$score, levels = c(0,0.5,1)))
return(
list(elomodel,
cm_elo)
)
}
Create a grid parameter values to test
params <- expand.grid(init = seq(1400, 3000, by = 100),
kfac = seq(20,80, by = 5))
For each parameter in “params”, calculate the accuracy on the afl_train data
params$accuracy <- apply(params, 1, function(x)
elo_score(x[1], x[2], afl_train)[[2]]$overall['Accuracy'])
# What was the best combination of variables?
subset(params, accuracy == max(params$accuracy))
## init kfac accuracy
## 1 1400 20 0.6153846
## 2 1500 20 0.6153846
Using the afl_test data, use the best combination parameter variables to get an estimate of expected accuracy
test_score_elo <- elo_score(init = 1400,
kfac = 20,
data = afl_test,
status = elo_score(1400,20, afl_train)[[1]])
This is the output
test_score_elo
## [[1]]
##
## Elo Ratings For 18 Players Playing 196 Games
##
## Player Rating Games Win Draw Loss Lag
## 1 Melbourne 1507 22 18 1 3 4
## 2 Port Adelaide 1482 21 16 0 5 1
## 3 Geelong 1473 21 15 0 6 6
## 4 Sydney 1468 22 16 0 6 3
## 5 Western Bulldogs 1460 21 14 0 7 5
## 6 Brisbane Lions 1451 23 15 0 8 4
## 7 Greater Western Sydney 1400 22 10 1 11 8
## 8 St Kilda 1400 22 11 0 11 3
## 9 West Coast 1398 22 11 0 11 0
## 10 Essendon 1394 22 10 0 12 0
## 11 Richmond 1380 23 10 1 12 0
## 12 Hawthorn 1374 22 8 2 12 1
## 13 Fremantle 1368 22 9 0 13 5
## 14 Carlton 1347 22 7 0 15 4
## 15 Collingwood 1338 21 6 0 15 2
## 16 Adelaide 1332 21 6 0 15 0
## 17 Gold Coast 1326 22 7 0 15 8
## 18 North Melbourne 1303 21 4 1 16 2
##
##
## [[2]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 0.5 1
## 0 17 0 8
## 0.5 0 0 0
## 1 12 0 23
##
## Overall Statistics
##
## Accuracy : 0.6667
## 95% CI : (0.5331, 0.7831)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.01343
##
## Kappa : 0.3296
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 0.5 Class: 1
## Sensitivity 0.5862 NA 0.7419
## Specificity 0.7419 1 0.5862
## Pos Pred Value 0.6800 NA 0.6571
## Neg Pred Value 0.6571 NA 0.6800
## Prevalence 0.4833 0 0.5167
## Detection Rate 0.2833 0 0.3833
## Detection Prevalence 0.4167 0 0.5833
## Balanced Accuracy 0.6641 NA 0.6641
Elo rating system is done! Now lets find the accuracy of the model
#Rating system test
#list all weeks in data set
testpredictweek <- unique(afl_test$Week)
#remove first week (use it for initial model)
testpredictweek <- testpredictweek[-1]
#Create vector containing columns passed the rating model function
model_vars <- c('Week', 'Home.team', 'Away.team', 'score')
# BUILD 2 models(elo,steph) using first week ----
#Elo ----
testelomodel <- elo(afl_test[afl_test$Week == 1, model_vars],
init = 1400,
kfac = 20,
gamma = 1.04,
history = T)
print(testelomodel)
##
## Elo Ratings For 6 Players Playing 3 Games
##
## Player Rating Games Win Draw Loss Lag
## 1 Port Adelaide 1410 1 1 0 0 0
## 2 Western Bulldogs 1410 1 1 0 0 0
## 3 Adelaide 1410 1 1 0 0 0
## 4 Geelong 1390 1 0 0 1 0
## 5 Collingwood 1390 1 0 0 1 0
## 6 North Melbourne 1390 1 0 0 1 0
#initialise new column in afl_tbl for prob estimates
afl_test[c('p_elo')] <- NA
#Loop through weeks where we want to generate predictions ----
for (w in testpredictweek) {
#identify the round of games we are trying to predict
testpred_week <- subset(afl_test, week = w)
#Make predictions (set tng = 1 and trat = 1600, deviation 300)
testpred_elo <- predict(testelomodel,
newdata = testpred_week[model_vars],
gamma = 1.07,
tng = 1, trat = 1500)
#Add the predicted probabilities to the test_data dataframe
afl_test$p_elo[afl_test$Week == w] <- testpred_elo
#Update the model using status option
testelomodel <- elo(testpred_week[model_vars],
status = testelomodel$ratings,
init = 1400,
gamma = 1.04,
kfac = 20,
history = T)
}
#tipping the winner of matches
afl_test$tip_elo <- ifelse(afl_test$p_elo > 0.5, 1, 0)
#Build the confusion matrix ----
testcm_elo <- confusionMatrix(data = factor(afl_test$tip_elo, levels = c(0,0.5,1)),
reference = factor(afl_test$score, levels = c(0,0.5,1)))
testcm_elo
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 0.5 1
## 0 17 0 19
## 0.5 0 0 0
## 1 11 0 8
##
## Overall Statistics
##
## Accuracy : 0.4545
## 95% CI : (0.3197, 0.5945)
## No Information Rate : 0.5091
## P-Value [Acc > NIR] : 0.8274
##
## Kappa : -0.0971
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 0.5 Class: 1
## Sensitivity 0.6071 NA 0.2963
## Specificity 0.2963 1 0.6071
## Pos Pred Value 0.4722 NA 0.4211
## Neg Pred Value 0.4211 NA 0.4722
## Prevalence 0.5091 0 0.4909
## Detection Rate 0.3091 0 0.1455
## Detection Prevalence 0.6545 0 0.3455
## Balanced Accuracy 0.4517 NA 0.4517
##compare model accuracies
table1 <- data.frame(testcm_elo$overall) %>%
round(3) %>%
print()
## testcm_elo.overall
## Accuracy 0.455
## Kappa -0.097
## AccuracyLower 0.320
## AccuracyUpper 0.594
## AccuracyNull 0.509
## AccuracyPValue 0.827
## McnemarPValue NaN
The accuracy of this model was 0.45.
Now moving on to simulations…..
Lets predict the standing position of finals using the actual top 8 teams
We first have to prepare the data set from the p_elo column of afl_test data set
#ELO MODEL
afl_test$p_elo_away <- (1 - afl_test$p_elo)
#assign unique letter to each top 8 teams
A <- "Melbourne"
B <- "Brisbane Lions"
C <- "Western Bulldogs"
D <- "Essendon"
E <- "Sydney"
G <- "Greater Western Sydney"
H <- "Port Adelaide"
I <- "Geelong"
#Team A - Melbourne ----
A.df <- afl_test %>%
filter(Home.team == A | Away.team == A) %>%
arrange(Date)
A.df$p_elo <- ifelse(A.df$Home.team == A, A.df$p_elo, NA)
A.df$p_elo_away <- ifelse(A.df$Away.team == A, A.df$p_elo_away, NA)
A.df$p_elo <- ifelse(A.df$p_elo %in% NA, A.df$p_elo_away, A.df$p_elo)
A.df$Team <- A
A.df <- A.df %>%
select(Date, Week, Team, p_elo) %>%
arrange(Date) %>%
filter(Week != "1")
#SD mean
A_team_sd_elo <- sd(A.df$p_elo)
A_team_mean_elo <- mean(A.df$p_elo)
# Team B- Brisbane Lions ----
B.df <- afl_test %>%
filter(Home.team == B | Away.team == B) %>%
arrange(Date)
B.df$p_elo <- ifelse(B.df$Home.team == B, B.df$p_elo, NA)
B.df$p_elo_away <- ifelse(B.df$Away.team == B, B.df$p_elo_away, NA)
B.df$p_elo <- ifelse(B.df$p_elo %in% NA, B.df$p_elo_away, B.df$p_elo)
B.df$Team <- B
B.df <- B.df %>%
select(Date, Week, Team, p_elo) %>%
arrange(Date) %>%
filter(Week != "1")
#SD mean
B_team_sd_elo <- sd(B.df$p_elo)
B_team_mean_elo <- mean(B.df$p_elo)
# Team C- Western Bulldogs ----
C.df <- afl_test %>%
filter(Home.team == C | Away.team == C) %>%
arrange(Date)
C.df$p_elo <- ifelse(C.df$Home.team == C, C.df$p_elo, NA)
C.df$p_elo_away <- ifelse(C.df$Away.team == C, C.df$p_elo_away, NA)
C.df$p_elo <- ifelse(C.df$p_elo %in% NA, C.df$p_elo_away, C.df$p_elo)
C.df$Team <- C
C.df <- C.df %>%
select(Date, Week, Team, p_elo) %>%
arrange(Date) %>%
filter(Week != "1")
#SD mean
C_team_sd_elo <- sd(C.df$p_elo)
C_team_mean_elo <- mean(C.df$p_elo)
# Team D- Essendon ----
D.df <- afl_test %>%
filter(Home.team == D | Away.team == D) %>%
arrange(Date)
D.df$p_elo <- ifelse(D.df$Home.team == D, D.df$p_elo, NA)
D.df$p_elo_away <- ifelse(D.df$Away.team == D, D.df$p_elo_away, NA)
D.df$p_elo <- ifelse(D.df$p_elo %in% NA, D.df$p_elo_away, D.df$p_elo)
D.df$Team <- D
D.df <- D.df %>%
select(Date, Week, Team, p_elo) %>%
arrange(Date) %>%
filter(Week != "1")
#SD mean
D_team_sd_elo <- sd(D.df$p_elo)
D_team_mean_elo <- mean(D.df$p_elo)
# Team E- Sydney ----
E.df <- afl_test %>%
filter(Home.team == E | Away.team == E) %>%
arrange(Date)
E.df$p_elo <- ifelse(E.df$Home.team == E, E.df$p_elo, NA)
E.df$p_elo_away <- ifelse(E.df$Away.team == E, E.df$p_elo_away, NA)
E.df$p_elo <- ifelse(E.df$p_elo %in% NA, E.df$p_elo_away, E.df$p_elo)
E.df$Team <- E
E.df <- E.df %>%
select(Date, Week, Team, p_elo) %>%
arrange(Date) %>%
filter(Week != "1")
#sd mean
E_team_sd_elo <- sd(E.df$p_elo)
E_team_mean_elo <- mean(E.df$p_elo)
# Team G- GWS ----
G.df <- afl_test %>%
filter(Home.team == G | Away.team == G) %>%
arrange(Date)
G.df$p_elo <- ifelse(G.df$Home.team == G, G.df$p_elo, NA)
G.df$p_elo_away <- ifelse(G.df$Away.team == G, G.df$p_elo_away, NA)
G.df$p_elo <- ifelse(G.df$p_elo %in% NA, G.df$p_elo_away, G.df$p_elo)
G.df$Team <- G
G.df <- G.df %>%
select(Date, Week, Team, p_elo) %>%
arrange(Date) %>%
filter(Week != "1")
#SD
G_team_sd_elo <- sd(G.df$p_elo)
G_team_mean_elo <- mean(G.df$p_elo)
# Team H- Port Adelaide ----
H.df <- afl_test %>%
filter(Home.team == H | Away.team == H) %>%
arrange(Date)
H.df$p_elo <- ifelse(H.df$Home.team == H, H.df$p_elo, NA)
H.df$p_elo_away <- ifelse(H.df$Away.team == H, H.df$p_elo_away, NA)
H.df$p_elo <- ifelse(H.df$p_elo %in% NA, H.df$p_elo_away, H.df$p_elo)
H.df$Team <- H
H.df <- H.df %>%
select(Date, Week, Team, p_elo) %>%
arrange(Date) %>%
filter(Week != "1")
#SD
H_team_sd_elo <- sd(H.df$p_elo)
H_team_mean_elo <- mean(H.df$p_elo)
# Team I- Geelong ----
I.df <- afl_test %>%
filter(Home.team == I | Away.team == I) %>%
arrange(Date)
I.df$p_elo <- ifelse(I.df$Home.team == I, I.df$p_elo, NA)
I.df$p_elo_away <- ifelse(I.df$Away.team == I, I.df$p_elo_away, NA)
I.df$p_elo <- ifelse(I.df$p_elo %in% NA, I.df$p_elo_away, I.df$p_elo)
I.df$Team <- I
I.df <- I.df %>%
select(Date, Week, Team, p_elo) %>%
arrange(Date) %>%
filter(Week != "1")
#SD
I_team_sd_elo <- sd(I.df$p_elo)
I_team_mean_elo <- mean(I.df$p_elo)
The following code calculates the simulation.
This time, the simulation will be run 1000 times. Feel free to change the simulation number!
#Simulation of all individual team
A_team_sim_elo <- rnorm(1000, A_team_mean_elo, A_team_sd_elo)
B_team_sim_elo <- rnorm(1000, B_team_mean_elo, B_team_sd_elo)
C_team_sim_elo <- rnorm(1000, C_team_mean_elo, C_team_sd_elo)
D_team_sim_elo <- rnorm(1000, D_team_mean_elo, D_team_sd_elo)
E_team_sim_elo <- rnorm(1000, E_team_mean_elo, E_team_sd_elo)
G_team_sim_elo <- rnorm(1000, G_team_mean_elo, G_team_sd_elo)
H_team_sim_elo <- rnorm(1000, H_team_mean_elo, H_team_sd_elo)
I_team_sim_elo <- rnorm(1000, I_team_mean_elo, I_team_sd_elo)
# New data frame with all simulation
simulation_df_elo <- data.frame(A_team_sim_elo, B_team_sim_elo, C_team_sim_elo,
D_team_sim_elo, E_team_sim_elo, G_team_sim_elo,
H_team_sim_elo, I_team_sim_elo)
#Change column names back to team names
simulation_df_elo <- setNames(simulation_df_elo,
c('Melbourne', "Brisbane Lions", "Western Bulldogs",
"Essendon", "Sydney", "Greater Western Sydney",
"Port Adelaide", "Geelong"))
Using the created dataset, our aim is to create a confusion matrix plot which shows the percentage of each standings
But first the data frame needs to be fixed and ranking column has to be added. Run the following code:
#remember names
simulation_df_elo$sim_round <- c(1:1000)
n <- simulation_df_elo$sim_round
#transpose all but the last column
simulation_df_elo <- as.data.frame(t(simulation_df_elo[,-9]))
colnames(simulation_df_elo) <- n
#unlist the column
ranking_df_elo <- data.frame(prob = unlist(simulation_df_elo))
ranking_df_elo$team <- c('Melbourne', "Brisbane Lions", "Western Bulldogs",
"Essendon", "Sydney", "Greater Western Sydney", "Port
Adelaide", "Geelong")
#relocate
ranking_df_elo <- ranking_df_elo %>%
relocate(team, .before = prob)
#round no.
x = 1:1000
ranking_df_elo$round <- rep(x, each = 8)
#relocate
ranking_df_elo <- ranking_df_elo %>%
relocate(round, .after = team)
#add ranking column - rank by each simulation round
ranking_df_elo <- ranking_df_elo %>%
group_by(round) %>%
mutate(ranking = rank(prob))
#convert to the ranking column to percentage
percent_elo <- round(table(ranking_df_elo$team, ranking_df_elo$ranking)/ 1000 *100, 1)
percent_elo <- as.data.frame(percent_elo)
percent_elo <- setNames(percent_elo, c('team','standings', 'probability'))
Plot the confusion matrix
#create confusion matrix
ggplot(data = percent_elo, mapping = aes(x = standings, y = team)) +
geom_tile(aes(fill = probability), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", probability)), vjust = 1) +
scale_fill_gradient(low = "white", high = "Red") +
ggtitle("Elo model standings")