Loading and preparing our data set

The goal of this project is to predict the Elo score of a rugby union team. My previous work explains the data transformation and the packages used to create the Elo score and the final score (see https://rpubs.com/Patault_M/959663 for more information).

In this paper, one team (Australia) is selected and nine years of data are kept.

# Loading the data set
RU_data <- read.csv("results.csv")

# Loading the package
library(tidyverse)

# Transform the date column
RU_data$date <- as.Date(RU_data$date)

# Remove the data from before 2013
RU_data <- RU_data%>%
  filter(date > '2013-01-01')

# Adding the extra column
RU_data <- RU_data%>%
  mutate(H_Score = ifelse(home_score > away_score, 1,
                          ifelse(home_score < away_score , 0 ,0.5)))

# Loading the `elo` package, which provides elo.run()
library(elo)

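# Creating the new data frame of match-by-match Elo ratings
# NOTE: the elo package documents this argument as `initial.elos`; spelled
# `initial_elos`, it may be ignored so that ratings start at the default 1500
elo <- elo::elo.run(formula = H_Score ~ home_team + away_team,
                    initial_elos = 2500,
                    k = 50,
                    data = RU_data) %>%
  as.data.frame()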

# Turning the row numbers into an identifier column
elo <- rownames_to_column(elo)

RU_data <- rownames_to_column(RU_data)
  
# Keep only the matches involving Australia
Home <- elo%>%
  filter(team.A == "Australia")

Away <- elo%>%
  filter(team.B == "Australia")

# Select the columns related to the Team
Home <- Home%>%
  select(rowname, team.A, elo.A)

Away <- Away%>%
  select(rowname, team.B, elo.B)

# Renaming the columns
Home <- Home %>% rename(team = team.A, elo = elo.A)

Away <- Away %>% rename(team = team.B, elo = elo.B)

# Binding the data frame
Australia_elo <- bind_rows(Away, Home)

# Adding the match date to the new data frame by joining on the row identifier
Australia_elo <- Australia_elo %>%
  left_join(select(RU_data, rowname, Dates = date), by = "rowname")

# Arranging the matches by date (oldest to most recent)
Australia_elo <- Australia_elo%>%
  arrange(Dates)
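
For readers unfamiliar with the rating itself, the update applied after each match can be sketched in a few lines of base R. This is a minimal illustration of the standard Elo formula, not the internals of the `elo` package; the helper names `elo_expected` and `elo_update` are hypothetical, and `k = 50` simply mirrors the value used above.

```r
# Expected score of team A against team B under the standard Elo formula
elo_expected <- function(r_a, r_b) {
  1 / (1 + 10^((r_b - r_a) / 400))
}

# New ratings after one match; score_a is 1 (A wins), 0.5 (draw) or 0 (A loses)
elo_update <- function(r_a, r_b, score_a, k = 50) {
  delta <- k * (score_a - elo_expected(r_a, r_b))
  c(a = r_a + delta, b = r_b - delta)
}

# Two equally rated teams, home win: ratings become 1525 and 1475
elo_update(1500, 1500, 1)
```

With equal ratings the expected score is 0.5, so a win with `k = 50` moves 25 points from the loser to the winner.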

Plotting the data

Now that the data is ready, it is time to visualize it. Four plots will be produced: the time series itself, the anomaly diagnostics, the ACF and PACF, and the seasonality diagnostics.

# Loading the `timetk` package to plot the data 
library(timetk)

# Looking at the data and its trend
Australia_elo %>%
  plot_time_series(.date_var = Dates,
                   .value = elo)

# Looking for any anomaly
Australia_elo %>%
  plot_anomaly_diagnostics(Dates, elo)

# Plotting ACF and PACF diagnostics
Australia_elo %>%
  plot_acf_diagnostics(.date_var = Dates,
                       .value = elo)

# Observing the seasonality
Australia_elo %>%
  plot_seasonal_diagnostics(.date_var = Dates,
                            .value = elo)

Splitting the data and building the models

The plots show that the data has no anomalies and no seasonality, so no further transformation is required. The data needs to be split into two data sets, training and testing. After the split, four models will be built:
- Linear Model
- Exponential Smoothing
- Prophet
- Auto Arima

# Loading the `tidymodels` package to automatically split the data
library(tidymodels)

# Splitting the data 
split <- initial_time_split(Australia_elo)

# Creating the training data set
train <- training(split)

# Creating the testing data set
test <- testing(split)
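
As a rough picture of what the split does: `initial_time_split()` keeps the rows in their existing order and, with its default `prop = 3/4`, assigns the first three quarters to training and the remainder to testing. This is why the matches were arranged by date beforehand. The base R sketch below mimics that behaviour on a made-up data frame (`toy` and its values are purely illustrative, not the rugby data).

```r
# Toy, ordered data frame standing in for Australia_elo (illustrative values only)
toy <- data.frame(
  Dates = seq(as.Date("2020-01-01"), by = "month", length.out = 20),
  elo   = seq(1500, by = 2, length.out = 20)
)

prop    <- 3 / 4                     # default proportion in initial_time_split()
n_train <- floor(nrow(toy) * prop)   # number of rows kept for training

toy_train <- toy[seq_len(n_train), ]          # earliest observations
toy_test  <- toy[(n_train + 1):nrow(toy), ]   # most recent observations

nrow(toy_train)  # 15
nrow(toy_test)   # 5
```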


# Loading the `modeltime` package to build the following models
library(modeltime)

# Linear Regression
elo_lm <- linear_reg() %>% 
  set_engine("lm") %>% 
  fit(elo ~ Dates, data = train)

# Exponential Smoothing
elo_expsm <- exp_smoothing() %>% 
  set_engine("ets") %>% 
  fit(elo ~ Dates, data = train)

# Prophet
elo_prophet <- prophet_reg() %>% 
  set_engine("prophet") %>% 
  fit(elo ~ Dates, data = train)

# Auto Arima
elo_arima <- arima_reg() %>% 
  set_engine("auto_arima") %>% 
  fit(elo ~ Dates, data = train)

Actual data compared to the models and models accuracy

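A table needs to be created with all the models in it and then calibrated. Calibration is normally done against the testing data; the call below uses the training data, which is why most rows in the accuracy output are labelled as fitted values. Finally the models can be plotted and the results compared.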

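# Gathering the models in a single table
elo_models_table <- modeltime_table(
  elo_lm,
  elo_expsm,
  elo_prophet,
  elo_arima
)

# Calibrating the models (here against the training set)
elo_table_calibrate <- elo_models_table %>% 
  modeltime_calibrate(new_data = train)

# Plotting the forecasts over the testing period against the actual data
elo_table_calibrate %>% 
  modeltime_forecast(
    actual_data = Australia_elo,
    new_data = test
  ) %>% 
  plot_modeltime_forecast()

# Computing the accuracy metrics
elo_table_calibrate %>% 
  modeltime_accuracy()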
## # A tibble: 4 × 9
##   .model_id .model_desc              .type   mae  mape  mase smape  rmse     rsq
##       <int> <chr>                    <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1         1 LM                       Test   42.1  2.75 2.24   2.74  50.1 0.00187
## 2         2 ETS(M,N,N)               Fitt…  18.6  1.21 0.988  1.21  21.2 0.828  
## 3         3 PROPHET                  Fitt…  21.9  1.42 1.17   1.42  26.9 0.715  
## 4         4 ARIMA(2,0,0) WITH NON-Z… Fitt…  17.7  1.15 0.945  1.15  20.6 0.832

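As the graph and the table show, the Arima model is the most accurate, with the lowest mean absolute error (17.74) and the highest R squared (0.83). Since the models were calibrated on the training data, these figures describe fit on the training set rather than forecast accuracy on the testing set.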
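
To make the columns of the accuracy table concrete, here is a minimal base R sketch of how three of them (`mae`, `rmse` and `mape`) are commonly defined. The vectors `actual` and `predicted` are invented for illustration and are unrelated to the results above; `modeltime_accuracy()` computes its metrics through the `yardstick` package, so treat this as a definition-level sketch rather than its implementation.

```r
# Invented values, for illustration only
actual    <- c(1500, 1520, 1510, 1530)
predicted <- c(1490, 1525, 1510, 1540)

errors <- actual - predicted   # 10, -5, 0, -10

mae  <- mean(abs(errors))                 # mean absolute error
rmse <- sqrt(mean(errors^2))              # root mean squared error
mape <- mean(abs(errors / actual)) * 100  # mean absolute percentage error (%)

mae   # 6.25
rmse  # 7.5
```

Because `rmse` squares the errors before averaging, it penalises large misses more heavily than `mae`, which is why the two columns can rank models differently.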

Looking into the future

# Forecasting the coming 5 years
elo_table_calibrate %>% 
  modeltime_forecast(h = "5 years", actual_data = Australia_elo) %>% 
  plot_modeltime_forecast()

In this last plot we can see that all the forecasts are close to a straight line apart from the Prophet model. The Prophet model predicts a lot of variation, while the three others do not anticipate any change.

Discussion

These models would not be reliable for prediction, as they do not anticipate any realistic evolution of the Elo score. More research is needed to find better predictive models.