This page is part of a series on EPL Away Wins.

Following the exploratory data analysis, we now turn to modelling the EPL data (using tidymodels). The available data set contains 6508 matches. The most recent 20% of matches will be held out as a test set for assessing performance. Here are the first 10 rows of the data.

We will remove rows with missing values in the selected features. This costs us 2% of the data.
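As a rough illustration of this filtering step, the base R sketch below keeps only rows that are complete in a chosen set of columns and reports the fraction lost. The data frame, its column names and its values are hypothetical, not the actual EPL data.

```r
# Hypothetical toy data; the real data set has many more columns.
matches <- data.frame(
  HST = c(5, NA, 3, 6),
  AST = c(2, 4, NA, 1),
  Referee = c("A", NA, "B", NA)  # not a selected feature, so its NAs are ignored
)

# Only these columns are checked for missing values.
selected <- c("HST", "AST")

keep <- complete.cases(matches[, selected])
clean <- matches[keep, ]

# Fraction of rows lost by the filter.
lost <- 1 - nrow(clean) / nrow(matches)
lost
```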

Feature Creation

We create Summed Features: for the Home team, the Away team and their opponents, we sum shots on target, corners, fouls, goals scored and goals conceded over the last 4 matches.
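A minimal sketch of one such summed feature is given below. It assumes the sum covers the 4 matches before the current one (so the current result does not leak into its own feature) and that a team's matches are already in date order. The helper name sum_last_n and the example values are hypothetical.

```r
# Hypothetical helper: sum of the previous n values, excluding the current one.
# Positions without n earlier matches are left as NA.
sum_last_n <- function(x, n = 4) {
  out <- rep(NA_real_, length(x))
  if (length(x) > n) {
    for (i in (n + 1):length(x)) {
      out[i] <- sum(x[(i - n):(i - 1)])
    }
  }
  out
}

# Shots on target for one team across six consecutive matches (made-up values).
shots_on_target <- c(5, 3, 6, 2, 4, 7)
summed <- sum_last_n(shots_on_target)
summed
```

The leading NA values are one plausible source of the missing rows removed earlier, although the original page does not say so.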

Take out Test Data

We remove the final 20% of results to act as the test data. We also randomly define stratified 5-fold cross-validation splits for the training data.
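One way to express this split with rsample (part of tidymodels) is sketched below. The toy data is hypothetical, and the use of initial_time_split() is an assumption about how the chronological hold-out might be done rather than a record of the code behind this page.

```r
library(tidymodels)

set.seed(42)
# Toy stand-in for the match data, already sorted by Date (hypothetical values).
matches <- tibble(
  Date = as.Date("2015-08-08") + 0:99,
  AwayWin = factor(sample(c("Yes", "No"), 100, replace = TRUE, prob = c(0.28, 0.72)))
)

# The first 80% of rows (the earliest matches) become the training set.
split <- initial_time_split(matches, prop = 0.8)
train <- training(split)
test <- testing(split)

# Random 5-fold cross-validation, stratified on the outcome.
folds <- vfold_cv(train, v = 5, strata = AwayWin)
```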

Training data rows = 5114 
Proportion of Away Wins = 0.279 

Set up Pre-treatment Recipes

Firstly, we define the variables and model formula.

We have a formula of [AwayWin ~ sum_HST + sum_HC + sum_HGS + sum_HGC + sum_AST + sum_AC + sum_AF + sum_AGS + sum_AGC + sum_HoppST + sum_HoppC + sum_AoppST + sum_AoppC + winpc_H + winpc_A + top6_perfH + top6_perfA + Dist + Date + HomeFin1 + HomeFin2 + AwayFin1 + AwayFin2].

Then we define the pre-treatment recipes. These remove correlated numeric variables, normalize all numeric variables and turn categorical variables into dummy variables. From Date we derive Month and Day of Week and create dummies for both.

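# Pre-treatment recipe; uses the train data and model formula defined earlier
recipe <-
  train %>%
  recipe(formula) %>%
  # drop highly correlated numeric variables, then centre and scale the rest
  step_corr(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  # derive day-of-week and month factors from Date, then drop the raw Date
  step_date(Date, features = c("dow", "month")) %>%
  step_rm(Date) %>%
  # convert the two derived date factors into dummy variables
  step_dummy(Date_dow, Date_month)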

What does the subsequent training frame look like?
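A processed training frame can be inspected by calling prep() on the recipe and then bake() with new_data = NULL. The sketch below does this on a small hypothetical data frame with a single numeric feature; the values are made up and the correlation filter is omitted for brevity.

```r
library(tidymodels)

# Hypothetical toy data with one numeric feature and a Date column.
toy <- data.frame(
  AwayWin = factor(c("Yes", "No", "No", "Yes", "No", "No")),
  sum_HST = c(18, 22, 15, 20, 17, 25),
  Date = as.Date("2019-08-10") + c(0, 7, 14, 45, 80, 120)
)

rec <- recipe(AwayWin ~ ., data = toy) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_date(Date, features = c("dow", "month")) %>%
  step_rm(Date) %>%
  step_dummy(Date_dow, Date_month)

# prep() estimates the steps on the training data;
# bake(new_data = NULL) returns the processed training set.
baked <- rec %>% prep() %>% bake(new_data = NULL)
names(baked)
```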

Feature Importance

To estimate the relative importance of the features, we fit an unoptimised random forest model to the training data and determine variable importance from the model.
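The page does not state which random forest engine or importance measure was used. The sketch below assumes the ranger package with impurity-based importance, applied to hypothetical toy data in which one feature carries signal and the other is noise.

```r
library(ranger)

set.seed(7)
n <- 500
# Hypothetical features: 'signal' drives the outcome, 'noise' does not.
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$AwayWin <- factor(ifelse(toy$signal + rnorm(n, sd = 0.5) > 0.6, "Yes", "No"))

# Unoptimised forest; importance = "impurity" requests Gini-based importance.
rf <- ranger(AwayWin ~ ., data = toy, num.trees = 200,
             importance = "impurity", seed = 7)

# Named numeric vector, sorted so the most important feature comes first.
imp <- sort(rf$variable.importance, decreasing = TRUE)
imp
```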

We can see that all the Date variables have relatively little importance when predicting Away wins. We will remove the Date variables going forward.

Generalized Linear Model

We will create a tuning workflow in order to optimise hyperparameters for our logistic regression model. As explained above, we will first redefine our pretreatment recipe to remove the Date features.

# Same pre-treatment as before, but Date is dropped instead of being expanded
recipe <-
  train %>%
  recipe(formula) %>%
  step_corr(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  step_rm(Date)

We will optimise the mixture and penalty values for our glmnet model using cross-validation on the training data. ROC AUC and accuracy will be evaluated for each set of predictions.

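# Elastic-net logistic regression; both hyperparameters are left for tuning
model_glm <-
  logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# 30 space-filling candidates; trans = NULL keeps penalty on its raw scale
grid_glm <-
  grid_max_entropy(
    penalty(range = c(0.001, 0.1), trans = NULL),
    mixture(range = c(0, 1)),
    size = 30)

# Metrics evaluated on each cross-validation fold
my_metrics <- metric_set(roc_auc, accuracy)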
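The tuning call itself is not shown on this page. A minimal, self-contained sketch of how such a grid search might be run with tune_grid() follows. The toy data, the columns x1 and x2, and the regular 3 x 3 grid are assumptions made to keep the example short; the analysis above uses a 30-point maximum entropy grid on the real training data.

```r
library(tidymodels)

set.seed(1)
# Hypothetical stand-in for the training data.
toy <- tibble(
  x1 = rnorm(200),
  x2 = rnorm(200),
  AwayWin = factor(ifelse(x1 + rnorm(200) > 0.5, "Yes", "No"),
                   levels = c("Yes", "No"))
)

folds <- vfold_cv(toy, v = 5, strata = AwayWin)

rec <- recipe(AwayWin ~ ., data = toy) %>%
  step_normalize(all_numeric_predictors())

model_glm <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Small regular grid used here only to keep the run short.
grid_glm <- grid_regular(
  penalty(range = c(0.001, 0.1), trans = NULL),
  mixture(range = c(0, 1)),
  levels = 3
)

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(model_glm)

# Fit every candidate on every fold and collect both metrics.
res <- tune_grid(wf, resamples = folds, grid = grid_glm,
                 metrics = metric_set(roc_auc, accuracy))

best <- show_best(res, metric = "roc_auc", n = 3)
best
```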

After cross-validating over a range of mixture and penalty values, let’s plot the ROC AUC values and list the best models.

Now let’s look at the accuracy metrics.

We see some differences between the models. Lower mixture and penalty values seem to give more consistently high AUC results. To obtain a model that generalises well, it is better to choose a higher penalty value whose performance is still close to the best. In terms of AUC, a penalty of around 0.03 seems suitable.

We will therefore choose mixture = 0.3 and penalty = 0.03, train a model on the training data and apply it to the test data. How does the ROC curve look for the test set, and what is the AUC?

AUC value for best model is 0.756412
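The final fit and test-set AUC could be obtained roughly as follows. This is a sketch on hypothetical toy data with two made-up predictors, x1 and x2; it skips the pre-treatment recipe and only shows the chosen hyperparameters being fixed and the held-out rows being scored.

```r
library(tidymodels)

set.seed(3)
n <- 400
# Hypothetical data; the first 320 rows play the role of the training set.
toy <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  AwayWin = factor(ifelse(x1 + rnorm(n) > 0.6, "Yes", "No"),
                   levels = c("Yes", "No"))
)
train <- toy[1:320, ]
test <- toy[321:400, ]

# Fix the hyperparameters chosen from cross-validation.
final_fit <- logistic_reg(penalty = 0.03, mixture = 0.3) %>%
  set_engine("glmnet") %>%
  fit(AwayWin ~ x1 + x2, data = train)

# Predicted class probabilities for the held-out rows.
preds <- predict(final_fit, new_data = test, type = "prob")

# ROC AUC with "Yes" (the first factor level) as the event of interest.
auc <- roc_auc(bind_cols(test, preds), truth = AwayWin, .pred_Yes)
auc$.estimate
```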

Potential Betting Strategies

There are two options for betting on matches: Bet the Away win (predict that it will happen) or Lay the Away win (predict that it won’t happen).

Betting Probability Limits

One strategy for betting is to define a betting limit, where bets are placed when the predicted probability of success is greater than the defined limit. If we look at the test data, we can see what the outcome would have been if we had used a particular betting limit for this data. Note that there is no guarantee at all that the same outcome would be achieved for future data.

Away Win Bets

Looking at a range of betting limits for Away Win Bets gives the following results for the test set.

For example, if we had used a prediction limit of 0.55, we would have made 86 bets, 79% of them correct, for a total profit of 9.38 with a one-dollar stake per bet. The total stake would have been 86, so the profit ratio would have been 10.9% of the total stake.

For this data, the best profit would have been made with a betting limit of 0.56 (profit 16.8% of stake).
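The quantities quoted above (number of bets, precision, profit and profit ratio) can be computed along the following lines. The sketch assumes decimal odds, where a winning unit bet returns odds - 1 and a losing bet costs the stake. The function name back_summary and all example values are hypothetical.

```r
# Hypothetical helper: summarise backing the Away win above a probability limit.
back_summary <- function(prob, odds, won, limit, stake = 1) {
  bet <- prob > limit
  n <- sum(bet)
  # Winning bets return stake * (odds - 1); losing bets lose the stake.
  profit <- sum(ifelse(won[bet], stake * (odds[bet] - 1), -stake))
  list(
    bets = n,
    precision = mean(won[bet]),
    profit = profit,
    profit_ratio = profit / (n * stake)
  )
}

prob <- c(0.60, 0.58, 0.40, 0.70)    # model probability of an Away win
odds <- c(2.0, 1.9, 3.0, 1.5)        # decimal odds on the Away win
won  <- c(TRUE, FALSE, TRUE, TRUE)   # whether the Away team won

res_back <- back_summary(prob, odds, won, limit = 0.55)
res_back
```

Scanning a sequence of limits with this function would give a table like the one summarised above.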

Away Win Lays

Looking at a range of betting limits for Away Win Lays gives the following results for the test set. Note that historic Lay odds are not generally available, so a correction has been made to the average Away win odds. Lay odds (e.g. those available at Betfair) are generally higher than the average Away win odds.

A limit of 0.91 would have given a profit of 49.97 from 64 bets, with a precision of 98%. The total stake required for this profit would have been 1417, so the profit ratio would have been 3.5%.
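For lay bets the required stake is the liability rather than a flat unit. The sketch below assumes a unit backer's stake, a liability of odds - 1 per bet, no exchange commission, and a limit applied to the model probability that the Away win does not happen. These are assumptions about the calculation, and the function name lay_summary and the example values are hypothetical.

```r
# Hypothetical helper: summarise laying the Away win above a probability limit.
lay_summary <- function(prob_no_away, lay_odds, away_won, limit, stake = 1) {
  bet <- prob_no_away > limit
  # Amount at risk on each lay bet.
  liability <- stake * (lay_odds[bet] - 1)
  # The layer wins the backer's stake unless the Away team wins.
  profit <- sum(ifelse(away_won[bet], -liability, stake))
  list(
    bets = sum(bet),
    precision = mean(!away_won[bet]),
    profit = profit,
    total_liability = sum(liability),
    profit_ratio = profit / sum(liability)
  )
}

prob_no_away <- c(0.95, 0.93, 0.92, 0.80)  # model probability of no Away win
lay_odds <- c(11, 9, 6, 3)                 # decimal lay odds on the Away win
away_won <- c(FALSE, FALSE, TRUE, FALSE)

res_lay <- lay_summary(prob_no_away, lay_odds, away_won, limit = 0.91)
res_lay
```

A single losing lay at long odds can wipe out many small wins, which is why the profit ratio stays small even when precision is very high.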

Prediction to Odds Probability Ratio

If we calculate the probability of the result implied by the betting odds and compare with the probability determined by the model, we can examine the ratio of model probability / implied odds probability. This has the potential to show where we may have an advantage over the bookmaker odds.
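As a small worked example, the implied probability is taken here as 1 / decimal odds, which ignores the bookmaker's margin; that simplification is an assumption, as are the example odds, probabilities and the limit of 1.5.

```r
# Hypothetical decimal odds on the Away win and model probabilities.
away_odds <- c(5.0, 2.5, 8.0)
p_model <- c(0.36, 0.30, 0.10)

# Probability implied by the odds (bookmaker margin ignored).
p_implied <- 1 / away_odds

# A ratio above 1 means the model rates the Away win more likely than the odds do.
ratio <- p_model / p_implied

# Bets would be placed where the ratio exceeds a chosen limit.
place_bet <- ratio > 1.5
data.frame(away_odds, p_model, p_implied, ratio, place_bet)
```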

Away Win Bets

Let’s have a look at how the ratio of my prediction / implied bet probability affects the Bet metrics.

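For this data, if we had used a ratio limit of 1.8, we would have made 75 bets for a profit of 36.73, with a precision of 11%. The total stake would have been 75, giving a profit ratio of 49%. Making a profit at such low precision implies that the winning bets were at long odds.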

Away Win Lays

Let’s have a look at how the ratio of my prediction / implied lay probability affects the Lay metrics.

If we had used a ratio limit of 1.00, we would have made 593 bets for a profit of 61.35, with a precision of 59%. This would have required a total stake of 1017, giving a profit ratio of 6.0%.



