Soccer Prediction Challenge 2023, FC Slocker

Introduction

The 2023 Soccer Prediction Challenge is an international machine learning competition that invites the machine learning community to predict the outcomes of a set of soccer matches from leagues worldwide played at the beginning of April 2023. Participants may use any publicly available data set to train their machine learning models, or the training set provided by the organizers, or both.

I decided to join the challenge to practice data integration and modelling skills with real football data.

I chose R as the programming language, since it fits a self-contained exercise of data processing and prediction quite well. The dplyr package was used extensively throughout the code for piping, grouping the data, and cumulative calculations.

The approach was first to create a minimum viable product that could deliver final results, and then to improve the quality of the predictions by increasing the complexity of the solution.

Team value

The training set from the organizers includes about 300,000 matches played in more than 70 leagues across the globe between 2000 and 2023:

Fundamentally, predicting scores and probabilities means estimating the relative strength of the two teams in a match. Therefore, I started to look for a formal measure of that strength in the existing data sets. I also planned to fetch some external data, to make the solution distinct from those of other competitors.

Having worked with Transfermarkt before, I realized that the total value of each team, i.e. the sum of the values of its individual players, would fit this role very well. Here and below, all values are in euros. I web-scraped the team values from Transfermarkt using the rvest package. The data are historical, meaning we can find the value of any team on any particular date with a precision of half a month:

The team value data are available from 2011, so I limited the training data set to ca. 120,000 entries. This should still be enough to train the model: the prediction set consists of 630 matches, which is less than 1% of the training set. After joining the training set with the value data from Transfermarkt, we get the following table:

Then the Value_Ratio parameter is calculated. It is positive when the home team is more valuable, negative when the away team is more valuable, and 0 when the two team values are equal:

    Value_Ratio = case_when(
      Value_Home >= Value_Away ~ Value_Home / Value_Away - 1,
      Value_Home < Value_Away ~ (Value_Away / Value_Home - 1) * (-1))
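As a quick check of the sign convention, here is a minimal base-R sketch of the same rule. The function name value_ratio and the team values are only illustrative, not part of the original code.

```r
# Base-R version of the Value_Ratio rule (function name is illustrative).
value_ratio <- function(value_home, value_away) {
  ifelse(value_home >= value_away,
         value_home / value_away - 1,
         -(value_away / value_home - 1))
}

value_ratio(100e6, 50e6)   # home team twice as valuable:  1
value_ratio(50e6, 100e6)   # away team twice as valuable: -1
value_ratio(80e6, 80e6)    # equal values:                 0
```

The measure is symmetric: swapping home and away only flips the sign.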

On these integrated and cleansed data, I used the caret package to build two linear regression models (with 5-fold cross-validation). They model the goals scored by the home and away teams (HS and AS) as a function of Value_Ratio:

Lin_Model_HS_Value_Ratio <- train(HS ~ Value_Ratio, data = TS_Value_Complete, method = "lm", trControl = trainControl(method = "cv", number = 5))
Lin_Model_AS_Value_Ratio <- train(AS ~ Value_Ratio, data = TS_Value_Complete, method = "lm", trControl = trainControl(method = "cv", number = 5))

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Coefficients:
## (Intercept)  Value_Ratio  
##      1.4487       0.1663

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Coefficients:
## (Intercept)  Value_Ratio  
##      1.1439      -0.1237

## Linear Regression 
## 
## 122914 samples
##      1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 98332, 98332, 98330, 98331, 98331 
## Resampling results:
## 
##   RMSE      Rsquared    MAE      
##   1.203246  0.03261776  0.9690217
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

## Linear Regression 
## 
## 122914 samples
##      1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 98331, 98332, 98331, 98330, 98332 
## Resampling results:
## 
##   RMSE      Rsquared    MAE      
##   1.082491  0.02252732  0.8413545
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
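To make the resampling figures above easier to read, here is a rough base-R sketch of what 5-fold cross-validation does for a single linear model. It runs on simulated data, not the competition data, and the coefficients used to simulate goals are arbitrary assumptions; caret's internals differ in detail.

```r
set.seed(1)

# Simulated data (NOT the competition data): goals loosely depend on Value_Ratio.
n <- 500
sim <- data.frame(Value_Ratio = rnorm(n))
sim$HS <- rpois(n, lambda = exp(0.3 + 0.1 * sim$Value_Ratio))

# Assign each row to one of 5 folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other four folds, measure errors on the held-out one.
cv <- sapply(seq_len(k), function(i) {
  fit  <- lm(HS ~ Value_Ratio, data = sim[fold != i, ])
  test <- sim[fold == i, ]
  err  <- test$HS - predict(fit, newdata = test)
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
})

rowMeans(cv)   # averaged over the folds, analogous to the summary above
```

RMSE and MAE are both in goals per match, so an RMSE of about 1.2 means the predicted home goals are typically off by roughly one goal.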

Graphically, the two regression lines can be plotted with the ggplot2 package:

in green - home scored goals,
in red - away scored goals.

Both intercepts are positive (1.449 and 1.144): in a match between teams of equal value (Value_Ratio = 0), the models expect about 1.45 goals for the home team and about 1.14 for the away team, i.e. more than one goal each on average.

The intersection of both linear regressions occurs at Value_Ratio ~ -1.

It means that an even game is expected when the home team is worth half as much as the away team.

E.g. with Value_Home = 50M and Value_Away = 100M: Value_Ratio = (100M / 50M - 1) * (-1) = -1.

This reflects the advantage of playing at home.
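The intersection point can be checked directly from the fitted coefficients. This is a small sketch that copies the rounded values from the two outputs above; the variable names are only illustrative.

```r
# Coefficients copied from the two lm() outputs above.
home <- c(intercept = 1.4487, slope = 0.1663)
away <- c(intercept = 1.1439, slope = -0.1237)

# Solve intercept_h + slope_h * x = intercept_a + slope_a * x for x.
x_even <- unname((away["intercept"] - home["intercept"]) /
                 (home["slope"] - away["slope"]))
round(x_even, 2)   # -1.05

# Expected goals for each side at that point (equal by construction).
unname(home["intercept"] + home["slope"] * x_even)
```

At that Value_Ratio both models predict roughly 1.27 goals per side.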

Another model, a multinomial logistic regression, is built to predict the probabilities of a win, draw, and loss. For that purpose, the nnet package is used:

Log_Model_WDL_Value_Ratio <- multinom(WDL ~ Value_Ratio, data = TS_Value_Complete)

## Call:
## multinom(formula = WDL ~ Value_Ratio, data = TS_Value_Complete)
## 
## Coefficients:
##   (Intercept) Value_Ratio
## L  0.02881784  -0.1744404
## W  0.44719601   0.2126819
## 
## Residual Deviance: 259168.4 
## AIC: 259176.4
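To see what these coefficients mean in terms of probabilities, here is a minimal sketch that applies the multinomial logit formula by hand. The output has rows only for L and W, so D is the reference class with a score of 0. The helper name wdl_prob is hypothetical; in practice predict(..., type = "probs") on the fitted model would do the same job.

```r
# Coefficients copied from the multinom() output above; D is the reference class.
coefs <- rbind(L = c(intercept = 0.02881784, slope = -0.1744404),
               W = c(intercept = 0.44719601, slope =  0.2126819))

# Hypothetical helper: class probabilities for a given Value_Ratio.
wdl_prob <- function(value_ratio) {
  scores <- c(D = 0, coefs[, "intercept"] + coefs[, "slope"] * value_ratio)
  exp(scores) / sum(exp(scores))
}

round(wdl_prob(0), 3)   # teams of equal value
round(wdl_prob(1), 3)   # home team twice as valuable
```

For teams of equal value this gives roughly 0.44 for a home win, 0.28 for a draw and 0.29 for a loss, so the home advantage shows up here as well.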

Unfortunately, the deviance figures alone can’t tell us much about how good the model is in these particular circumstances. For that reason, I validated my probabilities against the odds of a well-known bookmaker (BM).

Below are bookmaker’s and my probabilities for WDL:

Root mean square error (RMSE) shows deviation between the probabilities: the bigger the value, the worse my prediction is.

RMSE = sqrt(((BM_W - My_W)^2 + (BM_D - My_D)^2 + (BM_L - My_L)^2)/3)
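A minimal sketch of this comparison for a single match follows. It uses the Arsenal - Bournemouth odds quoted further below. The rescaling of implied probabilities to sum to 1 is an assumption about how the bookmaker margin is removed, and the model probabilities are made up for illustration.

```r
# Decimal odds for home win / draw / away win (Arsenal - Bournemouth).
odds <- c(W = 1.35, D = 4.75, L = 9.00)

# Implied probabilities sum to more than 1 because of the bookmaker margin,
# so they are rescaled to sum to 1 (an assumed normalization).
implied <- 1 / odds
bm <- implied / sum(implied)

# HYPOTHETICAL model probabilities, for illustration only.
my <- c(W = 0.60, D = 0.23, L = 0.17)

# Root mean square deviation between the two probability vectors.
rmse <- sqrt(mean((bm - my)^2))
round(bm, 3)
rmse
```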

I realized that my predictions are less ‘aggressive’ than those from the bookmaker, especially for teams such as Manchester City, Liverpool, or Arsenal. Then I noticed another match where the difference is visible as well:

Brighton’s team value is smaller than West Ham’s. This gives an advantage to the away team.

But Brighton is playing at home, which benefits them.

The bookmaker strongly favors Brighton, giving them more than a 50% chance of winning.

My prediction is more conservative, with a home win probability of about 4 in 10.

There must be a reason why Brighton is ‘weighted heavier’ in the eyes of the bookmaker.

If you look at the English Premier League table, one candidate explanation stands out: the difference in current table positions. Brighton is higher in the table and has earned more points than West Ham:

EPL table as of 26.02.2023

Generally, all the teams that the bookmaker trusted were at the top of the table. For clubs such as Liverpool or Manchester City this is no surprise, and therefore not very noticeable. But Arsenal and Brighton are performing significantly above expectations this season. Therefore, adding a predictor that captures current form seems reasonable.

Cumulative points

The team value data refer to the theoretical, potential strength of a team.

To capture the current, actual strength, let’s use the data already available from the organizers. We calculate cumulative points for each team within the current season before every match.

For each match, an additional predictor Point_STD_Diff is added:

Point_STD_Diff = Point_STD_Home - Point_STD_Away
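The cumulative part can be sketched in base R on a toy table. The text does not say how the points are standardized (the STD part), so this sketch stops at raw cumulative points. The match data, the column names Points_Home, Points_Away and Point_Diff, and the helper lookup are all made up for illustration.

```r
# Toy results for one season (made-up data); HS/AS follow the naming used above.
matches <- data.frame(
  Date = as.Date(c("2023-01-01", "2023-01-08", "2023-01-15")),
  Home = c("A", "B", "A"),
  Away = c("B", "A", "B"),
  HS   = c(2, 1, 0),
  AS   = c(0, 1, 3),
  stringsAsFactors = FALSE
)

# Points earned in each match: 3 for a win, 1 for a draw, 0 for a loss.
pts_home <- ifelse(matches$HS > matches$AS, 3, ifelse(matches$HS == matches$AS, 1, 0))
pts_away <- ifelse(matches$AS > matches$HS, 3, ifelse(matches$HS == matches$AS, 1, 0))

# Long format: one row per team per match, sorted by team and date.
long <- data.frame(
  Match = rep(seq_len(nrow(matches)), 2),
  Date  = rep(matches$Date, 2),
  Team  = c(matches$Home, matches$Away),
  Pts   = c(pts_home, pts_away),
  stringsAsFactors = FALSE
)
long <- long[order(long$Team, long$Date), ]

# Cumulative points BEFORE each match: running total minus the current match.
long$Pts_Before <- ave(long$Pts, long$Team, FUN = cumsum) - long$Pts

# Map the values back to the match table (hypothetical helper).
lookup <- function(team, match) long$Pts_Before[long$Team == team & long$Match == match]
idx <- seq_len(nrow(matches))
matches$Points_Home <- mapply(lookup, matches$Home, idx, USE.NAMES = FALSE)
matches$Points_Away <- mapply(lookup, matches$Away, idx, USE.NAMES = FALSE)
matches$Point_Diff  <- matches$Points_Home - matches$Points_Away
matches
```

Subtracting the current match's points matters: the predictor must only use information available before kick-off, otherwise the result leaks into its own predictor.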

Then we build the same models as before, but using two predictors.

Lin_Model_HS <- train(HS ~ Value_Ratio + Point_STD_Diff, data = TS_Value_Complete, method = "lm", trControl = trainControl(method = "cv", number = 5))
Lin_Model_AS <- train(AS ~ Value_Ratio + Point_STD_Diff, data = TS_Value_Complete, method = "lm", trControl = trainControl(method = "cv", number = 5))
Log_Model_WDL <- multinom(WDL ~ Value_Ratio + Point_STD_Diff, data = TS_Value_Complete)

The three new models are slightly better according to RMSE (linear models) and residual deviance (multinomial model):

## Linear Regression 
## 
## 122914 samples
##      2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 98331, 98331, 98332, 98332, 98330 
## Resampling results:
## 
##   RMSE     Rsquared   MAE      
##   1.19578  0.0446394  0.9590288
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

## Linear Regression 
## 
## 122914 samples
##      2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 98332, 98331, 98331, 98332, 98330 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.077327  0.0318632  0.8389198
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

## Call:
## multinom(formula = WDL ~ Value_Ratio + Point_STD_Diff, data = TS_Value_Complete)
## 
## Coefficients:
##   (Intercept) Value_Ratio Point_STD_Diff
## L 0.008661911  -0.1200493    -0.01608641
## W 0.441590931   0.1565452     0.01631347
## 
## Residual Deviance: 257039.7 
## AIC: 257051.7

Additional validation is performed against the bookmaker’s predictions. The results are more promising:

To confirm this numerically, the average deviation (RMSE) with WDL ~ Value_Ratio is:

## [1] 0.08398743

The average deviation (RMSE_P) with WDL ~ Value_Ratio + Point_STD_Diff is:

## [1] 0.07443581

After these checks, I keep the new models and use them for predictions:

Exact score

Another challenge I tackled was deciding which exact score to predict.

Here is the distribution of most likely outcomes:

This is the distribution of expected exact scores:

These two pieces of data seem contradictory at first glance. On the one hand, the likeliest outcome is a home win. On the other, 1:1 dominates the exact scores with a 58% share. It means there are cases where we predicted W as the outcome but 1:1 as the exact score. Or maybe this is valid behavior, and predicting exact scores and predicting WDL probabilities are two separate tasks?

To test this, we calculate the weighted average RMSE of each exact score and compare them.

I took three matches with various relative strengths between the teams:

Home team as a clear favorite: Arsenal - Bournemouth (ARSBOU)
- Bookmaker WDL odds: 1.35 / 4.75 / 9.00
- Model exact score prediction: 2:1
Home team is slightly stronger: Liverpool - Manchester United (LIVMUN)
- Bookmaker WDL odds: 2.30 / 3.60 / 2.87
- Model exact score prediction: 1:1
More or less equal odds for home win, draw, or loss: Everton - Aston Villa (EVEAVL)
- Bookmaker WDL odds: 2.50 / 3.20 / 2.90
- Model exact score prediction: 1:1

These are the initial odds from a bookmaker for the 13 most popular exact scores (whose implied probabilities normally add up to very close to 100%):

After normalization, the probabilities of each score are:

The next step is to cross join this set of 13 scores against itself and calculate the RMSE for each pair. This allows us to cover all possible combinations of an actual score and a predicted one (13 x 13 = 169 combinations).

RMSE shows how far exact scores are, one from another:

RMSE = sqrt(((HS-prd_HS)^2+(AS-prd_AS)^2)/2)

An interesting observation: RMSE (1:0 vs 3:0) = 1.41, while RMSE (1:0 vs 2:1) = 1.00. They differ, although the deviation in total number of goals is the same in both cases: 2.

This means it’s better to spread the deviation across both HS and AS (the 1:0 vs 2:1 case) than to concentrate it in one component (the 1:0 vs 3:0 case). Let’s keep this in mind for the conclusions below.

Now we group the data by the predicted scores and calculate the weighted average RMSE for each. The smaller the value, the better the score selection.
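The cross join and the pairwise RMSE can be reproduced with a few lines of base R. This sketch uses only four scores instead of the 13 from the text, chosen so that the two pairs discussed above are included.

```r
# A small set of scores (a subset; the text uses 13 of them).
scores <- data.frame(HS = c(1, 3, 2, 1), AS = c(0, 0, 1, 1))

# Cross join: every actual score paired with every predicted score.
pairs <- merge(scores, setNames(scores, c("prd_HS", "prd_AS")), by = NULL)

# Distance between actual and predicted score, as defined above.
pairs$RMSE <- sqrt(((pairs$HS - pairs$prd_HS)^2 + (pairs$AS - pairs$prd_AS)^2) / 2)

nrow(pairs)   # 4 x 4 = 16 combinations
subset(pairs, HS == 1 & AS == 0 & prd_HS == 3 & prd_AS == 0)$RMSE   # 1.414214
subset(pairs, HS == 1 & AS == 0 & prd_HS == 2 & prd_AS == 1)$RMSE   # 1
```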

    RMSE_avgw_ARSBOU = sum(Prob_ARSBOU*RMSE),
    RMSE_avgw_LIVMUN = sum(Prob_LIVMUN*RMSE),
    RMSE_avgw_EVEAVL = sum(Prob_EVEAVL*RMSE)

These are separate results for each match, sorted by the best score:

As you can see, the best one for Arsenal - Bournemouth is 2:0, tightly followed by 2:1 (predicted by the model).

For both Liverpool - Manchester United and Everton - Aston Villa, the best is 1:1 (also predicted by the model).

So our hypothesis holds: 1:1 can be the score with potentially the lowest RMSE, even when the most probable outcome is W.

The explanation is as follows. In the competition, we are ultimately scored by the total/average RMSE across all matches, not by the number of scores predicted exactly.

Another observation is that 1:1 is the score which lies ‘in the middle’. This position helps to smooth out the deviations: even if the exact score is wrong many times, the resulting RMSE will be lower.
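The selection rule can be illustrated with a small sketch. The score probabilities below are hypothetical, not the bookmaker figures used in my analysis, and the helper name expected_rmse is only illustrative.

```r
# Candidate scores with HYPOTHETICAL probabilities of being the actual result.
# Home wins add up to 0.45, draws to 0.25, away wins to 0.30.
scores <- data.frame(
  HS   = c(1, 2, 2, 0, 1, 0, 1),
  AS   = c(0, 0, 1, 0, 1, 1, 2),
  Prob = c(0.20, 0.10, 0.15, 0.10, 0.15, 0.15, 0.15)
)

# Expected RMSE when predicting (prd_HS, prd_AS): the probability-weighted
# distance to every possible actual score.
expected_rmse <- function(prd_HS, prd_AS) {
  rmse <- sqrt(((scores$HS - prd_HS)^2 + (scores$AS - prd_AS)^2) / 2)
  sum(scores$Prob * rmse)
}

scores$RMSE_avgw <- mapply(expected_rmse, scores$HS, scores$AS)
scores[order(scores$RMSE_avgw), ]   # best score selection first
```

With these made-up numbers, 1:1 has the lowest weighted RMSE although a home win is the most likely outcome and 1:0 is the most likely single score.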

The prediction table for exact scores:

Conclusion

In order to predict exact scores and probabilities for win/draw/loss, I combined the data provided by the organizers and the data on team values from Transfermarkt.

After data harmonization, the predictive models were created, tested and used.

The most challenging parts of the whole exercise were:

- finding proper predictors for the models,
- validating the results,
- technical implementation of iterative web-scraping of team value data.