The 2023 Soccer Prediction Challenge is an international machine learning competition that invites the machine learning community to predict the outcomes of a set of soccer matches from leagues worldwide played at the beginning of April 2023. Participants may use any publicly available data set to train their machine learning models, or the training set provided by the organizers, or both.
I decided to join the challenge to practice data integration and modelling skills with real football data.
I chose R as the programming language, since it fits a self-contained exercise of data processing and prediction well. The dplyr package was used extensively throughout the code for piping, grouping the data, and cumulative calculations.
The approach was first to build a minimum viable product that could deliver final results, and then to improve the quality of the predictions by increasing the complexity of the solution.
Fundamentally, predicting scores and probabilities means estimating the relative strength of the two teams in a match. I therefore searched the existing data sets for a formal measure of that strength. I also planned to fetch some external data, to make my solution distinct from those of other competitors.
Having worked with Transfermarkt before, I realized that the total value of each team, made up of the values of its individual players, would fit that role very well. Here and below, all values are in euros. I scraped the team values from Transfermarkt using the rvest package. The data are historical, so the value of any team can be found for any date with half-month precision:
Then the Value_Ratio parameter is calculated. It is positive when the home team is more expensive, negative when the away team is more expensive, and 0 when the teams are equal in value:
Value_Ratio = case_when(
Value_Home >= Value_Away ~ Value_Home / Value_Away - 1,
Value_Home < Value_Away ~ (Value_Away / Value_Home - 1) * (-1))
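The definition above can be checked on a few hand-picked values. The following is a minimal base-R sketch; the helper name value_ratio and the sample team values are illustrative assumptions, not part of the original code:

```r
# Hypothetical helper mirroring the case_when() definition above.
value_ratio <- function(value_home, value_away) {
  ifelse(value_home >= value_away,
         value_home / value_away - 1,
         -(value_away / value_home - 1))
}

# Made-up team values in millions of euros.
value_ratio(100, 50)   # home team twice as valuable: 1
value_ratio(50, 100)   # away team twice as valuable: -1
value_ratio(80, 80)    # equally valued teams: 0
```

The measure is symmetric: swapping home and away only flips the sign.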
On these integrated and cleansed data, I built two linear regression models with the caret package (using 5-fold cross-validation). They model the goals scored by the home and away teams (HS and AS) as a function of Value_Ratio:
Lin_Model_HS_Value_Ratio <- train(HS~Value_Ratio,data=TS_Value_Complete,method="lm",trControl=trainControl(method="cv",number=5))
Lin_Model_AS_Value_Ratio <- train(AS~Value_Ratio,data=TS_Value_Complete,method="lm",trControl=trainControl(method="cv",number=5))
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) Value_Ratio
## 1.4487 0.1663
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) Value_Ratio
## 1.1439 -0.1237
## Linear Regression
##
## 122914 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 98332, 98332, 98330, 98331, 98331
## Resampling results:
##
## RMSE Rsquared MAE
## 1.203246 0.03261776 0.9690217
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Linear Regression
##
## 122914 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 98331, 98332, 98331, 98330, 98332
## Resampling results:
##
## RMSE Rsquared MAE
## 1.082491 0.02252732 0.8413545
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Graphically, this can be represented with the help of the ggplot2 package:
Both intercepts are positive (1.449 and 1.144), which means that in a match between equally valued teams (Value_Ratio = 0) each side is expected to score more than one goal on average.
The two regression lines intersect at Value_Ratio ~ -1.
This means that an even game is expected when the home team is worth half as much as the away team.
E.g. with Value_Home = 50M and Value_Away = 100M: Value_Ratio = (100M / 50M - 1) * (-1) = -1.
This reflects the advantage of playing at home.
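The intersection point can be verified directly from the fitted coefficients. This is a small sketch that only reuses the numbers printed by the two lm() fits above; the variable names are illustrative:

```r
# Coefficients copied from the two lm() printouts above.
home <- c(intercept = 1.4487, slope = 0.1663)    # goals scored by the home team
away <- c(intercept = 1.1439, slope = -0.1237)   # goals scored by the away team

# Solve home_intercept + home_slope * x = away_intercept + away_slope * x for x.
x_even <- (away[["intercept"]] - home[["intercept"]]) /
  (home[["slope"]] - away[["slope"]])
x_even   # about -1.05

# Expected goals for each side at that Value_Ratio (equal by construction).
goals_even <- home[["intercept"]] + home[["slope"]] * x_even
goals_even   # about 1.27
```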
Another model, a multinomial logistic regression, is built to predict the probabilities of a win, draw, and loss. For that purpose, the nnet package is used:
Log_Model_WDL_Value_Ratio <- multinom(WDL ~ Value_Ratio, data = TS_Value_Complete)
## Call:
## multinom(formula = WDL ~ Value_Ratio, data = TS_Value_Complete)
##
## Coefficients:
## (Intercept) Value_Ratio
## L 0.02881784 -0.1744404
## W 0.44719601 0.2126819
##
## Residual Deviance: 259168.4
## AIC: 259176.4
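To make the coefficients easier to read, they can be turned into probabilities with the usual multinomial logit formula. The printout has rows for L and W only, so D is the reference class with a linear predictor of 0. The sketch below copies the coefficients from the output above; the helper name wdl_probs is an illustrative assumption:

```r
# Coefficients copied from the multinom() printout above; D is the reference class.
coefs <- rbind(
  L = c(intercept = 0.02881784, slope = -0.1744404),
  W = c(intercept = 0.44719601, slope = 0.2126819)
)

# Hypothetical helper: W/D/L probabilities for a given Value_Ratio.
wdl_probs <- function(value_ratio) {
  eta <- c(D = 0, coefs[, "intercept"] + coefs[, "slope"] * value_ratio)
  exp(eta) / sum(exp(eta))
}

round(wdl_probs(0), 3)   # equally valued teams
round(wdl_probs(1), 3)   # home team twice as valuable
```

For equally valued teams this gives a home win probability of roughly 0.44, which again shows the home advantage.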
Unfortunately, the deviance figures alone can’t tell us much about how good the model is in these particular circumstances. For that reason, I validated my probabilities against the odds of a well-known bookmaker (BM).
Below are the bookmaker’s and my probabilities for WDL:
Root mean square error (RMSE) shows the deviation between the two sets of probabilities: the bigger the value, the worse my prediction is.
RMSE = sqrt(((BM_W - My_W)^2 + (BM_D - My_D)^2 + (BM_L - My_L)^2)/3)
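As a worked example, the deviation can be computed for a single match, pairing the bookmaker's and the model's probability for the same outcome in each term. The probabilities below are made up for illustration, and the helper name wdl_rmse is an assumption:

```r
# Hypothetical helper: RMSE between two W/D/L probability vectors.
wdl_rmse <- function(bm, my) {
  sqrt(mean((bm - my)^2))
}

# Made-up probabilities for one match, in the order W, D, L.
bm <- c(W = 0.50, D = 0.25, L = 0.25)
my <- c(W = 0.40, D = 0.30, L = 0.30)

wdl_rmse(bm, my)   # about 0.071
```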
I realized that my predictions are less ‘aggressive’ than those of the bookmaker, especially for teams such as Manchester City, Liverpool, or Arsenal. Then I noticed another match where the difference is visible as well:
Brighton’s team value is smaller than West Ham’s. It gives advantage to the away team.
But Brighton is playing at home, and it benefits them.
The bookmaker strongly favors Brighton, giving them a more than 50% chance of winning,
while my prediction is more conservative, with a home win probability of about 4 in 10.
There must be a reason why Brighton is ‘weighted heavier’ in the eyes of the bookmaker.
If you look at the English Premier League table, one likely reason is the difference in current table positions: Brighton is higher in the table and has earned more points than West Ham:
EPL table as of 26.02.2023
Generally, all the teams that the bookmaker trusted were at the top of the table. For clubs such as Liverpool or Manchester City this is no surprise, which is why it is not very noticeable. But Arsenal and Brighton are performing significantly above expectations this season. Therefore, capturing this with an additional predictor seems reasonable.
The team value data refer to the theoretical, potential strength of a team.
To show the current, actual strength, let’s use the data already provided by the organizers. We calculate the cumulative points of each team within the current season before every match.
For each match, an additional predictor Point_STD_Diff is added:
Point_STD_Diff = Point_STD_Home - Point_STD_Away
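The cumulative-points step can be sketched in base R as follows. The data frame, its column names, and the results are made up for illustration, and the sketch covers only the running points total before each match, not the exact definition of Point_STD, which is not shown here:

```r
# Made-up results for two teams in one season, in chronological order.
results <- data.frame(
  season = "2022-23",
  team   = c("A", "B", "A", "B", "A"),
  points = c(3, 0, 1, 1, 3)
)

# Points accumulated before each match: running total minus the current match.
results$points_before <- ave(
  results$points, results$season, results$team,
  FUN = function(p) cumsum(p) - p
)

results
```

Subtracting the current match keeps the predictor free of information from the match being predicted.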
Then we build the same models as before, but using two predictors.
Lin_Model_HS <- train(HS~Value_Ratio+Point_STD_Diff,data=TS_Value_Complete,method="lm",trControl=trainControl(method="cv",number=5))
Lin_Model_AS <- train(AS~Value_Ratio+Point_STD_Diff,data=TS_Value_Complete,method="lm",trControl=trainControl(method="cv",number=5))
Log_Model_WDL <- multinom(WDL ~ Value_Ratio + Point_STD_Diff, data = TS_Value_Complete)
The three new models are slightly better according to RMSE and residual deviance:
## Linear Regression
##
## 122914 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 98331, 98331, 98332, 98332, 98330
## Resampling results:
##
## RMSE Rsquared MAE
## 1.19578 0.0446394 0.9590288
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Linear Regression
##
## 122914 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 98332, 98331, 98331, 98332, 98330
## Resampling results:
##
## RMSE Rsquared MAE
## 1.077327 0.0318632 0.8389198
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Call:
## multinom(formula = WDL ~ Value_Ratio + Point_STD_Diff, data = TS_Value_Complete)
##
## Coefficients:
## (Intercept) Value_Ratio Point_STD_Diff
## L 0.008661911 -0.1200493 -0.01608641
## W 0.441590931 0.1565452 0.01631347
##
## Residual Deviance: 257039.7
## AIC: 257051.7
Additional validation is performed against the bookmaker’s predictions. The results are more promising:
Numerically, the average deviation (RMSE) with WDL ~ Value_Ratio is:
## [1] 0.08398743
The average deviation (RMSE_P) with WDL ~ Value_Ratio + Point_STD_Diff is:
## [1] 0.07443581
After these checks, I keep the new models and use them for predictions:
Another challenge I tackled was deciding which exact score to predict.
Here is the distribution of most likely outcomes:
These two pieces of data seem contradictory at first glance. On the one hand, the likeliest outcome is a home win. On the other, 1:1 dominates the exact scores with a 58% share. This means there are cases where we predicted W as the WDL outcome but 1:1 as the exact score. Or is this valid behavior, and are predicting exact scores and predicting WDL probabilities two separate tasks?
To test this, we calculate a probability-weighted average RMSE for each candidate exact score and compare the results.
I took three matches with various relative strengths between the teams:
The next step is to cross join this set of 13 scores with itself and calculate the RMSE for each pair. This gives us all possible combinations of an actual score and a predicted one (13 x 13 = 169 combinations).
RMSE shows how far apart two exact scores are:
RMSE = sqrt(((HS-prd_HS)^2+(AS-prd_AS)^2)/2)
An interesting observation: RMSE (1:0 vs 3:0) = 1.41, while RMSE (1:0 vs 2:1) = 1.00. They differ, although the total deviation in goals is the same in both cases (2).
This means it’s better to spread the deviation across both HS and AS (the 1:0 vs 2:1 case) than to concentrate it in one component (the 1:0 vs 3:0 case). Let’s remember this for the conclusions below.
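The two values quoted above, and the cross join itself, can be reproduced with a short base-R sketch. The helper name score_rmse and the three-score set are illustrative assumptions; the original set contains 13 scores:

```r
# Hypothetical helper implementing the RMSE formula for two exact scores.
score_rmse <- function(act_hs, act_as, prd_hs, prd_as) {
  sqrt(((act_hs - prd_hs)^2 + (act_as - prd_as)^2) / 2)
}

score_rmse(1, 0, 3, 0)   # about 1.41: both goals of deviation on one side
score_rmse(1, 0, 2, 1)   # 1: deviation spread over both sides

# Cross join of a small, made-up set of scores with itself.
actual    <- data.frame(HS = c(1, 2, 1), AS = c(0, 1, 1))
predicted <- setNames(actual, c("prd_HS", "prd_AS"))
pairs     <- merge(actual, predicted, by = NULL)   # Cartesian product

pairs$RMSE <- score_rmse(pairs$HS, pairs$AS, pairs$prd_HS, pairs$prd_AS)
nrow(pairs)   # 3 x 3 = 9 combinations
```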
Now we group the data by the predicted scores and calculate weighted average values for RMSE. The smaller the value, the better score selection it is.
RMSE_avgw_ARSBOU = sum(Prob_ARSBOU*RMSE),
RMSE_avgw_LIVMUN = sum(Prob_LIVMUN*RMSE),
RMSE_avgw_EVEAVL = sum(Prob_EVEAVL*RMSE)
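The weighting step can be illustrated on a single match with a deliberately tiny, made-up distribution over three actual scores. This is a sketch under those assumptions, not the original pipeline, and all names in it are illustrative:

```r
# Made-up probabilities of three actual scores for one match (they sum to 1).
scores <- data.frame(HS = c(1, 1, 0), AS = c(0, 1, 1), prob = c(0.4, 0.3, 0.3))
n <- nrow(scores)

# Every combination of an actual score and a predicted score.
pairs <- expand.grid(actual = seq_len(n), predicted = seq_len(n))
pairs$rmse <- sqrt(((scores$HS[pairs$actual] - scores$HS[pairs$predicted])^2 +
                    (scores$AS[pairs$actual] - scores$AS[pairs$predicted])^2) / 2)

# Weight each RMSE by the probability of the actual score, then sum per prediction.
weighted <- scores$prob[pairs$actual] * pairs$rmse
expected <- as.vector(tapply(weighted, pairs$predicted, sum))
names(expected) <- paste(scores$HS, scores$AS, sep = ":")

round(expected, 3)
names(which.min(expected))   # "1:1"
```

In this toy case 1:0 is the single likeliest score, yet predicting 1:1 yields the lowest weighted RMSE.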
These are separate results for each match, sorted by the best score:
As you can see, the best score for Arsenal - Bournemouth is 2:0, closely followed by 2:1 (predicted by the model).
For both Liverpool - Manchester United and Everton - Aston Villa, the best score is 1:1 (also predicted by the model).
So our hypothesis is confirmed: 1:1 can be the score with the lowest expected RMSE, even when the most probable outcome is W.
The explanation is as follows. In the competition, we are ultimately scored on the total/average RMSE across all the matches, not on the number of exact scores predicted correctly.
Another observation is that 1:1 is the score that lies ‘in the middle’. This position dampens the deviation: even if the exact score is wrong many times, the resulting RMSE will be lower.
The prediction table for exact scores:
In order to predict exact scores and probabilities for win/draw/loss, I combined the data provided by the organizers and the data on team values from Transfermarkt.
After data harmonization, the predictive models were created, tested and used.
The most challenging parts of the whole exercise were: