Predicting Number of Crosses a Player Will Create in a Football Game

1 Introduction
2 Feature Selection
3 Feature Engineering
4 Response Variable
5 Modeling
6 Model Evaluation
7 Conclusion

1 Introduction

There are a lot of stuffs happen in every football match, a lot of statistics and data can be generated from just a single game. In this article, I will focus on the number of crosses, where I will try to create a model that predicts the number of crosses a player will create in a game based on various features. This article is part of Coding Challenge from CGrow Research.

Here is the data we will deal with, splitted into training set and testing set

data <- readRDS("../data/AllDt.rds")
TrainDate <- data$Date < "2020-12-01"
TrainDt <- data[TrainDate, ]
TestDt <- data[!TrainDate, ]
TrainDt[grep("[0-9]", TrainDt$Player), ]$Player <- "Reece James"
TrainDt

2 Feature Selection

There are 5381 observations and 15 features in our training set. Some of the columns are redundant and/or will give less to no impact to our model performance, so we’ll have to remove them.

PlyID : The unique identifier for each player. We can either use this or the Player column, and for this analysis I will use the Player column, thus we can remove the PlyID column
Team : Most of the teams were managed by the same manager throughout the span. Watford were the only team that had changed their manager mid-season. Also it makes more sense to predict number of Crosses based on who the Manager is, since the manager was the one who arranged the tactical approach of a team.

TrainDt[!duplicated(TrainDt[, c(4, 7)]), ] %>% select(Team, Manager) %>% group_by(Team) %>% 
    summarise(total_manager = n()) %>% filter(total_manager > 1)

Competition : All of the observations come from the same competition (EPL)

table(TrainDt$Competition)


 EPL 
5381

Team.Possession, Team.Goals, Opponent.Goals, Minutes
These columns can only be attained post-match or after the match has ended, while predictions are always done pre-match. Thus we can remove these columns as well
HmAw
This column is very insignificant. We can prove it by using the Chi-Square Test. If the p-value is smaller than 0.05, we can reject the hypothesis that the column is insignificant

chisq.test(TrainDt$Crosses, TrainDt$HmAw)


    Pearson's Chi-squared test

data:  TrainDt$Crosses and TrainDt$HmAw
X-squared = 14.167, df = 14, p-value = 0.4373

With p-value larger than 0.05, we fail to reject the hypothesis, thus meaning that the variable is indeed insignificant

As a comparison, WinProb column is significant, as shown by the p-value from this Chi-Square test

chisq.test(TrainDt$Crosses, TrainDt$WinProb)


    Pearson's Chi-squared test

data:  TrainDt$Crosses and TrainDt$WinProb
X-squared = 3116.9, df = 2968, p-value = 0.02811

Let’s now remove these variables from our data

TrainDt <- TrainDt %>% select(-c(PlyID, Team, Competition, Team.Possession, Team.Goals, 
    Opponent.Goals, HmAw, Minutes))
TrainDt

Our data now consists of only 7 features, which we will analyze further.

3 Feature Engineering

Feature Engineering is a process of using domain knowledge to extract features from existing raw data to get more useful columns

In machine learning, it is not common to have a column with lots of levels. Usually these columns are either removed or engineered into new columns with fewer levels

print(length(unique(TrainDt$Player)))

[1] 555

print(length(unique(TrainDt$Manager)))

[1] 24

print(length(unique(TrainDt$Opponent)))

[1] 23

Here in this data we have 555 different players, 24 different managers, and 23 different opposition teams. A feature engineering method we can use is splitting the data based on the Crosses column. So for example, we’re going to split the players into different levels, with each specifying how frequent a player creates crosses.

3.1 Player

The first column is the Player column. We set an assumption that certain players like to create crosses

Player.Crosses <- TrainDt %>% group_by(Date, Player, Opponent) %>% summarise(total_crosses = sum(Crosses)) %>% 
    group_by(Player) %>% summarise(Cross.Per.Game = mean(total_crosses), Games = n()) %>% 
    arrange(-Cross.Per.Game)

Player.Crosses

Here we can see how many crosses a player create per game. We can also see a bit of bias for the players who has only played a few games. What we can do is setting a filter that every player should play at least n games, and here I’ll set the n to 5

Player.Crosses <- Player.Crosses %>% filter(Games >= 5)

Then we want to split the data into classes. The most efficient way is by using quantile to split the data into 4 different levels

q <- quantile(Player.Crosses$Cross.Per.Game)
q

       0%       25%       50%       75%      100% 
0.0000000 0.1000000 0.4285714 1.3480392 5.1250000

So ultimately we will have a list of players, each with their level of crossing frequency

Player.Crosses <- Player.Crosses %>% mutate(Player.CF = case_when(Cross.Per.Game >= 
    q[1] & Cross.Per.Game <= q[2] ~ 1, Cross.Per.Game > q[2] & Cross.Per.Game <= 
    q[3] ~ 2, Cross.Per.Game > q[3] & Cross.Per.Game <= q[4] ~ 3, Cross.Per.Game > 
    q[4] & Cross.Per.Game <= q[5] ~ 4))

Player.Crosses

Finally we can merge this to our original data

TrainDt <- TrainDt %>% merge(Player.Crosses %>% select(Player, Player.CF))
TrainDt

3.2 Manager

It’s sensible to include this column, because some coaches do use crossing as their attacking approach.

Similar to the Player column, we will divide the Manager into different levels based on their crosses frequency. We’ll also filter the data with at least 5 games played.

Manager.Crosses <- TrainDt %>% group_by(Date, Manager) %>% summarise(total_crosses = sum(Crosses)) %>% 
    group_by(Manager) %>% summarise(Cross.Per.Game = mean(total_crosses), Games = n()) %>% 
    filter(Games > 5) %>% arrange(-Cross.Per.Game)

Then we split by quantile

q <- quantile(Manager.Crosses$Cross.Per.Game)

Manager.Crosses <- Manager.Crosses %>% mutate(Manager.CF = case_when(Cross.Per.Game >= 
    q[1] & Cross.Per.Game <= q[2] ~ 1, Cross.Per.Game > q[2] & Cross.Per.Game <= 
    q[3] ~ 2, Cross.Per.Game > q[3] & Cross.Per.Game <= q[4] ~ 3, Cross.Per.Game > 
    q[4] & Cross.Per.Game <= q[5] ~ 4))

Manager.Crosses

Finally we append this to our data

TrainDt <- TrainDt %>% merge(Manager.Crosses %>% select(Manager, Manager.CF))

3.3 Opponent

Some opponent tends to concede a lot of crosses. For example, if they play with a back three instead of the classic 4 defenders, thus no wing-backs to prevent crosses to the box.

We’ll also apply the similar approach as the previous two columns to the Opponent column

Opponent.Crosses <- TrainDt %>% group_by(Date, Opponent) %>% summarise(cross_conceded = sum(Crosses)) %>% 
    group_by(Opponent) %>% summarise(Cross.Conceded.PG = mean(cross_conceded), Games = n()) %>% 
    filter(Games >= 5) %>% arrange(-Cross.Conceded.PG)

q <- quantile(Opponent.Crosses$Cross.Conceded.PG)

Opponent.Crosses <- Opponent.Crosses %>% mutate(Opponent.CF = case_when(Cross.Conceded.PG >= 
    q[1] & Cross.Conceded.PG <= q[2] ~ 1, Cross.Conceded.PG > q[2] & Cross.Conceded.PG <= 
    q[3] ~ 2, Cross.Conceded.PG > q[3] & Cross.Conceded.PG <= q[4] ~ 3, Cross.Conceded.PG > 
    q[4] & Cross.Conceded.PG <= q[5] ~ 4))

Opponent.Crosses

TrainDt <- TrainDt %>% merge(Opponent.Crosses %>% select(Opponent, Opponent.CF))

3.4 PlyPosition

There are a lot of positions in football, but crosses can only be done from the wing. It’s a good assumption to say that wing players (left back, right back, left winger, etc) tend to create more crosses. Here we can split the PlyPosition column to a boolean which specifies whether the player plays on the wing or not, by observing the pattern. Wing Player has either “L” or “R” in their position, which defines “Left” and “Right” respectively.

for (i in 1:nrow(TrainDt)) {
    if (length(grep("L|R", TrainDt$PlyPosition[i])) > 0) {
        TrainDt$Wing.Player[i] = "Yes"
    } else {
        TrainDt$Wing.Player[i] = "No"
    }
}

Finally we can tidy the data a bit by removing the unused columns, change the data type, and arrange the order

TrainDt <- TrainDt %>% select(-c(Date, Manager, Opponent, Player, PlyPosition)) %>% 
    mutate(Player.CF = as.factor(Player.CF), Manager.CF = as.factor(Manager.CF), 
        Opponent.CF = as.factor(Opponent.CF), Wing.Player = as.factor(Wing.Player)) %>% 
    select(Manager.CF, Opponent.CF, WinProb, Player.CF, Wing.Player, Crosses)

And here is our final data after all of the pre-processing

TrainDt

Manager.CF : The frequency level of a manager to play with crossing approach
Opponent.CF : The frequency level of the opponent to concede crosses
WinProb : The probability of the team to win (based on betting sites)
Player.CF : The frequency level of a player to create crosses
Wing.Player : Whether the player plays on the wing or not

4 Response Variable

After dealing with the features (independent) variables, now we continue to the response (dependent) variable.

We can check the distribution to see the closest fit probability distribution

table(TrainDt$Crosses)


   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
3065  940  481  252  142   90   61   23   20    9    6    5    1    2    1

By doing quick observation, we can see that the frequency decreasing as the number of crosses increases. A better view would be using graph

library(ggplot2)
data.dist <- as.data.frame(spline(as.data.frame(table(TrainDt$Crosses))))
ggplot(data.dist, aes(x, y)) + geom_line(col = "skyblue", size = 2) + theme_minimal()

The closest distribution is the Zero-Inflated Poisson Distribution, especially since we have over 60% of zeros in our data

prop.table(table(TrainDt$Crosses)) * 100


          0           1           2           3           4           5 
60.12161632 18.43860337  9.43507258  4.94311495  2.78540604  1.76539820 
          6           7           8           9          10          11 
 1.19654767  0.45115732  0.39231071  0.17653982  0.11769321  0.09807768 
         12          13          14 
 0.01961554  0.03923107  0.01961554

5 Modeling

Before we fit the model, we’ll have to apply the same preprocessing to our testing set

TestDt <- TestDt %>% select(-c(PlyID, Team, Competition, Team.Possession, Team.Goals, 
    Opponent.Goals, HmAw, Minutes)) %>% merge(Player.Crosses %>% select(Player, Player.CF)) %>% 
    merge(Manager.Crosses %>% select(Manager, Manager.CF)) %>% merge(Opponent.Crosses %>% 
    select(Opponent, Opponent.CF))

for (i in 1:nrow(TestDt)) {
    if (length(grep("L|R", TestDt$PlyPosition[i])) > 0) {
        TestDt$Wing.Player[i] = "Yes"
    } else {
        TestDt$Wing.Player[i] = "No"
    }
}

TestDt <- TestDt %>% select(-c(Date, Manager, Opponent, Player, PlyPosition)) %>% 
    mutate(Player.CF = as.factor(Player.CF), Manager.CF = as.factor(Manager.CF), 
        Opponent.CF = as.factor(Opponent.CF), Wing.Player = as.factor(Wing.Player)) %>% 
    select(Manager.CF, Opponent.CF, WinProb, Player.CF, Wing.Player, Crosses)

And now we can fit the model

5.1 Linear Regression

lm <- lm(Crosses ~ ., TrainDt)
summary(lm)


Call:
lm(formula = Crosses ~ ., data = TrainDt)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0341 -0.4960 -0.1103  0.2823 10.9659 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.43871    0.05892  -7.446 1.12e-13 ***
Manager.CF2     0.11826    0.04712   2.510 0.012108 *  
Manager.CF3     0.19707    0.05266   3.742 0.000185 ***
Manager.CF4     0.23168    0.04809   4.817 1.50e-06 ***
Opponent.CF2    0.19399    0.05226   3.712 0.000208 ***
Opponent.CF3    0.29896    0.05128   5.829 5.91e-09 ***
Opponent.CF4    0.36562    0.05229   6.992 3.06e-12 ***
WinProb         0.25960    0.10288   2.523 0.011655 *  
Player.CF2      0.21302    0.04843   4.399 1.11e-05 ***
Player.CF3      0.58434    0.05055  11.559  < 2e-16 ***
Player.CF4      1.87671    0.05650  33.213  < 2e-16 ***
Wing.PlayerYes  0.81837    0.04752  17.221  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.21 on 5086 degrees of freedom
Multiple R-squared:  0.4139,    Adjusted R-squared:  0.4127 
F-statistic: 326.6 on 11 and 5086 DF,  p-value: < 2.2e-16

car::vif(lm)

                GVIF Df GVIF^(1/(2*Df))
Manager.CF  1.104929  3        1.016769
Opponent.CF 1.341855  3        1.050230
WinProb     1.355819  1        1.164396
Player.CF   1.574134  3        1.078550
Wing.Player 1.520737  1        1.233181

Model Summary:
- Median residuals close to 0
- Every variable is significant
- Adjusted R-Squared of 0.4
- There are no multicollinearity between the variables

5.2 Zero-Inflated Poisson Regression

zip <- zeroinfl(Crosses ~ ., TrainDt, dist = "poisson")
round(summary(zip)$coefficients$count, 8)

                 Estimate Std. Error   z value   Pr(>|z|)
(Intercept)    -4.5013221 0.75793827 -5.938903 0.00000000
Manager.CF2     0.1717809 0.06345707  2.707041 0.00678858
Manager.CF3     0.2926650 0.06412026  4.564314 0.00000501
Manager.CF4     0.2747281 0.06332743  4.338216 0.00001436
Opponent.CF2    0.1604293 0.05956921  2.693158 0.00707787
Opponent.CF3    0.2763767 0.05629592  4.909356 0.00000091
Opponent.CF4    0.3461180 0.05616903  6.162078 0.00000000
WinProb         0.4170368 0.09912832  4.207040 0.00002587
Player.CF2      2.6501186 0.77952078  3.399677 0.00067466
Player.CF3      3.7596260 0.72059526  5.217389 0.00000018
Player.CF4      4.6277221 0.72221932  6.407641 0.00000000
Wing.PlayerYes  0.4103145 0.04172570  9.833616 0.00000000

We can see that every variable is significant, proven by the p-values below 0.05

5.3 Other Models

K-Nearest Neighbor

knn <- knnreg(Crosses ~ ., TrainDt, k = round(sqrt(nrow(TrainDt))))

Decision Tree

dt <- ctree(Crosses ~ ., TrainDt)

6 Model Evaluation

We’ve created four different models, and now we can compare the performance of each. This function will return the Root Mean Square Error, Mean Absolute Error, and R-Squared from each model.

eval.model <- function(model) {
    rmse <- MLmetrics::RMSE(predict(model, TestDt), TestDt$Crosses)
    mae <- MLmetrics::MAE(predict(model, TestDt), TestDt$Crosses)
    r2 <- MLmetrics::R2_Score(predict(model, TestDt), TestDt$Crosses)
    return(c(rmse, mae, r2))
}

lm.eval <- eval.model(lm)
zip.eval <- eval.model(zip)
knn.eval <- eval.model(knn)
dt.eval <- eval.model(dt)
data.frame(Model = c("Linear Regression", "Zero-Inflated Poisson Regression", "K-Nearest Neighbor", 
    "Decision Tree"), RMSE = c(lm.eval[1], zip.eval[1], knn.eval[1], dt.eval[1]), 
    MAE = c(lm.eval[2], zip.eval[2], knn.eval[2], dt.eval[2]), R.Squared = c(lm.eval[3], 
        zip.eval[3], knn.eval[3], dt.eval[3]))

As we can see, Zero-Inflated Poisson Regression is the best model, with better performance in all RMSE, MAE, and R-Squared. This model is also very interpretable, which means we can specify which columns improve the performance the most and which columns are less significant.

7 Conclusion

7.1 Best Model

The Zero-Inflation Poisson Regression is the best performing model, because the response variable is a count, and most of the values are zeros

7.2 Model Evaluation

With MAE around 0.8, when we’re choosing one random observation, we’d expect the prediction to be around 0.8 lower or higher, which is a decent performance.

7.3 Coefficients

Since we’re using regression, we’re able to get the coefficients value from each feature and the p-value as well. A p-value below 0.05 signifies a useful column. The lower the p-value is, the more significant the feature is.

zip.info <- as.data.frame(summary(zip)$coefficients$count)
zip.info$column_name <- rownames(zip.info)
colnames(zip.info) <- c("Coefficients", "Standard_Error", "Z-Value", "p-value", "Feature")
zip.info <- zip.info %>% select(Feature, Coefficients, "p-value")
zip.info[order(zip.info$`p-value`), ]

As we can see, the most important feature is Wing.Player, or whether the player plays on the wing or not such as left back and right wing forward. The second most important feature is Player.CF4, which specifies players with high interest to create crosses such as Reece James and Trent Alexander-Arnold. And the third most important feature is Opponent.CF4, which specifies opposition teams who tend to concede crosses such as Norwich and Burnley.

In conclusion, since all of the variables are statistically significant, we can say that the number of crosses created is highly influenced by the player himself, the manager, the opposition team, the win probability, and whether the player plays on the wing or not.