Predicting Number of Crosses a Player Will Create in a Football Game
1 Introduction
Every football match generates a huge amount of statistics and data. In this article I will focus on the number of crosses: I will try to build a model that predicts the number of crosses a player will create in a game, based on various features. This article is part of a Coding Challenge from CGrow Research.
Here is the data we will work with, split into a training set and a testing set:
data <- readRDS("../data/AllDt.rds")
# Matches played before 2020-12-01 form the training set; the rest form the testing set
TrainDate <- data$Date < "2020-12-01"
TrainDt <- data[TrainDate, ]
TestDt <- data[!TrainDate, ]
# Player names containing digits are mis-recorded entries for Reece James
TrainDt[grep("[0-9]", TrainDt$Player), ]$Player <- "Reece James"
TrainDt
2 Feature Selection
There are 5381 observations and 15 features in our training set. Some of the columns are redundant and/or will have little to no impact on model performance, so we'll have to remove them.
- PlyID : The unique identifier for each player. We can use either this or the Player column; for this analysis I will use the Player column, so we can remove PlyID.
- Team : Most of the teams were managed by the same manager throughout the span; Watford were the only team that changed manager mid-season. It also makes more sense to predict the number of crosses based on who the Manager is, since the manager is the one who sets the team's tactical approach.
TrainDt[!duplicated(TrainDt[, c(4, 7)]), ] %>% select(Team, Manager) %>% group_by(Team) %>%
    summarise(total_manager = n()) %>% filter(total_manager > 1)
- Competition : All of the observations come from the same competition (EPL)
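This can be verified with a quick count; the call isn't shown in the original, so table() on the Competition column is an assumption consistent with the output below:
table(TrainDt$Competition)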
EPL
5381
- Team.Possession, Team.Goals, Opponent.Goals, Minutes : These columns can only be obtained after a match has ended, while predictions are always made pre-match, so we can remove them as well
- HmAw : This column appears insignificant. We can check this with a Chi-Square test of independence: the null hypothesis is that HmAw and Crosses are independent (i.e. the column is insignificant), so a p-value smaller than 0.05 would let us reject that hypothesis
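The output below would come from a call like this (the test code isn't shown in the original):
chisq.test(TrainDt$Crosses, TrainDt$HmAw)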
Pearson's Chi-squared test
data: TrainDt$Crosses and TrainDt$HmAw
X-squared = 14.167, df = 14, p-value = 0.4373
With a p-value larger than 0.05, we fail to reject the null hypothesis, which suggests the variable is indeed insignificant.
As a comparison, the WinProb column is significant, as shown by the p-value from this Chi-Square test:
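Again, presumably via the same call:
chisq.test(TrainDt$Crosses, TrainDt$WinProb)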
Pearson's Chi-squared test
data: TrainDt$Crosses and TrainDt$WinProb
X-squared = 3116.9, df = 2968, p-value = 0.02811
Let’s now remove these variables from our data
TrainDt <- TrainDt %>% select(-c(PlyID, Team, Competition, Team.Possession, Team.Goals,
Opponent.Goals, HmAw, Minutes))
TrainDt
Our data now consists of only 7 features, which we will analyze further.
3 Feature Engineering
Feature Engineering is the process of using domain knowledge to extract more useful columns from existing raw data.
In machine learning, it is not ideal to have a column with many levels. Usually such columns are either removed or engineered into new columns with fewer levels.
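The level counts shown below could be produced with, for example:
length(unique(TrainDt$Player))
length(unique(TrainDt$Manager))
length(unique(TrainDt$Opponent))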
[1] 555
[1] 24
[1] 23
In this data we have 555 different players, 24 different managers, and 23 different opposition teams. One feature engineering method we can use is binning based on the Crosses column. For example, we will split the players into levels, each specifying how frequently a player creates crosses.
3.1 Player
The first column is the Player column. We work from the assumption that certain players tend to create crosses more than others.
Player.Crosses <- TrainDt %>% group_by(Date, Player, Opponent) %>% summarise(total_crosses = sum(Crosses)) %>%
group_by(Player) %>% summarise(Cross.Per.Game = mean(total_crosses), Games = n()) %>%
arrange(-Cross.Per.Game)
Player.Crosses
Here we can see how many crosses a player creates per game. We can also see some bias for players who have only played a few games. To handle this, we can require that every player has played at least n games; here I'll set n to 5.
Then we want to split the data into classes. The simplest way is to use quantile() to split the data into 4 levels.
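The filter and quantile steps aren't shown in the original; a minimal sketch, following the same pattern as the Manager and Opponent sections below:
Player.Crosses <- Player.Crosses %>% filter(Games >= 5)  # keep players with at least 5 games
q <- quantile(Player.Crosses$Cross.Per.Game)  # quartile breakpoints
q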
0% 25% 50% 75% 100%
0.0000000 0.1000000 0.4285714 1.3480392 5.1250000
So ultimately we will have a list of players, each with their level of crossing frequency
Player.Crosses <- Player.Crosses %>% mutate(Player.CF = case_when(Cross.Per.Game >=
q[1] & Cross.Per.Game <= q[2] ~ 1, Cross.Per.Game > q[2] & Cross.Per.Game <=
q[3] ~ 2, Cross.Per.Game > q[3] & Cross.Per.Game <= q[4] ~ 3, Cross.Per.Game >
q[4] & Cross.Per.Game <= q[5] ~ 4))
Player.Crosses
Finally we can merge this into our original data.
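The merge code isn't shown; presumably it mirrors the test-set preprocessing in Section 5:
TrainDt <- TrainDt %>% merge(Player.Crosses %>% select(Player, Player.CF))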
3.2 Manager
It’s sensible to include this column, because some coaches do use crossing as their attacking approach.
Similar to the Player column, we will divide the managers into levels based on their crossing frequency, again filtering for managers with more than 5 games played.
Manager.Crosses <- TrainDt %>% group_by(Date, Manager) %>% summarise(total_crosses = sum(Crosses)) %>%
group_by(Manager) %>% summarise(Cross.Per.Game = mean(total_crosses), Games = n()) %>%
filter(Games > 5) %>% arrange(-Cross.Per.Game)
Then we split by quantile:
q <- quantile(Manager.Crosses$Cross.Per.Game)
Manager.Crosses <- Manager.Crosses %>% mutate(Manager.CF = case_when(Cross.Per.Game >=
q[1] & Cross.Per.Game <= q[2] ~ 1, Cross.Per.Game > q[2] & Cross.Per.Game <=
q[3] ~ 2, Cross.Per.Game > q[3] & Cross.Per.Game <= q[4] ~ 3, Cross.Per.Game >
q[4] & Cross.Per.Game <= q[5] ~ 4))
Manager.Crosses
Finally we append this to our data.
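As before, the append step isn't shown; a sketch mirroring the test-set preprocessing:
TrainDt <- TrainDt %>% merge(Manager.Crosses %>% select(Manager, Manager.CF))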
3.3 Opponent
Some opponents tend to concede a lot of crosses, for example if they play with a back three instead of the classic four defenders and thus have no full-backs to prevent crosses into the box.
We'll apply the same approach as for the previous two columns to the Opponent column.
Opponent.Crosses <- TrainDt %>% group_by(Date, Opponent) %>% summarise(cross_conceded = sum(Crosses)) %>%
group_by(Opponent) %>% summarise(Cross.Conceded.PG = mean(cross_conceded), Games = n()) %>%
filter(Games >= 5) %>% arrange(-Cross.Conceded.PG)
q <- quantile(Opponent.Crosses$Cross.Conceded.PG)
Opponent.Crosses <- Opponent.Crosses %>% mutate(Opponent.CF = case_when(Cross.Conceded.PG >=
q[1] & Cross.Conceded.PG <= q[2] ~ 1, Cross.Conceded.PG > q[2] & Cross.Conceded.PG <=
q[3] ~ 2, Cross.Conceded.PG > q[3] & Cross.Conceded.PG <= q[4] ~ 3, Cross.Conceded.PG >
q[4] & Cross.Conceded.PG <= q[5] ~ 4))
Opponent.Crosses
3.4 PlyPosition
There are many positions in football, but crosses can only be delivered from the wings, so it's a reasonable assumption that wide players (left back, right back, left winger, etc.) tend to create more crosses. We can therefore convert the PlyPosition column into a boolean specifying whether the player plays on the wing, based on a simple pattern: wide players have either "L" or "R" in their position, denoting "Left" and "Right" respectively.
# Wide positions contain "L" or "R" (e.g. LB, RW), so flag those players as wing players
TrainDt$Wing.Player <- ifelse(grepl("L|R", TrainDt$PlyPosition), "Yes", "No")
Finally we can tidy the data a bit by removing the unused columns, changing the data types, and arranging the column order:
TrainDt <- TrainDt %>% select(-c(Date, Manager, Opponent, Player, PlyPosition)) %>%
mutate(Player.CF = as.factor(Player.CF), Manager.CF = as.factor(Manager.CF),
Opponent.CF = as.factor(Opponent.CF), Wing.Player = as.factor(Wing.Player)) %>%
select(Manager.CF, Opponent.CF, WinProb, Player.CF, Wing.Player, Crosses)
And here is our final data after all of the pre-processing:
- Manager.CF : The frequency level at which a manager plays with a crossing approach
- Opponent.CF : The frequency level at which the opponent concedes crosses
- WinProb : The probability of the team winning (based on betting sites)
- Player.CF : The frequency level at which a player creates crosses
- Wing.Player : Whether the player plays on the wing or not
4 Response Variable
After dealing with the feature (independent) variables, we now move on to the response (dependent) variable.
We can check its distribution to find the closest-fitting probability distribution.
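The frequency table below is presumably produced by:
table(TrainDt$Crosses)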
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
3065 940 481 252 142 90 61 23 20 9 6 5 1 2 1
From a quick observation, we can see that the frequency decreases as the number of crosses increases. A graph gives a better view:
library(ggplot2)
data.dist <- as.data.frame(spline(as.data.frame(table(TrainDt$Crosses))))
ggplot(data.dist, aes(x, y)) + geom_line(col = "skyblue", size = 2) + theme_minimal()
The closest distribution is the Zero-Inflated Poisson distribution, especially since we have over 60% of zeros in our data.
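The percentage of each count, shown below, can be obtained with:
prop.table(table(TrainDt$Crosses)) * 100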
0 1 2 3 4 5
60.12161632 18.43860337 9.43507258 4.94311495 2.78540604 1.76539820
6 7 8 9 10 11
1.19654767 0.45115732 0.39231071 0.17653982 0.11769321 0.09807768
12 13 14
0.01961554 0.03923107 0.01961554
5 Modeling
Before we fit the model, we’ll have to apply the same preprocessing to our testing set
TestDt <- TestDt %>% select(-c(PlyID, Team, Competition, Team.Possession, Team.Goals,
Opponent.Goals, HmAw, Minutes)) %>% merge(Player.Crosses %>% select(Player, Player.CF)) %>%
merge(Manager.Crosses %>% select(Manager, Manager.CF)) %>% merge(Opponent.Crosses %>%
select(Opponent, Opponent.CF))
# Same wing-player flag as in the training set
TestDt$Wing.Player <- ifelse(grepl("L|R", TestDt$PlyPosition), "Yes", "No")
TestDt <- TestDt %>% select(-c(Date, Manager, Opponent, Player, PlyPosition)) %>%
mutate(Player.CF = as.factor(Player.CF), Manager.CF = as.factor(Manager.CF),
Opponent.CF = as.factor(Opponent.CF), Wing.Player = as.factor(Wing.Player)) %>%
select(Manager.CF, Opponent.CF, WinProb, Player.CF, Wing.Player, Crosses)
And now we can fit the models.
5.1 Linear Regression
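The fitting code isn't shown in the original, but the Call line in the summary below confirms the formula; a sketch (the GVIF table further down suggests car::vif):
lm <- lm(Crosses ~ ., data = TrainDt)  # object name matches the evaluation code in Section 6
summary(lm)
car::vif(lm)  # multicollinearity check, producing the GVIF table below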
Call:
lm(formula = Crosses ~ ., data = TrainDt)
Residuals:
Min 1Q Median 3Q Max
-3.0341 -0.4960 -0.1103 0.2823 10.9659
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.43871 0.05892 -7.446 1.12e-13 ***
Manager.CF2 0.11826 0.04712 2.510 0.012108 *
Manager.CF3 0.19707 0.05266 3.742 0.000185 ***
Manager.CF4 0.23168 0.04809 4.817 1.50e-06 ***
Opponent.CF2 0.19399 0.05226 3.712 0.000208 ***
Opponent.CF3 0.29896 0.05128 5.829 5.91e-09 ***
Opponent.CF4 0.36562 0.05229 6.992 3.06e-12 ***
WinProb 0.25960 0.10288 2.523 0.011655 *
Player.CF2 0.21302 0.04843 4.399 1.11e-05 ***
Player.CF3 0.58434 0.05055 11.559 < 2e-16 ***
Player.CF4 1.87671 0.05650 33.213 < 2e-16 ***
Wing.PlayerYes 0.81837 0.04752 17.221 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.21 on 5086 degrees of freedom
Multiple R-squared: 0.4139, Adjusted R-squared: 0.4127
F-statistic: 326.6 on 11 and 5086 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
Manager.CF 1.104929 3 1.016769
Opponent.CF 1.341855 3 1.050230
WinProb 1.355819 1 1.164396
Player.CF 1.574134 3 1.078550
Wing.Player 1.520737 1 1.233181
Model Summary:
- Median residual close to 0
- Every variable is significant
- Adjusted R-Squared of about 0.41
- There is no multicollinearity between the variables (all GVIF values are close to 1)
5.2 Zero-Inflated Poisson Regression
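The fitting code isn't shown either; a minimal sketch assuming pscl::zeroinfl, which matches how the coefficients are extracted in Section 7.3:
library(pscl)
# Zero-inflated Poisson: a Poisson count component plus a logit component for the excess zeros
zip <- zeroinfl(Crosses ~ ., data = TrainDt, dist = "poisson")
summary(zip)$coefficients$count  # the count-model coefficients shown below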
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.5013221 0.75793827 -5.938903 0.00000000
Manager.CF2 0.1717809 0.06345707 2.707041 0.00678858
Manager.CF3 0.2926650 0.06412026 4.564314 0.00000501
Manager.CF4 0.2747281 0.06332743 4.338216 0.00001436
Opponent.CF2 0.1604293 0.05956921 2.693158 0.00707787
Opponent.CF3 0.2763767 0.05629592 4.909356 0.00000091
Opponent.CF4 0.3461180 0.05616903 6.162078 0.00000000
WinProb 0.4170368 0.09912832 4.207040 0.00002587
Player.CF2 2.6501186 0.77952078 3.399677 0.00067466
Player.CF3 3.7596260 0.72059526 5.217389 0.00000018
Player.CF4 4.6277221 0.72221932 6.407641 0.00000000
Wing.PlayerYes 0.4103145 0.04172570 9.833616 0.00000000
We can see that every variable is significant, as shown by the p-values below 0.05.
5.3 Other Models
- K-Nearest Neighbor
- Decision Tree
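The code for these two models isn't shown in the original; a minimal sketch, assuming caret::knnreg for KNN regression and rpart::rpart for the decision tree, with object names matching the evaluation code in Section 6:
library(caret)
library(rpart)
knn <- knnreg(Crosses ~ ., data = TrainDt, k = 5)  # k = 5 is an assumed example value
dt <- rpart(Crosses ~ ., data = TrainDt, method = "anova")  # regression tree for a count response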
6 Model Evaluation
We've created four different models, and now we can compare their performance. This function returns the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R-Squared for a given model:
eval.model <- function(model) {
    rmse <- MLmetrics::RMSE(predict(model, TestDt), TestDt$Crosses)
    mae <- MLmetrics::MAE(predict(model, TestDt), TestDt$Crosses)
    r2 <- MLmetrics::R2_Score(predict(model, TestDt), TestDt$Crosses)
    return(c(rmse, mae, r2))
}
lm.eval <- eval.model(lm)
zip.eval <- eval.model(zip)
knn.eval <- eval.model(knn)
dt.eval <- eval.model(dt)
data.frame(Model = c("Linear Regression", "Zero-Inflated Poisson Regression", "K-Nearest Neighbor",
"Decision Tree"), RMSE = c(lm.eval[1], zip.eval[1], knn.eval[1], dt.eval[1]),
MAE = c(lm.eval[2], zip.eval[2], knn.eval[2], dt.eval[2]), R.Squared = c(lm.eval[3],
zip.eval[3], knn.eval[3], dt.eval[3]))
As we can see, Zero-Inflated Poisson Regression is the best model, with the best RMSE, MAE, and R-Squared. This model is also very interpretable: we can identify which columns improve performance the most and which are less significant.
7 Conclusion
7.1 Best Model
Zero-Inflated Poisson Regression is the best-performing model, which makes sense because the response variable is a count and most of its values are zeros.
7.2 Model Evaluation
With an MAE of around 0.8, for a randomly chosen observation we'd expect the prediction to be off by about 0.8 crosses in either direction, which is decent performance.
7.3 Coefficients
Since we're using regression, we can obtain the coefficient value and the p-value for each feature. A p-value below 0.05 signifies a useful column, and the lower the p-value, the more significant the feature.
zip.info <- as.data.frame(summary(zip)$coefficients$count)
zip.info$column_name <- rownames(zip.info)
colnames(zip.info) <- c("Coefficients", "Standard_Error", "Z-Value", "p-value", "Feature")
zip.info <- zip.info %>% select(Feature, Coefficients, "p-value")
zip.info[order(zip.info$`p-value`), ]
As we can see, the most important feature is Wing.Player, i.e. whether the player plays on the wing (such as a left back or a right wing forward). The second most important feature is Player.CF4, which identifies players with a strong tendency to create crosses, such as Reece James and Trent Alexander-Arnold. The third is Opponent.CF4, which identifies opposition teams that tend to concede crosses, such as Norwich and Burnley.
In conclusion, since all of the variables are statistically significant, we can say that the number of crosses created is highly influenced by the player himself, the manager, the opposition team, the win probability, and whether the player plays on the wing or not.