As one of the most popular sports in the world, football has long-standing traditions. However, for most of its history there was no good way to analyze its gameplay properly and, more importantly, efficiently.
Traditionally there are a few reasons for this:
Now that the 21st century has brought major advances in processing power, tagging and analyzing football matches has become a much more manageable task.
For our analysis today we will be working with one of the largest open collections of football logs, collected by Wyscout and hosted on Figshare.
This data contains all the spatio-temporal events (passes, shots, fouls, etc.) that occur during every match of an entire season in seven leagues/competitions:
For the analysis in this project we will limit the data to shots taken during the course of a game, to try to understand how a player's position on the field affects their probability of scoring.
For this analysis we will be using a subset of the Premier League data set.
| Variable | Description |
|---|---|
| id | Unique Event ID |
| matchid | Unique Match ID |
| matchPeriod | What half the event takes place in |
| eventSec | How many seconds into the match the event occurred |
| teamId | Unique Team ID |
| playerId | Unique Player ID |
| eventId | Event ID |
| eventName | Name of Event |
| subEventId | Subevent ID |
| subEventName | Subevent Name |
| y1 | Y coordinate of event |
| x1 | X coordinate of event |
| tags | Tags describing the event (in CSV) |
| Goal | Whether the event was a goal |
| Type | What type of shot the event was |
| Counter | If the event was on a counter attack |
| Target | Whether the shot was on target |
| Position | Where on goal the shot was aimed |
| Post | Whether the shot hit the post |
| Type | OnTargetPerc |
|---|---|
| Head | 0.3776371 |
| Right | 0.2985751 |
| Left | 0.2763419 |
As the previous table shows, the differences between the types of shots are not large, but on average headers are the most accurate type of shot.
This is probably because they often occur much closer to the goal than the other types.
We also see that on average right-footed shots are slightly more accurate than left-footed shots.
This graphic describes the best places to shoot the ball given the data we have. Each dot represents a different quadrant of the goal, and its color represents the percentage of shots aimed at that location that scored. The lighter the dot, the more likely you are to score.
Breakdown of actual shots versus goals.
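The on-target percentages in the table above amount to a grouped mean of `Target` by `Type`. As a minimal Python sketch of that calculation, assuming `Target` is coded as 0/1 and using a small made-up sample rather than the real shot data:

```python
from collections import defaultdict


def on_target_rate_by_type(shots):
    """Return the share of on-target shots for each shot type."""
    totals = defaultdict(int)
    on_target = defaultdict(int)
    for shot in shots:
        totals[shot["Type"]] += 1
        # Assumption: Target is 1 when the shot was on target, else 0.
        on_target[shot["Type"]] += shot["Target"]
    return {shot_type: on_target[shot_type] / totals[shot_type] for shot_type in totals}


# Toy rows for illustration only; these are not taken from the Wyscout data.
toy_shots = [
    {"Type": "Head", "Target": 1},
    {"Type": "Head", "Target": 0},
    {"Type": "Right", "Target": 1},
    {"Type": "Right", "Target": 0},
    {"Type": "Right", "Target": 0},
    {"Type": "Right", "Target": 0},
    {"Type": "Left", "Target": 0},
    {"Type": "Left", "Target": 1},
]

rates = on_target_rate_by_type(toy_shots)
```

Run on the full shot table, the same grouping would give one rate per shot type, comparable to the `OnTargetPerc` column.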
| Goal | Counts |
|---|---|
| Goal | 293 |
| Shot | 2731 |
If every shot were predicted to miss, what would the accuracy of the model be?
| Accuracy |
|---|
| 0.9031085 |
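This baseline follows directly from the counts in the shots-versus-goals table: predicting a miss every time is correct for every non-goal. A short Python check of the arithmetic, using only the two counts reported above:

```python
# Counts taken from the Goal/Shot table in this report.
goals = 293
misses = 2731

# Predicting "no goal" for every shot is right exactly when the shot missed.
baseline_accuracy = misses / (goals + misses)

print(f"Baseline accuracy: {baseline_accuracy:.7f}")
```

Any fitted model needs to beat this figure to be adding information beyond the class imbalance.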
Split the data into training and validation sets (0.7/0.3), then run backward stepwise regression.
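As a rough illustration of the split step only (the stepwise selection is not reproduced here), a seeded random 70/30 partition could look like the following Python sketch. The function name `train_validation_split`, the seed, and the toy rows are assumptions for illustration, not the code used to produce the output below.

```python
import random


def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and cut it into training and validation parts."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy row identifiers standing in for shot records.
rows = list(range(10))
train, validation = train_validation_split(rows)
```

Fixing the seed matters here because the reported training and test metrics depend on which shots land in each part.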
##
## Call:
## glm(formula = Goal ~ Target + y1 + eventName + x1, family = binomial,
## data = train.goal)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.369 0.000 0.000 0.000 4.009
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.49637 677.55979 -0.039 0.9688
## Target 33.75065 967.10794 0.035 0.9722
## y1 0.15354 0.02625 5.850 4.92e-09 ***
## eventNameShot -17.03762 690.08150 -0.025 0.9803
## x1 0.02342 0.01062 2.206 0.0274 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1155.7 on 1814 degrees of freedom
## Residual deviance: 619.7 on 1810 degrees of freedom
## AIC: 629.7
##
## Number of Fisher Scoring iterations: 20
## [1] 1.094576e-114
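The final value in the output, `1.094576e-114`, is not labeled. It appears consistent with a likelihood-ratio (chi-square) test of the fitted model against the intercept-only model, though that reading is an inference from the numbers rather than something the output states. A minimal Python sketch under that assumption, using the rounded deviances printed above and the closed-form chi-square tail for even degrees of freedom:

```python
import math

# Values copied from the glm summary above (rounded as printed).
null_deviance, null_df = 1155.7, 1814
residual_deviance, residual_df = 619.7, 1810

# Drop in deviance and the number of added parameters.
lr_statistic = null_deviance - residual_deviance
df = null_df - residual_df


def chi2_upper_tail_even_df(x, df):
    """Chi-square upper tail for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))


p_value = chi2_upper_tail_even_df(lr_statistic, df)
print(f"LR statistic = {lr_statistic:.1f} on {df} df, p = {p_value:.3e}")
```

With the rounded deviances this gives a p-value of roughly 1.09e-114, which agrees with the printed value to the precision the summary allows.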
| OptimalCutoff |
|---|
| 0.4299999 |
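To make the fitted model and the cutoff concrete, the Python sketch below plugs the reported coefficients into the logistic function and labels a shot as a predicted goal when its probability exceeds the cutoff (rounded here to 0.43). It assumes `Target` is coded 0/1 and that `eventNameShot` is 1 when `eventName` is "Shot". The coordinate values are hypothetical, since the report does not state the coordinate scale.

```python
import math

# Coefficients copied from the glm summary in this report.
COEF = {
    "intercept": -26.49637,
    "Target": 33.75065,
    "y1": 0.15354,
    "eventNameShot": -17.03762,
    "x1": 0.02342,
}
CUTOFF = 0.43  # reported optimal cutoff, rounded


def goal_probability(target, y1, x1, is_shot_event):
    """Logistic model: p = 1 / (1 + exp(-linear_predictor))."""
    linear = (
        COEF["intercept"]
        + COEF["Target"] * target
        + COEF["y1"] * y1
        + COEF["eventNameShot"] * is_shot_event
        + COEF["x1"] * x1
    )
    return 1.0 / (1.0 + math.exp(-linear))


def predict_goal(probability, cutoff=CUTOFF):
    """Classify as a goal when the probability exceeds the cutoff."""
    return probability > cutoff


# Hypothetical coordinates, chosen only to exercise the formula.
p_on_target = goal_probability(target=1, y1=90, x1=50, is_shot_event=1)
p_off_target = goal_probability(target=0, y1=90, x1=50, is_shot_event=1)
```

In this sketch the off-target probability is essentially zero, reflecting how strongly the `Target` term dominates the linear predictor.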
| AUC |
|---|
| 0.9279805 |
| | AUC | Accuracy |
|---|---|---|
| Training | 0.9358083 | 0.9163 |
| Test | 0.9279805 | 0.9041 |
With an overall accuracy of 90.41%, the model does only slightly better than blindly predicting a miss for every shot taken (90.31%). However, the model does show that a player's position on the pitch is a significant factor in whether a shot goes in.
This model leaves a little to be desired, but by incorporating more information about the teams, matches, and players, I believe the model could be equipped to achieve higher accuracy.