Introduction

Data Sources

As one of the most popular sports in the world, football has long-standing traditions. However, for most of its history there was no good way to analyze its gameplay properly and, more importantly, efficiently.

Traditionally there are a few reasons for this:

  • Continuous Game Play
  • Fluid Player Positioning
  • No Tracking Systems
  • Scope of the Field
  • Possession not always obvious

With the advances in processing power made in the 21st century, properly tagging and analyzing football matches has become a much more manageable task in recent years.

For our analysis today we will be working with one of the largest open collections of football logs, collected by Wyscout and hosted on Figshare.

This data contains all of the spatio-temporal events (passes, shots, fouls, etc.) that occur during every match of an entire season of seven leagues/competitions:

  • La Liga
  • Serie A
  • Bundesliga
  • Premier League
  • Ligue 1
  • FIFA World Cup 2018
  • UEFA Euro 2016

Planned Analysis

For the analysis in this project we will limit the data to shots taken during the course of a game, to try to understand how a player's position on the field affects their probability of scoring.

For this analysis we will be using a subset of the Premier League data set.

Data Dictionary

Variable       Description
id             Unique event ID
matchid        Unique match ID
matchPeriod    Which half the event takes place in
eventSec       How many seconds into the match the event occurred
teamId         Unique team ID
playerId       Unique player ID
eventId        Event ID
eventName      Name of the event
subEventId     Subevent ID
subEventName   Subevent name
y1             Y coordinate of the event
x1             X coordinate of the event
tags           Tags describing the event (comma-separated)
Goal           Whether the event was a goal
Type           What type of shot the event was
Counter        Whether the event was on a counter attack
Target         Whether the shot was on target
Position       Where on goal the shot was aimed
Post           Whether the shot hit the post

Shot Type Accuracy

Question: Does the type of shot affect the accuracy of the shot? If so, what type of shot is more accurate?

Shot Accuracy and Location by Shot Type

Shot Type Accuracy

Type OnTargetPerc
Head 0.3776371
Right 0.2985751
Left 0.2763419

As we can see in the previous table, the differences between the types of shots are modest, but on average headers are the most accurate: about 38% of them are on target, compared with roughly 30% of right-footed and 28% of left-footed shots.

This is probably because headers often occur much closer to the goal than the other types.

We also see that on average right-footed shots are slightly more accurate than left-footed shots.
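The on-target percentages above are a per-group mean of the Target flag. A minimal sketch of that calculation in plain Python follows; the function name and the toy rows are hypothetical and are not taken from the Wyscout data, and the original report may have computed the table differently.

```python
from collections import defaultdict

def on_target_rate_by_type(shots):
    """Return {shot type: share of shots on target} for (type, on_target) pairs."""
    totals = defaultdict(int)
    on_target = defaultdict(int)
    for shot_type, target in shots:
        totals[shot_type] += 1
        on_target[shot_type] += int(target)
    return {t: on_target[t] / totals[t] for t in totals}

# Hypothetical toy rows for illustration only (not real match data).
example_shots = [
    ("Head", 1), ("Head", 0),
    ("Right", 1), ("Right", 0), ("Right", 0), ("Right", 0),
    ("Left", 0), ("Left", 1), ("Left", 0),
]

rates = on_target_rate_by_type(example_shots)
# List shot types from most to least accurate.
for shot_type, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{shot_type}: {rate:.3f}")
```

Applied to the real shot records, the same grouping would be keyed on the Type column and averaged over the Target column from the data dictionary.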

Where to Aim

This graphic shows the best places to shoot the ball, given the data we have. Each dot represents a different quadrant of the goal, and its color represents the percentage of shots aimed at that location that scored. So the lighter the dot, the more likely you are to score.

Goal Probability

Breakdown of shots versus goals.

Goal Counts
Goal 293
Shot 2731

If every shot were predicted to miss, what would the accuracy of the model be?

Accuracy
0.9031085
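As a worked check, this baseline can be reproduced from the Goal Counts table. The sketch below reads the Shot row as shots that did not result in a goal, which is the reading that matches the reported accuracy; the variable names are illustrative.

```python
# Counts copied from the Goal Counts table.
goals = 293
missed_shots = 2731  # assumed to be shots that did not score

total_shots = goals + missed_shots

# Predicting "no goal" for every shot is correct exactly on the misses.
baseline_accuracy = missed_shots / total_shots
print(f"{baseline_accuracy:.7f}")
```

Any model has to beat this figure on accuracy before it adds value over always predicting a miss.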

The data is split into training and validation sets (70%/30%), and a backward stepwise regression is run on the training set.
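A minimal sketch of a 70/30 random split in plain Python is given here for illustration. The function name, the seed and the stand-in rows are assumptions; the report does not show how its split was drawn, and the backward stepwise selection itself is not reproduced.

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, validation) lists."""
    shuffled = list(rows)
    # A fixed seed (an arbitrary choice here) makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for shot records
train, validation = train_validation_split(rows)
print(len(train), len(validation))
```

The model is then fitted on the training rows only, and the validation rows are held back to estimate out-of-sample performance.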

Model Summary

## 
## Call:
## glm(formula = Goal ~ Target + y1 + eventName + x1, family = binomial, 
##     data = train.goal)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.369   0.000   0.000   0.000   4.009  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -26.49637  677.55979  -0.039   0.9688    
## Target         33.75065  967.10794   0.035   0.9722    
## y1              0.15354    0.02625   5.850 4.92e-09 ***
## eventNameShot -17.03762  690.08150  -0.025   0.9803    
## x1              0.02342    0.01062   2.206   0.0274 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1155.7  on 1814  degrees of freedom
## Residual deviance:  619.7  on 1810  degrees of freedom
## AIC: 629.7
## 
## Number of Fisher Scoring iterations: 20
## [1] 1.094576e-114
OptimalCutoff
0.4299999
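To make the fitted model concrete, the sketch below turns the coefficients from the summary into a goal probability with the inverse-logit function and then applies the optimal cutoff. Python is used purely for illustration (the summary itself is R output); the function names and the example coordinates are hypothetical, and eventNameShot is treated as a 0/1 indicator that the event is named "Shot", following R's usual dummy coding.

```python
import math

# Coefficients copied from the model summary.
COEF = {
    "intercept": -26.49637,
    "Target": 33.75065,
    "y1": 0.15354,
    "eventNameShot": -17.03762,
    "x1": 0.02342,
}
OPTIMAL_CUTOFF = 0.4299999  # value reported in the document

def goal_probability(target, y1, is_shot_event, x1):
    """Inverse-logit of the linear predictor built from the coefficients."""
    eta = (COEF["intercept"]
           + COEF["Target"] * target
           + COEF["y1"] * y1
           + COEF["eventNameShot"] * is_shot_event
           + COEF["x1"] * x1)
    return 1.0 / (1.0 + math.exp(-eta))

def predict_goal(target, y1, is_shot_event, x1, cutoff=OPTIMAL_CUTOFF):
    """Classify as a goal when the predicted probability reaches the cutoff."""
    return goal_probability(target, y1, is_shot_event, x1) >= cutoff

# Hypothetical coordinates, chosen only to exercise the formula.
p_on_target = goal_probability(target=1, y1=50, is_shot_event=1, x1=90)
p_off_target = goal_probability(target=0, y1=50, is_shot_event=1, x1=90)
print(p_on_target, p_off_target)
```

With these coefficients an off-target shot receives a probability that is effectively zero, so the position terms only separate goals from misses among shots that are on target.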

AUC ROC Curve

AUC
0.9279805

Model Training and Testing Accuracy

Data set   AUC        Accuracy
Training   0.9358083  0.9163
Test       0.9279805  0.9041
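The two metrics in this table can be sketched in plain Python as follows. Accuracy depends on the chosen cutoff, while AUC is the probability that a randomly chosen goal receives a higher score than a randomly chosen miss. The function names and the toy labels and scores are hypothetical; the report's own figures were presumably produced with R tooling that is not shown.

```python
def accuracy_at_cutoff(labels, scores, cutoff):
    """Share of cases where (score >= cutoff) matches the 0/1 label."""
    hits = sum(int(s >= cutoff) == y for y, s in zip(labels, scores))
    return hits / len(labels)

def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy labels (1 = goal) and predicted probabilities.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1]

toy_accuracy = accuracy_at_cutoff(labels, scores, cutoff=0.43)
toy_auc = auc(labels, scores)
print(toy_accuracy, toy_auc)
```

Because AUC does not depend on a cutoff or on the class balance, it can be high even when accuracy barely exceeds the all-miss baseline, as in the table.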

With an overall test accuracy of 90.41%, the model does only slightly better than blindly predicting a miss for every shot taken (90.31%). However, the model does show that a player's position on the pitch is a significant factor in whether a shot goes in.

This model leaves a little to be desired, but by incorporating more information about the teams, matches and players, I believe it could be equipped to reach a higher accuracy.

Sources

  • Pappalardo et al. (2019) A public data set of spatio-temporal match events in soccer competitions. Nature Scientific Data 6:236. https://www.nature.com/articles/s41597-019-0247-7
  • Pappalardo et al. (2019) PlayeRank: Data-driven Performance Evaluation and Player Ranking in Soccer via a Machine Learning Approach. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 5, Article 59 (September 2019), 27 pages. DOI: https://doi.org/10.1145/3343172