Data in sports has been well known because of stories encapsulated by journalistic writing, like seen in the book, Money ball. To sum up the plot, the book explores the success of using ‘unconventional strategies’ (I.e., usage of data in sports) used by Oakland Athletics baseball team who were on a shoe string budget. Here, there is a beacon of bright light shined on the promising power of data analysis and to challenge traditional methodical thinking in sports. My data set is from a website named FBREF and the data is collected by the organization named Opta. Here I will explore 2021-2022 possession statistics from professional players in the top five leagues in Europe.
Before taking courses from Montgomery College, I always had a profound passion to stay within the sport irregardless if I was an active participant on the field. My main interests were, what skills are appreciated by the sport and are accessible near me? That is when I discovered the data courses here at MC, to which I am glad to have been part of. That is only half of the answer, now this data is democratized for everyone to use through the FBREF site. In their about page, there is no coherent description of how the data was collected, but mentions that it uses Opta’s data. To add on, sports data is used which can be seen online through twitter, medium, broadcasting channels, and so on. This is me trying to familiarize myself with data in the sense that it will help me understand how to work with it and create sophisticated models for others to use.
# This function will load the packages, which are, tidymodels, tidyverse, and worldfootballR
library(tidymodels)
library(tidyverse)
library(worldfootballR)
library(DataExplorer)
library(ggfortify)
library(ggplot2)
library(ggrepel)
library(plotly)
# This function will read the csv file from the working directory and name it player in your global environment
player <- read_csv("big5_advanced_stats.csv")
# This function will allow us to take a peek of the first six observations
head(player)
## # A tibble: 6 × 32
## Season_End_Year Squad Comp Player Nation Pos Age Born Mins_Per_90
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2022 Alavés La Liga Martin Ag… ESP DF 25 1996 18
## 2 2022 Alavés La Liga Mircea Al… ROU DF 21 2000 0.2
## 3 2022 Alavés La Liga Rubén Dua… ESP DF 25 1995 29.1
## 4 2022 Alavés La Liga Gonzalo E… ARG MF 28 1993 14.4
## 5 2022 Alavés La Liga Manu Garc… ESP MF 23 1998 12
## 6 2022 Alavés La Liga Saúl Garc… ESP MF 26 1994 0
## # ℹ 23 more variables: Touches_Touches <dbl>, `Def Pen_Touches` <dbl>,
## # `Def 3rd_Touches` <dbl>, `Mid 3rd_Touches` <dbl>, `Att 3rd_Touches` <dbl>,
## # `Att Pen_Touches` <dbl>, Live_Touches <dbl>, Att_Take <dbl>,
## # Succ_Take <dbl>, Succ_percent_Take <dbl>, Tkld_Take <dbl>,
## # Tkld_percent_Take <dbl>, Carries_Carries <dbl>, TotDist_Carries <dbl>,
## # PrgDist_Carries <dbl>, PrgC_Carries <dbl>, Final_Third_Carries <dbl>,
## # CPA_Carries <dbl>, Mis_Carries <dbl>, Dis_Carries <dbl>, …
From the source, there was noreadme file, but there was a glossary to describe each variable.
season_end_year - If the end year was in 2020, then the full season would be 2019-2020, and if it was 2023, then the full season would be 2022-2023
squad - The name of the team
comp - The name of the competitive league
player - The name of the athlete
nation - The name of the nation from which the athlete is from
pos - The name of the position from which the athlete most commonly plays in
age - A numeric input for the athlete’s age for the start of the season (current age)
born - The year of birth for each athlete
mins_per_90 - Mathematically, the minutes played divided by 90
touches_touches - The number of times a player touched the ball (Note: Receiving a pass, then dribbling, then sending a pass counts as one touch)
def pen_touches - Touches in the defensive penalty area
def 3rd_touches - Touches in the defensive 1/3 of the pitch
mid 3rd_touches - Touches in the middle 1/3 of the pitch
att 3rd_touches - Touches in the attacking 1/3 of the pitch
live_touches - Live-ball touches (Note: Does not include corner kicks, free kicks, throw-ins, kick-offs, goal kicks or penalty kicks)
att_take - Number of attempts to take on defenders while dribbling
succ_take - Number of defenders taken on successfully, by dribbling past them (Note: Unsuccessful take-ons include attempts where the dribbler retained possession but was unable to get past the defender)
succ_percent_take - Percentage of take-ons completed successfully
tkld_take - Number of times tackled by a defender during a take on attempt
tkld_percent_take - Percentage of time tackled by a defender during a take-on attempt
carries_carries - Number of times the player controlled the ball with their feet
totdist_carries - The total distance, in yards, a player moved the ball while controlling it with their feet, in any direction
prgdist_carries - Progressive distance, in yards, a player moved the ball while controlling it with their feet towards the opponent’s goal
prgc_carries - Carries that move the ball towards the opponent’s goal line at least 10 yards from its furthest point in the last six passes, or any carry into the penalty area (Note: Excludes carries which end in the defending 50% of the pitch)
final_third_carries - Carries that enter the third of the pitch closest to the goal
cpa_carries - Carries into the 18-yard box
mis_carries - Number of times a player failed when attempting to gain control of a ball
dis_carries - Number of times a player loses control of the ball after being tackled by an opposing player (Note: Does not include attempted take-ons)
rec_receiving - Number of times a player successfully received a pass
prgr_receiving - [Progressive passes received] Completed passes that move the ball towards the opponent’s goal line at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area (Note: Excludes passes from the defending 40% of the pitch)
What are some quantitative variables?
How about some categorical variables?
From a glanced look, progressive carries (I.e., measure a player’s ability to advance the pitch with successful dribbles/runs) and the number of touches (I.e., measure the player’s ability to find space to have touches on the ball) will be worth exploring their relationship
# Be sure to pipe in the data from the player subset into ggplot
player |>
# In ggplot, set the aesthetics for x to equal to progressive carries
ggplot(aes(x = PrgC_Carries)) +
# This function will help us visualize the distribution for this performance metric
geom_bar()
The shape of this distribution appears to be left skewed and there are outl,iers that go above 200.
# Be sure to pipe in the data from the player subset into ggplot
player |>
# In ggplot, set the aesthetics for x to equal to touches
ggplot(aes(x = Touches_Touches)) +
# This function will help us visualize the distribution for this performance metric
geom_bar()
This shape looks the same as the first (I.e., left skewed) and there are outliers that go above 2000.
Removing Na’s from the dataset, as well as, setting up a criteria to clean the dataset
# Name the new subset prem and make sure to use the player subset to pipe in data
prem <- player |>
# Filter for comp to be Premier League only
filter(Comp == "Premier League") |>
# Filter for mins per 90 to be over 16
filter(Mins_Per_90 > 16) |>
# Remove goalkeepers from the the variable pos
filter(Pos != "GK")
# This function glimpses the first six observations of the new subset
head(prem)
## # A tibble: 6 × 32
## Season_End_Year Squad Comp Player Nation Pos Age Born Mins_Per_90
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2022 Arsenal Premier L… Gabri… BRA DF 23 1997 34
## 2 2022 Arsenal Premier L… Alexa… FRA FW 30 1991 19.8
## 3 2022 Arsenal Premier L… Marti… BRA FW,MF 20 2001 20.7
## 4 2022 Arsenal Premier L… Thoma… GHA MF 28 1993 22.5
## 5 2022 Arsenal Premier L… Bukay… ENG FW,MF 19 2001 33.1
## 6 2022 Arsenal Premier L… Emile… ENG MF,FW 21 2000 21.3
## # ℹ 23 more variables: Touches_Touches <dbl>, `Def Pen_Touches` <dbl>,
## # `Def 3rd_Touches` <dbl>, `Mid 3rd_Touches` <dbl>, `Att 3rd_Touches` <dbl>,
## # `Att Pen_Touches` <dbl>, Live_Touches <dbl>, Att_Take <dbl>,
## # Succ_Take <dbl>, Succ_percent_Take <dbl>, Tkld_Take <dbl>,
## # Tkld_percent_Take <dbl>, Carries_Carries <dbl>, TotDist_Carries <dbl>,
## # PrgDist_Carries <dbl>, PrgC_Carries <dbl>, Final_Third_Carries <dbl>,
## # CPA_Carries <dbl>, Mis_Carries <dbl>, Dis_Carries <dbl>, …
prem |>
ggplot(aes(x = PrgC_Carries, y = Touches_Touches))+
geom_point() +
theme_bw() +
theme_minimal(base_size = 12) +
geom_label_repel(aes(label=ifelse(PrgC_Carries > 63, as.character(Player),'')), box.padding = unit(0.40, 'lines'), min.segment.length = unit(0, 'lines')) +
labs(x = "Progressive Carries", y = "Touches", title = "Scatterplot, Progressive Carries to Touches", caption = "Source: FBREF")
From the data visualization we can infer that there is a slight positive association between touches and progressive carries. Not sure if there is correlation between the metrics, so a linear model and correlation coefficient is best to determine if that is true. There is one outlier way above 3000 touches and at roughly 125 progressive carries.
# This function calculates the correlation coefficient
cor(prem$PrgC_Carries, prem$Touches_Touches)
## [1] 0.06768441
# Make a new subset for the linear model and name it fit1. Then set the syntax to Y ~ X, and make sure to set the data to equal to prem
fit1 <- lm(Touches_Touches ~ PrgC_Carries, data = prem)
# This function calculates the summary for the linear model created
summary(fit1)
##
## Call:
## lm(formula = Touches_Touches ~ PrgC_Carries, data = prem)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1048.0 -431.6 -89.7 305.7 2205.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1412.442 66.353 21.287 <2e-16 ***
## PrgC_Carries 1.206 1.179 1.022 0.308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 585.3 on 227 degrees of freedom
## Multiple R-squared: 0.004581, Adjusted R-squared: 0.0001961
## F-statistic: 1.045 on 1 and 227 DF, p-value: 0.3078
The chunk above shows two outputs and one stands for “correlation” and the other can build an equation which can be interpreted for good use. For the correlation value, the number is between -1 and 1, and the purpose is to indicate how strong or weak the correlation is. For example, values close to +/- 1 are considered as a strong correlation, values closer to +/-0.5 are considered as a weak correlation, and values closer to zero have no correlation.
The model above can be formed into an equation:
Touches_Touches = 1412 +
1.206(PrgC_Carries)
Here is the interpretation of the slope; For each additional progressive carry, there is a predicted increase of 1.206 in touches
Another aspect to interpret in terms of the model is the p-value: The reference to look at is if it is above or less than 0.05, which in this case it is 0.3078. In other words, because the p-value is above it would suggest this is not a meaningful variable to explain the linear increase in touches
Lastly, the adjusted r-squared value states that about 0.02% of the observations may be explained by this model. To add on that note, roughly 99.98% of the variation in the data is not explained by the use of this model.
# Make a new subset to test for correlation coefficients and name it cor_test
cor_test <- prem |>
# Select for predictor variables that are quantitative to predict touches
select(Touches_Touches, `Def Pen_Touches`, `Def 3rd_Touches`, `Mid 3rd_Touches`, `Att 3rd_Touches`, `Att Pen_Touches`, Live_Touches, Carries_Carries, TotDist_Carries, PrgDist_Carries, Att_Take, Succ_Take, Tkld_Take, PrgC_Carries, Rec_Receiving, PrgR_Receiving, Final_Third_Carries, CPA_Carries)
# Remove live touches from the subset cor_test
cor_test <- cor_test |>
select(-Live_Touches)
# This function will explore all the correlation coefficients from the selected predictor variables
plot_correlation(cor_test)
(Note: From the first exploration of correlations) Remove live touches, there shows a perfect correlation coefficient of 1, but would not make sense to include into a multiple regression analysis.
# Make a new subset for the second version of the model and make sure to use correlations that are weak or strong (Note: Disregard those that are close to a value of zero)
fit2 <- lm(Touches_Touches ~ `Def 3rd_Touches` + `Mid 3rd_Touches` + `Def Pen_Touches` + Carries_Carries + TotDist_Carries + PrgDist_Carries + Rec_Receiving, data = prem)
# This function will display the output from the model
summary(fit2)
##
## Call:
## lm(formula = Touches_Touches ~ `Def 3rd_Touches` + `Mid 3rd_Touches` +
## `Def Pen_Touches` + Carries_Carries + TotDist_Carries + PrgDist_Carries +
## Rec_Receiving, data = prem)
##
## Residuals:
## Min 1Q Median 3Q Max
## -243.40 -78.23 -4.41 52.30 392.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 216.65149 21.65692 10.004 < 2e-16 ***
## `Def 3rd_Touches` 0.66821 0.10911 6.124 4.12e-09 ***
## `Mid 3rd_Touches` 0.49657 0.05778 8.594 1.54e-15 ***
## `Def Pen_Touches` -1.09641 0.32665 -3.357 0.000929 ***
## Carries_Carries 0.75192 0.23533 3.195 0.001601 **
## TotDist_Carries 0.04496 0.02079 2.162 0.031670 *
## PrgDist_Carries -0.11550 0.02524 -4.577 7.88e-06 ***
## Rec_Receiving 0.18637 0.14971 1.245 0.214513
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.9 on 221 degrees of freedom
## Multiple R-squared: 0.962, Adjusted R-squared: 0.9608
## F-statistic: 799.1 on 7 and 221 DF, p-value: < 2.2e-16
# This function will display diagnostic plots based from the model
autoplot(fit2, 1:4, nrow = 2, ncol = 2)
(Note: The adjusted R-squared value is 96.08%) By looking at the
p-value, the only variable that might not be contributing to this model
is Rec_Receiving. By dropping that variable and making a new model there
is chance to improve the adj R-squared value.
There are potential outliers that are perhaps skewing the variance distribution. For example, observations such as 185 and 118 appear three times in all four plots.
# Make a new subset for the third version of the model and remove Rec_Receiving
fit3 <- lm(data = prem, Touches_Touches ~ `Def 3rd_Touches` + `Mid 3rd_Touches` + `Def Pen_Touches` + Carries_Carries + TotDist_Carries + PrgDist_Carries)
# This function will display the output from the model
summary(fit3)
##
## Call:
## lm(formula = Touches_Touches ~ `Def 3rd_Touches` + `Mid 3rd_Touches` +
## `Def Pen_Touches` + Carries_Carries + TotDist_Carries + PrgDist_Carries,
## data = prem)
##
## Residuals:
## Min 1Q Median 3Q Max
## -269.41 -77.58 -7.41 52.40 400.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 219.60177 21.55348 10.189 < 2e-16 ***
## `Def 3rd_Touches` 0.62274 0.10294 6.049 6.11e-09 ***
## `Mid 3rd_Touches` 0.50484 0.05747 8.784 4.32e-16 ***
## `Def Pen_Touches` -1.02558 0.32205 -3.185 0.00166 **
## Carries_Carries 1.02193 0.09139 11.181 < 2e-16 ***
## TotDist_Carries 0.03496 0.01920 1.821 0.06999 .
## PrgDist_Carries -0.11385 0.02523 -4.512 1.04e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 116 on 222 degrees of freedom
## Multiple R-squared: 0.9617, Adjusted R-squared: 0.9607
## F-statistic: 929.8 on 6 and 222 DF, p-value: < 2.2e-16
# This function will display diagnostic plots based from the model
autoplot(fit3, 1:4, nrow = 2, ncol = 2)
After dropping Rec_Receiving, the model’s adjusted R-squared value decreased by 0.01%. Because of that, the model may be very similar to each other, so dropping Rec_Receiving does not create drastic issues.
The diagnostic plots look okay, but this time it is observations 35, 118, and 185 that are potential outliers that are skewing the data.
35 is Pascal Grob from Brighton – 185 is James-Ward-Prowse from Southampton – and 118 is Trent Alexander-Arnold from Liverpool
# Make a new subset and name it model_prem. Then, remove observations 35, 118, and 185
model_prem <- prem[-c(35,118,185),]
# Make a new subset for the third version of the model and remove Rec_Receiving
fit4 <- lm(data = model_prem, Touches_Touches ~ `Def 3rd_Touches` + `Mid 3rd_Touches` + `Def Pen_Touches` + Carries_Carries + TotDist_Carries + PrgDist_Carries)
# This function will display the output from the model
summary(fit4)
##
## Call:
## lm(formula = Touches_Touches ~ `Def 3rd_Touches` + `Mid 3rd_Touches` +
## `Def Pen_Touches` + Carries_Carries + TotDist_Carries + PrgDist_Carries,
## data = model_prem)
##
## Residuals:
## Min 1Q Median 3Q Max
## -266.18 -73.43 -8.54 57.60 370.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 225.17176 20.54175 10.962 < 2e-16 ***
## `Def 3rd_Touches` 0.61875 0.09836 6.290 1.70e-09 ***
## `Mid 3rd_Touches` 0.49198 0.05518 8.916 < 2e-16 ***
## `Def Pen_Touches` -0.95816 0.30822 -3.109 0.00213 **
## Carries_Carries 0.96875 0.08803 11.005 < 2e-16 ***
## TotDist_Carries 0.04063 0.01832 2.218 0.02757 *
## PrgDist_Carries -0.10766 0.02402 -4.483 1.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 110.3 on 219 degrees of freedom
## Multiple R-squared: 0.9643, Adjusted R-squared: 0.9633
## F-statistic: 986.3 on 6 and 219 DF, p-value: < 2.2e-16
# This function will display diagnostic plots based from the model
autoplot(fit4)
To review the new model, the adjusted R-squared value increased to 96.33%, and thus, this an impressive improvement. One thing to notice, the residual plot still appears to not be improved after making the changes.
# This function calculates the analysis of variance, and thus, compares the two models
anova(fit3, fit2)
## Analysis of Variance Table
##
## Model 1: Touches_Touches ~ `Def 3rd_Touches` + `Mid 3rd_Touches` + `Def Pen_Touches` +
## Carries_Carries + TotDist_Carries + PrgDist_Carries
## Model 2: Touches_Touches ~ `Def 3rd_Touches` + `Mid 3rd_Touches` + `Def Pen_Touches` +
## Carries_Carries + TotDist_Carries + PrgDist_Carries + Rec_Receiving
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 222 2989530
## 2 221 2968714 1 20816 1.5496 0.2145
The main value of interest is the p-value. Were there is no compelling evidence that Rec_Receiving contributes significantly to the model
The third model is the best model to predict touches based on conduction of the anova test. The p-value is large, so it is to believe that the variable dropped was in fact the right choice. In regards to the first scatter plot, there may be no correlation, but it is worth exploring when considering the metrics from a sporting perspective.
Now the points are defined for the players, but perhaps labels for the players would be best to visualize where they placed in the scatter plot. Since there are a lot of players to label, which would lead to some clashing and overlapping with each other. The way to mitigate this issues would be to only show labels for those who are the best performance in terms of their underlying metrics.
# This function will provide statistics to calculate the mean
summary(prem$PrgC_Carries)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 20.00 40.00 45.71 63.00 180.00
summary(prem$Touches_Touches)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 409 1052 1387 1468 1770 3769
p1 <- prem |>
ggplot(aes(x = PrgC_Carries, y = Touches_Touches, color = Age, text = paste("Name:", Player))) +
geom_point() +
scale_color_distiller(palette = "Reds") +
theme_bw() +
theme_minimal(base_size = 12) +
geom_vline(xintercept = 45.71) +
geom_hline(yintercept = 1468) +
labs(x = "Progressive Carries", y = "Touches", title = "Scatterplot between Progressive Carries to Touches", caption = "Source: FBREF", subtitle = "From 2021 - 2022 Competitive Premier League Season")
p1 <- ggplotly(p1)
p1
In the first visualization, we have found there to be no correlation between both performance metrics (predictor variables). Although, through calculating the mean, we set lines to reference four quadrants and define them as such, the top right is the category for players who perform high in respect to the performance metrics chosen. Those team with the most players on top is Manchester City, Joao Cancelo, Bernardo Silva, and Aymeric Laporte to name a few.
# This function will provide statistics to calculate the mean
summary(prem$`Mid 3rd_Touches`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 146.0 432.0 617.0 681.8 828.0 1985.0
p2 <- prem |>
ggplot(aes(x = `Mid 3rd_Touches`, y = Touches_Touches, color = Age, text = paste("Name:", Player))) +
geom_point() +
scale_color_distiller(palette = "Blues", direction = 1) +
theme_bw() +
theme_minimal(base_size = 12) +
labs(x = "Touches (Middle 1/3 of the Pitch)",
y = "Touches (Overall)",
title = "The Ball Was Where?",
caption = "Source: FBREF",
subtitle = "From 2021-2022 Competitive Premier League Season", color = "Minutes Played (Per 90)")
p2 <- p2 + geom_abline(slope = coef(lm(Touches_Touches ~ `Mid 3rd_Touches`, data = prem))[2], intercept = coef(lm(Touches_Touches ~ `Mid 3rd_Touches`, data = prem))[1], linetype = "dashed", color = "black")
ggplotly(p2)
In the second visualization, there is strong positive correlation between touches in the middle third and touches overall on the ball. To showcase the linear model made from earlier, there is a smoother in the interactive plot. Basically if a point is above the smoother then it is a good indication from the perspective of the model. In other words, those players are worth looking into once you ‘crunch the numbers’.
Based on surfing the web and looking at data usage from theathletic. An article named, “How many touches should a forward have in a game of football?” explores how broadcasters are using statistics. But the key question is, when seeing such statistics – is that good or bad? My goal was to find correlations and make a linear model to see how players perform in respect to their performance metrics. Through all the hard work and dedication, Joao Cancelo remained supreme on both visualizations. From a data perspective, he ticks a lot of boxes and would be worth exploring in terms of a recruitment for a team.
Carey, Mark. “How Many Touches Should a Forward Have in a Game of Football?” The Athletic, 8 Oct. 2021, theathletic.com/2871168/2021/10/08/how-many-touches-should-a-forward-have-in-a-game-of-football/.