Luka Football Metrics Second Iteration

Question 1

devtools::install_github("JaseZiv/worldfootballR")
library(worldfootballR)

passing_stats <- function(country) {
  stats <- fb_season_team_stats(country = country, gender = "M", season_end_year = 2024, tier = "1st", stat_type = "passing")
  clean_stats <- subset(stats, Team_or_Opponent == "team")
  return(clean_stats)
}

la_liga_passing_stats_clean <- passing_stats("ESP")
ligue1_passing_stats_clean <- passing_stats("FRA")
premierleague_passing_stats_clean <- passing_stats("ENG")
serieA_passing_stats_clean <- passing_stats("ITA")
bundesliga_passing_stats_clean <- passing_stats("GER")

top5 <- rbind(
 la_liga_passing_stats_clean, ligue1_passing_stats_clean, premierleague_passing_stats_clean, serieA_passing_stats_clean, bundesliga_passing_stats_clean
)

options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("stargazer")

library(stargazer)

top5 <- top5[, !names(top5) %in% c("Team_or_Opponent", "Mins_Per_90", "Season_End_Year", "Gender", "Country", "Num_Players")]

Question 2

Policy question: If teams in the top five European leagues want to assist more goals, should they be making more progressive passes?

Empirical question: How does the expected number of assists change with each additional progressive pass?

In the world of football, progressive passes have become something of a buzzword. Whether passing forward actually makes the difference the media implies is what this second iteration of my econometrics project puts to the test! How football managers set up their teams and instruct players can depend a fair bit on statistics like progressive passes. With progressive passes in mind, managers may instruct their players to always try to pass forward rather than backwards in an attempt to score more goals. This iteration of the project lets me test this idea through the regressions that follow. In essence, are progressive passes ‘all that’?

Question 3

passing_regression <- lm(Ast ~ PrgP, data = top5)

summary(passing_regression)

## 
## Call:
## lm(formula = Ast ~ PrgP, data = top5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.147  -5.189  -1.081   5.941  16.740 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.86897    4.47860  -3.543 0.000617 ***
## PrgP          0.03746    0.00315  11.893  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.846 on 94 degrees of freedom
## Multiple R-squared:  0.6008, Adjusted R-squared:  0.5965 
## F-statistic: 141.4 on 1 and 94 DF,  p-value: < 2.2e-16

There is a high likelihood of omitted variable bias. As I’ve described above, there’s a lot of nuance when it comes to making an assist and scoring a goal, and many other variables likely contribute to assists as well. That’s what makes football so unpredictable and exciting. For statisticians, however, this is a nightmare! Selection bias is not really a concern in this case: the number of progressive passes played throughout a season is recorded for every team, at every match. Whether these trends hold for all professional football leagues is another story, and if I were projecting these numbers onto all of professional football, I believe there would be a fair bit of selection bias. However, this iteration is meant to focus solely on the top five European football leagues.
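To make the omitted variable concern concrete, here is a minimal sketch using simulated data rather than the FBref data. The variable `team_quality` is a hypothetical confounder invented for illustration, and the true slope on progressive passes is set to 1 by construction.

```r
# Simulated illustration of omitted variable bias (not the FBref data).
set.seed(42)
n <- 500
team_quality <- rnorm(n)                     # hypothetical confounder
prog_passes  <- 2 * team_quality + rnorm(n)  # better teams pass forward more
assists      <- 1 * prog_passes + 3 * team_quality + rnorm(n)  # true slope is 1

short_model <- lm(assists ~ prog_passes)                 # omits team_quality
long_model  <- lm(assists ~ prog_passes + team_quality)  # controls for it

coef(short_model)["prog_passes"]  # absorbs part of the team_quality effect
coef(long_model)["prog_passes"]   # should land near the true slope of 1
```

In the short model the slope also picks up the effect of the omitted variable, which is the same risk the simple Ast ~ PrgP regression runs.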

Question 4

I’m working with FBref data, accessed through the worldfootballR package kindly compiled by JaseZiv for use in R. FBref, part of the Sports Reference family of sites (which also runs Stathead), does a good job of condensing a massive amount of match data into one website from which it can be downloaded. Passing aside, there is far more data available, including shooting and defending-themed variables. For the sake of simplicity, I’ve chosen to focus on the passing-themed data and the assist and progressive pass variables contained within.

Question 5

install.packages("stargazer")

install.packages("ggplot2")

install.packages("plotly")

install.packages("RColorBrewer")

library(stargazer)
library(ggplot2)
library(plotly)
library(RColorBrewer)

passing_regression <- lm(Ast ~ PrgP, data = top5)

# se expects one vector of standard errors per model, so pass the whole "Std. Error" column
stargazer(passing_regression, type = "text",
  se = list(summary(passing_regression)$coefficients[, "Std. Error"]),
  dep.var.labels = "Assists",
  covariate.labels = "Progressive Passes",
  style = "default")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               Assists          
## -----------------------------------------------
## Progressive Passes           0.037***          
##                               (0.003)          
##                                                
## Constant                    -15.869***         
##                               (4.479)          
##                                                
## -----------------------------------------------
## Observations                    96             
## R2                             0.601           
## Adjusted R2                    0.597           
## Residual Std. Error       8.846 (df = 94)      
## F Statistic           141.446*** (df = 1; 94)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

color_palette <- brewer.pal(8, "Set2")

intercept <- coef(passing_regression)[1]
slope <- coef(passing_regression)[2]

gg_plot <- ggplot(top5, aes(x = PrgP, y = Ast, text = paste("Team: ", Squad, "<br>Assists: ", Ast,"<br>Progressive Passes: ", PrgP), color = Competition_Name)) +
  geom_point(size = 1.5, alpha = 1) +
  geom_abline(intercept = intercept, slope = slope, col = "black") +  
  labs(title = "Assists as a Function of Progressive Passes",
       x = "Progressive Passes",
       y = "Assists") +
    scale_color_manual(values = color_palette) + 
  theme_classic() +
  theme(legend.position = "right") 

interactive_plot <- ggplotly(gg_plot, tooltip = "text")

interactive_plot

Question 6

The results from the graph are pleasing. We can observe a general trend in which the expected number of assists rises with each additional progressive pass. Although few data points land exactly on the trend line, this is an intuitive result. Teams that made more progressive passes had more assists and, in turn, tended to be more successful within their respective leagues. This includes teams like Bayer Leverkusen, Arsenal, Manchester City, Real Madrid, and Paris Saint-Germain. I like this plot because you can also toggle individual leagues on and off. Curiously, the trend is not reflected as clearly in Serie A, the Italian football league. If you isolate the Italian teams on this graph, progressive passes seem to have little influence on the number of assists each team had. For example, the two teams with the most progressive passes in the league, Napoli and Atalanta, had 1758 and 1754 progressive passes, respectively. Despite this, Napoli had 36 assists while Atalanta had 57. What a margin!
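As a rough worked example of the Napoli and Atalanta comparison, the sketch below plugs their progressive pass totals into the fitted line, using the rounded coefficients printed by summary() in Question 3. The data frame and its column names are made up for illustration; only the numbers come from the text above.

```r
# Fitted values for two Serie A sides, using the rounded coefficients from summary().
intercept_hat <- -15.86897
slope_hat     <- 0.03746

serie_a <- data.frame(
  squad   = c("Napoli", "Atalanta"),
  prgp    = c(1758, 1754),
  assists = c(36, 57)
)

# Predicted assists on the regression line, and each team's distance from it
serie_a$predicted <- intercept_hat + slope_hat * serie_a$prgp
serie_a$residual  <- serie_a$assists - serie_a$predicted

serie_a
slope_hat * 100   # expected change in assists per 100 extra progressive passes
```

Both teams are predicted to have roughly 50 assists, so Napoli sits well below the line and Atalanta above it, which is why the pair stands out on the plot.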

Question 7

Linearity

The scatter plot above suggests the relationship between the independent and dependent variable is approximately linear, and the model is linear in its parameters. Hence the first classical assumption is upheld.

Sample Variation

There is variation in X: teams differ in the number of progressive passes they play. The sample variation assumption is therefore upheld.

Random Sampling

There is no random sampling. As mentioned previously, this dataset is not meant to resemble a population; it is the population! The teams were not randomly sampled because every team in the top five leagues was captured.

Exogeneity

Intuitively, the progressive passing variable is likely endogenous. There’s a high chance that variables left out of the model are correlated with progressive passes and also affect assists. For example, the crosses into the penalty area (CrsPA) stat and the progressive pass (PrgP) stat overlap: a cross into the penalty area can also count as a progressive pass, and vice versa.

Homoskedasticity

install.packages("lmtest")

library(lmtest)

Breusch-Pagan to test for homoskedasticity

bp_test <- bptest(passing_regression)

bp_test

## 
##  studentized Breusch-Pagan test
## 
## data:  passing_regression
## BP = 4.8921, df = 1, p-value = 0.02698

Given a p-value of 0.027 < 0.05, we reject the null hypothesis of homoskedasticity: the residual variance does not appear to be constant.

Residual graph for further homoskedasticity clarity

# Build the plotting data directly so no variable shadows the residuals() function
residuals_df <- data.frame(
  fitted_values = fitted(passing_regression),
  residuals = residuals(passing_regression)
)


ggplot(residuals_df, aes(x = fitted_values, y = residuals)) +
  geom_point(color = "dodgerblue3", alpha = 0.8) +  
  geom_hline(yintercept = 0, color = "black") +  
  labs(title = "Residuals Plotted Against Fitted Line",
       x = "Fitted Values",
       y = "Residuals") +
  theme_classic()

From this graph we can see a spread of residuals resembling a (rough) reverse bow tie shape rather than the ideal random cloud. This is consistent with the Breusch-Pagan result: homoskedasticity is not upheld.
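One common response to heteroskedasticity is to keep the OLS coefficients but report heteroskedasticity-robust (White/HC1) standard errors. The sketch below computes them in base R on simulated data that stands in for top5; `x` and `y` are hypothetical names, and the error spread is made to grow with `x` on purpose. In practice `sandwich::vcovHC(fit, type = "HC1")` together with `lmtest::coeftest()` is the usual route; the manual version is shown only to make the formula visible.

```r
# HC1 heteroskedasticity-robust standard errors, computed by hand in base R.
# Simulated data stands in for top5; x and y are hypothetical names.
set.seed(1)
n <- 96
x <- runif(n, 800, 2200)
y <- -15 + 0.037 * x + rnorm(n, sd = 0.006 * x)  # error spread grows with x

fit <- lm(y ~ x)
X   <- model.matrix(fit)   # n x k design matrix (intercept and x)
e   <- residuals(fit)
k   <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X: each row of X scaled by its residual
vcov_hc1 <- bread %*% meat %*% bread * n / (n - k)

robust_se  <- sqrt(diag(vcov_hc1))
classic_se <- sqrt(diag(vcov(fit)))
cbind(classic_se, robust_se)
```

The coefficient estimates are unchanged; only the standard errors, and therefore the t-statistics and confidence intervals, are adjusted.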

Normality

Shapiro-Wilk test setup

shap_test <- shapiro.test(residuals(passing_regression))

shap_test

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(passing_regression)
## W = 0.98271, p-value = 0.2382

Given a p-value of 0.238 > 0.05, we fail to reject the null hypothesis of normality, which suggests the residuals of this regression are approximately normally distributed.

Question 8

Are my coefficient estimates BLUE? No, they are not. Because the tests above reject homoskedasticity, OLS is no longer the best (minimum-variance) linear unbiased estimator, and the default standard errors may be unreliable. Separately, because exogeneity is doubtful, we cannot say that progressive passes are causing more assists, so these results should not be interpreted causally. Instead, they imply a strong correlation.

Question 9

Residual Density Distribution

prgp_residuals <- residuals(passing_regression)

ggplot(data.frame(prgp_residuals), aes(x = prgp_residuals)) +
  geom_density(fill = "dodgerblue3", alpha = 0.3) +
  labs(title = "Density Distribution of Residuals", x = "Residuals", y = "Density") +
  theme_classic()

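The density plot of the residuals is not a perfect bell curve, although the Shapiro-Wilk test in Question 7 (p = 0.238) found no significant departure from normality. Note that the shape of the residual distribution affects how reliable our inference is; it does not tell us whether the relationship is causal or merely correlative.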

Question 10

confint(passing_regression)

##                   2.5 %      97.5 %
## (Intercept) -24.7613264 -6.97660501
## PrgP          0.0312107  0.04372022

Our 95% confidence interval for the PrgP coefficient runs from 0.0312 to 0.0437. Since this range does not include 0, we reject the null hypothesis that the coefficient is zero, which suggests progressive passes are an effective indicator of assists.
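The interval can be rebuilt by hand from the summary() output, which shows where confint() gets its numbers. This is a small sketch using the rounded estimate and standard error printed in Question 3, so it agrees with confint() only up to rounding; the variable names are my own.

```r
# Rebuild the 95% interval for PrgP from the rounded summary() output.
estimate  <- 0.03746   # PrgP coefficient
std_error <- 0.00315   # its standard error
df_resid  <- 94        # residual degrees of freedom

t_crit <- qt(0.975, df = df_resid)            # two-sided 5% critical value
ci     <- estimate + c(-1, 1) * t_crit * std_error
t_stat <- estimate / std_error                # tests H0: slope = 0

round(ci, 4)
round(t_stat, 2)
```

Because the t-statistic is far larger than the critical value, the interval sits well away from zero.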

summary(passing_regression)

## 
## Call:
## lm(formula = Ast ~ PrgP, data = top5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.147  -5.189  -1.081   5.941  16.740 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.86897    4.47860  -3.543 0.000617 ***
## PrgP          0.03746    0.00315  11.893  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.846 on 94 degrees of freedom
## Multiple R-squared:  0.6008, Adjusted R-squared:  0.5965 
## F-statistic: 141.4 on 1 and 94 DF,  p-value: < 2.2e-16

Given a p-value < 0.05 for PrgP, we again find a statistically significant coefficient and reject the null hypothesis. The results of this test suggest progressive passes have a statistically significant association with the number of assists.

Question 11

Given the regression analysis conducted above, we can conclude there is a very strong correlation but not causality. As the BLUE discussion above shows, proving causality is very difficult. Showing that all classical assumptions are upheld is hard, particularly the exogeneity assumption: avoiding endogeneity is difficult. Hence for this model, despite showing correlation rather than causation, I would think progressive passes are positively related to the number of assists a team has: the more progressive passes a team plays, the more assists they will likely have, but this doesn’t mean the former causes the latter.

Question 12

I would like to run similar regression stress tests with other variables in the dataset to see how homoskedasticity and the density distribution of residuals compare with those for progressive passes. Constructing a passing index that combines multiple variables, including progressive passes, in an attempt to limit potential endogeneity would also be worthwhile for the next iteration. Before I can say for certain that ‘progressive passes CAUSE more assists’, I would need to establish exogeneity, which in this case is close to impossible. This is often the roadblock statisticians run into: even when the estimates describe a significant relationship between two variables, the full set of tests fails to support causality, and the resulting relationship instead describes strong correlation. Being able to rule out a causal reading is a potent tool, and this regression stress testing is something I will be doing more of as I continue to use R.
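One possible shape for the passing index mentioned above is an average of standardized passing metrics. The sketch below uses simulated numbers, not the FBref data; the choice of columns (PrgP, KP, CrsPA) and the equal weighting are assumptions made purely for illustration.

```r
# Hypothetical passing index: the average of z-scored passing metrics.
# Simulated values stand in for team-season data; equal weights are an assumption.
set.seed(7)
passing <- data.frame(
  PrgP  = rnorm(20, mean = 1500, sd = 300),
  KP    = rnorm(20, mean = 350,  sd = 60),
  CrsPA = rnorm(20, mean = 70,   sd = 15)
)

z_scores <- scale(passing)                    # centre and scale each column
passing$passing_index <- rowMeans(z_scores)   # one index value per team

head(passing)
```

Standardizing first keeps a high-volume stat like PrgP from dominating the index simply because of its larger units.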

AI Statement

I used AI throughout this project, primarily to debug and to uncover features I didn’t know existed within R. The ‘plotly’ package is very cool, and I wouldn’t have known about it if I hadn’t asked an AI model whether it had any graph beautification recommendations. I also employed AI for other issues such as syntax and code minimization.

2024-11-05