Question 1

In the previous iteration of this metrics project, I regressed the number of assists a team recorded in the 2024/25 season on the number of progressive passes (PrgP) it played. These regressions are possible thanks in part to the FBref data set compiled by JaseZiv. The data set captures the full population: every team in the leagues covered, with every statistical category for each team. The scope is the ‘top 5’ European leagues, 96 teams in total. In the second iteration, I found a strong positive correlation between the number of progressive passes a team plays and the number of assists it records in the 2024/25 season. In other words, teams that play more progressive passes tend to record more assists. However, I could not establish causality in the second iteration because the model failed the exogeneity test. With this third iteration, I believe that adding more variables to the model can provide more decisive conclusions.

Question 2

For the estimated model, I’ve added the passes into opposition penalty area (PPA) variable: the total number of passes a team has played into the opponent’s penalty area.

With this variable in mind, we can build a new multi-variable regression. Instead of including only progressive passes, we include penalty area passes too. Penalty area passes are a likely component in explaining the number of assists a team gets in a season, so including them serves two goals: avoiding omitted variable bias and obtaining a more precise estimate. Below is the model for the estimate.

\[\widehat{\text{Assists}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Progressive Passes}_i + \hat{\beta}_2 \text{Penalty Area Passes}_i\]

Question 3

Classical Assumption Testing:

Linearity

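The model is linear in its parameters: assists are modeled as a weighted sum of the two passing variables plus an error term. Hence linearity is upheld by construction.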

No Perfect Collinearity

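Progressive passes and penalty area passes likely overlap, but they do not suffer from perfect collinearity. A progressive pass can also be a penalty area pass, but progressive passes can be played anywhere on the field. Likewise, a penalty area pass can be a progressive pass, but it does not have to be. Overlap of this kind is not perfect collinearity, because neither variable is an exact linear function of the other. By contrast, including progressive passes, ‘non-progressive passes’, and total passes together would be perfectly collinear, since the first two sum to the third.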
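The overlap between the two passing variables can be quantified rather than argued. The sketch below is a minimal illustration, assuming the `top5` data frame with `PrgP` and `PPA` columns used elsewhere in this project; `collinearity_check` is a hypothetical helper name, not part of any package. With exactly two regressors, the variance inflation factor (VIF) for each is 1 / (1 - r^2), where r is their correlation.

```r
# Hypothetical helper: correlation and variance inflation factor (VIF)
# for a model with exactly two regressors.
collinearity_check <- function(data, x1 = "PrgP", x2 = "PPA") {
  r <- cor(data[[x1]], data[[x2]])
  # With two regressors, both share the same VIF: 1 / (1 - r^2)
  c(correlation = r, vif = 1 / (1 - r^2))
}

# Usage, assuming `top5` is loaded as in the rest of this project:
# collinearity_check(top5)
```

A correlation of exactly 1 or -1 would mean perfect collinearity (an infinite VIF). Anything short of that satisfies the assumption, although a high VIF still inflates the standard errors of both coefficients.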

Random Sampling

There is no random sampling. This data set is not a sample meant to resemble a population; it is the population! Every team in scope was captured.

Exogeneity

Compared with the single-variable regression, this model includes more variables, which gives reason to believe it comes closer to exogeneity. As mentioned, progressive passes can be penalty area passes and vice versa. However, there are other variables in the data that can influence both penalty area and progressive passes and that are left in the error term, so achieving perfect exogeneity is simply not possible.

Heteroskedasticity

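library(ggplot2)

# Fit the multi-variable model: assists on progressive passes and penalty area passes
mv_model <- lm(Ast ~ PrgP + PPA, data = top5)

# Collect residuals and fitted values for the diagnostic plot
model_residuals <- resid(mv_model)
fitted_values <- fitted(mv_model)

# Residuals vs fitted values: an even band around zero would indicate homoskedasticity
ggplot(data = data.frame(Fitted = fitted_values, Residuals = model_residuals),
       aes(x = Fitted, y = Residuals)) +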
  geom_point(color = "palegreen3", alpha = 0.8) +
  geom_hline(yintercept = 0, color = "mediumorchid4") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_classic()

As we can see from the residuals vs fitted values graph, the spread of the residuals is not constant as the fitted values increase. Because the residual variance changes across the fitted values, the graph reflects heteroskedasticity. Homoskedasticity is not upheld.
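The visual reading can be backed by a formal test. The sketch below is a minimal base-R version of the studentized (Koenker) Breusch-Pagan test, which regresses the squared residuals on the model's regressors and uses n times the auxiliary R-squared as a chi-squared statistic. `bp_test` is a hypothetical helper written for illustration, and it assumes the model includes an intercept.

```r
# Hypothetical helper: studentized (Koenker) Breusch-Pagan test.
# Assumes `model` is an lm fit that includes an intercept.
bp_test <- function(model) {
  u2 <- resid(model)^2              # squared residuals
  X <- model.matrix(model)          # regressors, including the intercept column
  aux <- lm.fit(X, u2)              # auxiliary regression of u^2 on the regressors
  r2 <- 1 - sum(aux$residuals^2) / sum((u2 - mean(u2))^2)
  k <- ncol(X) - 1                  # number of slope coefficients
  lm_stat <- length(u2) * r2        # LM statistic: n * R^2 of the auxiliary regression
  c(statistic = lm_stat, df = k,
    p.value = pchisq(lm_stat, df = k, lower.tail = FALSE))
}

# Usage, assuming `mv_model` is the fitted model from this section:
# bp_test(mv_model)
```

A small p-value rejects homoskedasticity. If that is the case here, heteroskedasticity-robust standard errors are the usual remedy, since the coefficient estimates themselves remain unbiased while the default standard errors do not.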

Question 4

library(stargazer)

stargazer(mv_model, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 Ast            
## -----------------------------------------------
## PrgP                           0.006           
##                               (0.007)          
##                                                
## PPA                          0.128***          
##                               (0.026)          
##                                                
## Constant                     -9.573**          
##                               (4.220)          
##                                                
## -----------------------------------------------
## Observations                    96             
## R2                             0.682           
## Adjusted R2                    0.675           
## Residual Std. Error       7.936 (df = 93)      
## F Statistic           99.749*** (df = 2; 93)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Question 5

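For progressive passes, the coefficient is 0.006 and is not statistically significant. Holding passes into the penalty area constant, each additional progressive pass is associated with an estimated 0.006 more assists; equivalently, it takes roughly 167 additional progressive passes for one additional estimated assist. For passes into the penalty area, the coefficient is 0.128 and is significant at the 1% level: holding progressive passes constant, each additional pass into the penalty area is associated with 0.128 more assists, or roughly 8 passes per additional estimated assist. The R2 is 0.682 and the adjusted R2 is 0.675, so the model explains about 68% of the variation in assists across the 96 teams. R2 is a measure of fit rather than something that is itself significant or not; the joint significance of the model is captured by the F statistic (99.749, p<0.01).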
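To make the coefficient interpretation concrete, the fitted equation from the table above can be evaluated for a change in one variable while the other is held fixed. The changes of 100 progressive passes and 10 penalty area passes are illustrative values chosen for this example, not figures from the data, and the coefficients are the rounded ones reported in the table.

\[\widehat{Ast}_i = -9.573 + 0.006\,PrgP_i + 0.128\,PPA_i\]

\[\Delta PrgP = 100 \;\Rightarrow\; \Delta\widehat{Ast} = 0.006 \times 100 = 0.6, \qquad \Delta PPA = 10 \;\Rightarrow\; \Delta\widehat{Ast} = 0.128 \times 10 = 1.28\]

\[\frac{1}{0.006} \approx 167 \text{ progressive passes per estimated assist}, \qquad \frac{1}{0.128} \approx 7.8 \text{ penalty area passes per estimated assist}\]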

Question 6

mvmodel_confint <- confint(mv_model, level = 0.95)

mvmodel_confint
##                     2.5 %      97.5 %
## (Intercept) -17.954366031 -1.19229335
## PrgP         -0.008254839  0.01986807
## PPA           0.076042706  0.18052970

This model suggests the progressive passes variable is not significant for the assist estimate. Because the 95% confidence interval for PrgP contains 0, we fail to reject the null hypothesis that its coefficient is zero at the 5% level. Passes into the penalty area, however, are significant: the interval for PPA does not contain 0.
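These intervals can be reproduced by hand as estimate ± critical t value × standard error, with 93 residual degrees of freedom. The sketch below uses the rounded coefficients and standard errors from the regression table, so its results should agree with `confint()` only approximately; `ci_from_se` is a hypothetical helper name.

```r
# Hypothetical helper: confidence interval from an estimate, its standard error,
# and the residual degrees of freedom.
ci_from_se <- function(estimate, se, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df = df)   # two-sided critical value
  c(lower = estimate - t_crit * se, upper = estimate + t_crit * se)
}

# Rounded values from the regression table (df = 93)
ci_from_se(0.006, 0.007, df = 93)   # PrgP
ci_from_se(0.128, 0.026, df = 93)   # PPA
```

An interval that contains 0 corresponds to a two-sided t test that fails to reject a zero coefficient at the matching significance level, which is why the interval and the significance stars in the table tell the same story.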

Question 7

library(car)

linearHypothesis(mv_model, c("PrgP=0", "PPA=0"))
## 
## Linear hypothesis test:
## PrgP = 0
## PPA = 0
## 
## Model 1: restricted model
## Model 2: Ast ~ PrgP + PPA
## 
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     95 18422.2                                  
## 2     93  5857.4  2     12565 99.749 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given an F-value of ~100 and a p-value below 2.2e-16, this joint test rejects the null hypothesis that both coefficients are zero. In essence, progressive passes and passes into the penalty area are jointly significant in explaining the number of assists a team has during the 2024/25 season.
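The F statistic can be recovered directly from the two residual sums of squares in the output above. The sketch below applies the standard formula F = ((RSS_r - RSS_u) / q) / (RSS_u / df_u), where q is the number of restrictions; `f_from_rss` is a hypothetical helper name, and the inputs are the rounded values printed by `linearHypothesis()`.

```r
# Hypothetical helper: F statistic and p-value from restricted and
# unrestricted residual sums of squares.
f_from_rss <- function(rss_restricted, rss_unrestricted, q, df_unrestricted) {
  f_stat <- ((rss_restricted - rss_unrestricted) / q) /
    (rss_unrestricted / df_unrestricted)
  c(F = f_stat, p.value = pf(f_stat, q, df_unrestricted, lower.tail = FALSE))
}

# Values from the linear hypothesis output: 2 restrictions, 93 residual df
f_from_rss(18422.2, 5857.4, q = 2, df_unrestricted = 93)

# The restricted model here is intercept-only, so the same two numbers give R2
r_squared <- 1 - 5857.4 / 18422.2
r_squared
```

Because both slopes are restricted to zero, this joint test is the same as the overall F test reported at the bottom of the regression table.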

Question 8

passing_regression <- lm(Ast ~ PrgP, data = top5)
prgp_residuals <- residuals(passing_regression)

combined_residuals <- data.frame(
  Residuals = c(residuals(mv_model), prgp_residuals),
  Source = c(rep("PrgP + PPA Residual Density Plot", length(residuals(mv_model))), 
             rep("PrgP Residual Density Plot", length(prgp_residuals)))
)

ggplot(combined_residuals, aes(x = Residuals, color = Source, fill = Source)) +
  geom_density(alpha = 0.3) +
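  scale_color_manual(values = c("black", "black")) +
  # Name the fill colors so each model keeps its color regardless of factor order
  scale_fill_manual(values = c("PrgP + PPA Residual Density Plot" = "lightpink",
                               "PrgP Residual Density Plot" = "dodgerblue2")) +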
  labs(title = "Density Distribution of Residuals",
       x = "Residuals",
       y = "Density",
       color = "Key",
       fill = "Key") +
  theme_classic()

As is visible in the pink density above, the residuals of the multi-variable model are not perfectly normally distributed. However, compared with the residuals of the previous simple regression, graphed in blue, they are closer to a normal distribution.
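The visual comparison can be supplemented with a Shapiro-Wilk test on each model's residuals. The sketch below is one possible check using base R's `shapiro.test()`; `normality_pvalues` is a hypothetical helper name, and the usage line assumes the `passing_regression` and `mv_model` objects fitted in this section.

```r
# Hypothetical helper: Shapiro-Wilk p-value for the residuals of each model.
# Models are passed as named arguments so the result is labeled.
normality_pvalues <- function(...) {
  models <- list(...)
  vapply(models, function(m) shapiro.test(resid(m))$p.value, numeric(1))
}

# Usage, assuming both models are fitted as above:
# normality_pvalues(simple = passing_regression, multi = mv_model)
```

A small p-value rejects normality of the residuals. A larger p-value for the multi-variable model than for the simple one would be consistent with the density plot, though it would not by itself establish that the residuals are normal.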

Question 9

Including passes into the penalty area has been clearly helpful. It has allowed us to scrutinize the progressive passes variable more closely, so much so that I’ve realized there are better variables with which to estimate the number of assists a team has in a season. In the previous simple regression, the conclusion was that progressive passes had a very strong correlation with the number of assists a team had. However, once the multi-variable model controls for passes into the penalty area, progressive passes no longer appear to be as significant an indicator as the previous iteration estimated. The difference may simply be that progressive passes are more loosely correlated with assists than passes into the penalty area are. This is consistent with the residual density plot, in which the multi-variable regression’s residuals reflect a more normal density curve.

Question 10

With the findings of these new tests, and in hopes of explaining the main sources of soccer teams’ assist totals, constructing new models that don’t use progressive passes may make more sense. Given the significance we’ve found for passes into the penalty area, the next step is to run more tests that combine PPA with other variables to see whether any are similarly significant.

AI Statement

I used AI throughout this project primarily to debug and to uncover features of R I didn’t know existed. I also used AI for other issues, such as syntax and code minimization.