This project will be less focused on making plots and visualizations of performance, and more focused on understanding which variables are predictors of success. I will also investigate how well different types of models perform.

I will begin by briefly describing the dataset. This data comes from FBref, a site which aggregates data from soccer matches in select competitions around the world. At the time I collected this data, FBref was sourcing its data from StatsBomb, but near the end of 2022 they switched to a new provider, Opta. So, it is likely there are discrepancies between the data you see below and what you would see if you repeated this process now, as Opta's data differs from StatsBomb's. However, some of the metrics I obtained here are not available in the new Opta data, so I will stick with the StatsBomb data throughout this analysis.

The data I gathered from FBref is a collection of 249 team-based (as opposed to player-based) metrics for each of the 98 clubs that competed in the 2021/22 season in the “Big 5” European leagues: the Premier League (England), La Liga (Spain), Ligue 1 (France), Serie A (Italy), and the Bundesliga (Germany). All cumulative stats are adjusted to “per match” to account for the Bundesliga playing fewer matches in a season than the other leagues.

Linear Regression:

I will start by using linear regression to predict total goals scored per game.

##     Predictors Correlation
## 1           xA   0.9158715
## 2         npxG   0.9145603
## 3          SoT   0.9108059
## 4  PenaltyArea   0.8583749
## 5      Att Pen   0.8534365
## 6      ProgRec   0.8501028
## 7         Prog   0.8501012
## 8     PassLive   0.8477226
## 9          SCA   0.8430512
## 10          KP   0.8398023
## 11         Att   0.8296132
## 12   FinalThrd   0.8280414
## 13     Touches   0.8265076
## 14         Cmp   0.8217371
## 15     Carries   0.8160869
## 16      MedAtt   0.8131842
## 17    GroundPs   0.8129139
## 18      MedCmp   0.8094933
## 19  TotDistCar   0.8091854
## 20     TotDist   0.8091568
## 21     PrgDist   0.8082426
## 22          Sh   0.8059073

I began by removing irrelevant variables from the dataset (variables specific to goalkeeping and defensive metrics). Then, I selected only the variables whose correlation with goals scored exceeds 0.8 in absolute value. I then removed the variables which inherently include goals scored in their computation (assists, goal-creating actions, goals per shot), and was left with the 22 variables above. In decreasing order of correlation, they are: expected assists, non-penalty expected goals, shots on target, passes into the opposition penalty area, touches in the opposition penalty area, progressive receptions, progressive passes, shot-creating live ball passes, shot-creating actions, key passes, passes attempted, passes into the final third, touches, passes completed, carries, medium distance passes attempted, ground passes attempted, medium distance passes completed, total distance carried, total passing distance, progressive passing distance, and shots attempted.
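
A minimal sketch of this filtering step, assuming the cleaned team-level data sits in a data frame called `team_stats` (an illustrative name) with goals per game in the column `Gls`:

```r
# Correlation of every numeric column with goals scored per game (Gls)
num_cols <- sapply(team_stats, is.numeric)
cors     <- sapply(team_stats[num_cols], function(x) cor(x, team_stats$Gls))

# Keep only the strong linear relationships (|r| > 0.8), excluding Gls itself
strong <- sort(cors[abs(cors) > 0.8 & names(cors) != "Gls"], decreasing = TRUE)
data.frame(Predictors = names(strong), Correlation = strong, row.names = NULL)
```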

That is still a lot of variables from an interpretability standpoint, but I suspect many of them are highly correlated with each other and can be removed.

By removing one member of each pair with correlation coefficient > 0.95, I was able to reduce the list to just 8 variables: non-penalty expected goals, shots on target, touches in the opposition penalty area, progressive receptions, shot-creating actions, passes attempted, passes into the final third, and progressive passing distance.
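
One convenient way to do this pruning is `caret::findCorrelation()`, though a manual pass over the correlation matrix works just as well. A sketch under the same naming assumptions, with `keep` holding the 22 variable names from above:

```r
library(caret)

cmat <- cor(team_stats[, keep])        # 22 x 22 correlation matrix

# Flag one member of every pair correlated above 0.95
drop_idx <- findCorrelation(cmat, cutoff = 0.95)
reduced  <- keep[-drop_idx]
reduced                                # 8 variables remain
```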

##   Variable                     Correlation With Goals
## A Non-Penalty xG               0.91                  
## B Shots on Target              0.91                  
## C Opp. Penalty Area Touches    0.86                  
## D Progressive Receptions       0.85                  
## E Shot-Creating Actions        0.84                  
## F Passes Attempted             0.83                  
## G Passes Into Final Third      0.83                  
## H Progressive Passing Distance 0.81

Before doing any model fitting, we should make sure that the relationships between these predictors and goals scored look linear.
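
The scatterplots behind that check can be produced along these lines, assuming a data frame `cand_data` (an illustrative name) holding `Gls` plus the 8 candidate predictors:

```r
library(ggplot2)
library(tidyr)

# One panel per candidate predictor, each plotted against goals per game
cand_data |>
  pivot_longer(-Gls, names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = value, y = Gls)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(x = NULL, y = "Goals per game")
```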

None of these relationships look clearly nonlinear, so it seems we are okay to attempt to fit a linear regression model. Since the remaining variables are all still pretty strongly correlated with each other, we will start with just one variable (non-penalty xG, the most correlated with goals scored), and add more until returns diminish.

Using k-fold cross-validation, we can see that results keep improving as we include up to 7 variables. However, once we include the 8th variable (progressive passing distance), we start to see signs of overfitting (a decrease in out-of-sample performance from fitting noise). So, we will include 7 variables in our model.
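
A sketch of that procedure, adding predictors in order of correlation with goals and scoring each model by k-fold cross-validated MSE (same hypothetical `cand_data` as above):

```r
set.seed(42)                                     # illustrative seed
vars  <- c("npxG", "SoT", "Att Pen", "ProgRec",  # ordered by correlation with Gls
           "SCA", "Att", "FinalThrd", "PrgDist")
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(cand_data)))

# Mean cross-validated test MSE for the model using the top p predictors
cv_mse <- sapply(seq_along(vars), function(p) {
  fml <- reformulate(sprintf("`%s`", vars[1:p]), response = "Gls")
  mean(sapply(1:k, function(i) {
    fit <- lm(fml, data = cand_data[folds != i, ])
    mean((cand_data$Gls[folds == i] - predict(fit, cand_data[folds == i, ]))^2)
  }))
})
which.min(cv_mse)   # the 8th variable should no longer help
```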

## 
## Call:
## lm(formula = Gls ~ ., data = lmdta)
## 
## Coefficients:
## (Intercept)         npxG          SoT    `Att Pen`      ProgRec          SCA  
##   -0.462401     0.427938     0.350790     0.010547    -0.003202    -0.052671  
##         Att    FinalThrd  
##    0.001024     0.008804

Performance:

Ultimately, this model only gets us about 0.03 squared goals per game (i.e., MSE) closer in our predictions than merely using expected goals alone; however, this improvement holds up across cross-validation simulations. In terms of prediction, that is only about 1 goal closer per team per season. But does it work on new data? This may be tough to answer, as FBref switched providers, so I don't have identical data to compare with for the current season, nor for past seasons, as the original source of this data is gone. But the reality is that most of these metrics should still exist with the new provider (Opta) and should take similar values, since they are measuring the same things.

We can visually compare how the model with all 7 predictors performs on the new data from the current Premier League season compared to the model with just non-penalty xG as the predictor.
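
A rough sketch of that plot; `fit7` (the 7-predictor fit above), `fit1` (the npxG-only fit), and `epl_new` (the current-season Premier League data) are placeholder names:

```r
library(ggplot2)

plot_df <- rbind(
  data.frame(model = "7 predictors", pred = predict(fit7, epl_new), Gls = epl_new$Gls),
  data.frame(model = "npxG only",    pred = predict(fit1, epl_new), Gls = epl_new$Gls)
)

ggplot(plot_df, aes(x = pred, y = Gls, colour = model)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +   # perfect predictions land on this line
  labs(x = "Predicted goals per game", y = "Actual goals per game")
```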

Overall, both models do pretty well, and visually, they appear to perform similarly well. The line represents where perfect predictions would land on the plot.

How do they compare mathematically?
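
The two numbers below are the models' root mean squared errors (RMSE) on the new data, computed along these lines (same placeholder names as above):

```r
# RMSE of a fitted model on a new data set
rmse <- function(model, newdata) {
  sqrt(mean((newdata$Gls - predict(model, newdata))^2))
}
rmse(fit1, epl_new)   # npxG-only model
rmse(fit7, epl_new)   # 7-predictor model
```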

## [1] 0.2456064
## [1] 0.2364364

The model with all 7 predictors does perform slightly better on the new data, but only by 0.009 goals per game, which is a negligible difference over the course of a season. The overall errors are worse than on the older data: by about 0.06 goals per 90 for the 1-predictor model, and by about 0.08 goals per 90 for the 7-predictor model. We wouldn't necessarily expect that, given that we fit the model using cross-validation to avoid overfitting, but it is not surprising given that the new data came from a different provider.

So, is there any point to including anything more than non-penalty xG? It seems like there probably isn’t. But we can check.
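
Nested models can be compared with a partial F-test via `anova()`. For the first check:

```r
# fit7 is the 7-predictor model fit above: lm(Gls ~ ., data = lmdta)
fit6 <- update(fit7, . ~ . - FinalThrd)   # drop the weakest (7th) predictor

# H0: the smaller model explains the data as well as the larger one
anova(fit6, fit7)
```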

## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SoT + `Att Pen` + ProgRec + SCA + Att
## Model 2: Gls ~ npxG + SoT + `Att Pen` + ProgRec + SCA + Att + FinalThrd
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     91 2.0682                           
## 2     90 2.0414  1  0.026863 1.1843 0.2794

The p-value for the 7th predictor in our model, passes into the final third, is 0.28, meaning there is no evidence that it is making our model better.

## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SoT + `Att Pen` + ProgRec + SCA
## Model 2: Gls ~ npxG + SoT + `Att Pen` + ProgRec + SCA + Att
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     92 2.3529                                  
## 2     91 2.0682  1   0.28473 12.528 0.0006339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our 6th predictor, passes attempted, does appear to improve the model, with p-value 0.0006. We will keep that predictor in the model.

## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SoT + `Att Pen` + ProgRec + Att
## Model 2: Gls ~ npxG + SoT + `Att Pen` + ProgRec + SCA + Att
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     92 2.3490                                  
## 2     91 2.0682  1   0.28082 12.356 0.0006874 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Shot creating actions also appear to improve the model to a statistically significant degree. We will keep this predictor.

## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SoT + `Att Pen` + SCA + Att
## Model 2: Gls ~ npxG + SoT + `Att Pen` + ProgRec + SCA + Att
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     92 2.0697                           
## 2     91 2.0682  1 0.0014857 0.0654 0.7988

Progressive receptions do not appear to significantly improve the model, as the ANOVA p-value is a huge 0.80. We will remove this predictor from the model.

## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SoT + SCA + Att
## Model 2: Gls ~ npxG + SoT + `Att Pen` + SCA + Att
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     93 2.1395                              
## 2     92 2.0697  1   0.06977 3.1013 0.08155 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Touches in the opposition penalty area also do not appear to significantly improve the model (p = 0.08). We can remove this predictor.

## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SCA + Att
## Model 2: Gls ~ npxG + SoT + SCA + Att
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     94 2.7859                                  
## 2     93 2.1395  1   0.64641 28.099 7.731e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Including shots on target improves our model to a highly significant degree. We should keep this predictor.

## Analysis of Variance Table
## 
## Model 1: Gls ~ SoT + SCA + Att
## Model 2: Gls ~ npxG + SoT + SCA + Att
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     94 2.4822                                  
## 2     93 2.1395  1   0.34276 14.899 0.0002092 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Including non-penalty expected goals also significantly improves the model’s performance.

So overall, it seems as though we really should only include non-penalty expected goals, shots on target, shot creating actions, and attempted passes in our model.

## 
## Call:
## lm(formula = Gls ~ npxG + SoT + SCA + Att, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31547 -0.09280 -0.00745  0.07007  0.45247 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5556737  0.1060830  -5.238 1.01e-06 ***
## npxG         0.5253122  0.1360920   3.860 0.000209 ***
## SoT          0.3201717  0.0604004   5.301 7.73e-07 ***
## SCA         -0.0382537  0.0126092  -3.034 0.003131 ** 
## Att          0.0014483  0.0003355   4.317 3.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1517 on 93 degrees of freedom
## Multiple R-squared:  0.8903, Adjusted R-squared:  0.8856 
## F-statistic: 188.7 on 4 and 93 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: Gls
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## npxG       1 16.3145 16.3145 709.1706 < 2.2e-16 ***
## SoT        1  0.5412  0.5412  23.5269 4.933e-06 ***
## SCA        1  0.0813  0.0813   3.5336   0.06327 .  
## Att        1  0.4286  0.4286  18.6327 3.954e-05 ***
## Residuals 93  2.1395  0.0230                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is something strange about these results, which I will mention shortly. But first, let's see how this model performs on the new data compared to the other models.
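
Reusing the `rmse()` helper and the placeholder new-season data from earlier:

```r
fit4 <- lm(Gls ~ npxG + SoT + SCA + Att, data = G_Data_rm_coln)
rmse(fit4, epl_new)
```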

## [1] 0.2344426

This model actually performs slightly better on the new data than the model including all 7 of the predictors and the model which only included non-penalty xG. However, the adjusted R-squared on the model with all of the predictors is slightly higher. Ultimately, the difference in performance between these two models is very small.

## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SoT + SCA + Att
## Model 2: Gls ~ npxG + SoT + `Att Pen` + ProgRec + SCA + Att + FinalThrd
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1     93 2.1395                          
## 2     90 2.0414  3  0.098118 1.442 0.2359

The ANOVA comparing the 4-predictor and 7-predictor models returns a p-value of 0.236, so we have no evidence that the three extra predictors significantly improve performance.

## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SoT + SCA + Att
## Model 2: Gls ~ npxG
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     93 2.1395                                  
## 2     96 3.1906 -3   -1.0512 15.231 3.859e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

On the other hand, the ANOVA test for similarity in performance between the 1 predictor model and the 4 predictor model returns a very low p-value (0.00000004), indicating that the 4 predictor model is significantly better.

Now, back to the weirdness I mentioned earlier. I’ll print the 4 predictor model summary again.

## 
## Call:
## lm(formula = Gls ~ npxG + SoT + SCA + Att, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31547 -0.09280 -0.00745  0.07007  0.45247 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5556737  0.1060830  -5.238 1.01e-06 ***
## npxG         0.5253122  0.1360920   3.860 0.000209 ***
## SoT          0.3201717  0.0604004   5.301 7.73e-07 ***
## SCA         -0.0382537  0.0126092  -3.034 0.003131 ** 
## Att          0.0014483  0.0003355   4.317 3.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1517 on 93 degrees of freedom
## Multiple R-squared:  0.8903, Adjusted R-squared:  0.8856 
## F-statistic: 188.7 on 4 and 93 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: Gls
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## npxG       1 16.3145 16.3145 709.1706 < 2.2e-16 ***
## SoT        1  0.5412  0.5412  23.5269 4.933e-06 ***
## SCA        1  0.0813  0.0813   3.5336   0.06327 .  
## Att        1  0.4286  0.4286  18.6327 3.954e-05 ***
## Residuals 93  2.1395  0.0230                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The weirdness lies in the following observation. The p-value for the coefficient on shot-creating actions (SCA) is significant, indicating that shot-creating actions do have some sort of effect on goalscoring. However, in the ANOVA table, the p-value for SCA having an effect on goals scored is not significant. (These two tests answer different questions: the sequential ANOVA assesses SCA after only npxG and SoT have entered the model, while the coefficient t-test conditions on all of the other predictors.) Furthermore, the estimated effect of SCA on goals scored in the model is actually negative, which is counter-intuitive if you follow soccer. Even if you don't, it seems strange that more shot-creating actions would be associated with less scoring, as they should pretty clearly be associated with more shots and, as such, more chances to score.

Visually, we can clearly see that the association between SCA and goals is not a negative one. Ultimately, weird results like this are a product of multicollinearity: predictors that are strongly correlated with one another.
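
The matrix below is just the pairwise correlations of the four predictors in the model:

```r
cor(G_Data_rm_coln[, c("npxG", "SoT", "SCA", "Att")])
```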

##           npxG       SoT       SCA       Att
## npxG 1.0000000 0.9280603 0.8946086 0.8166056
## SoT  0.9280603 1.0000000 0.9357512 0.7846102
## SCA  0.8946086 0.9357512 1.0000000 0.8073893
## Att  0.8166056 0.7846102 0.8073893 1.0000000

As we can see from the correlation matrix, all of the predictors are positively correlated with each other, most of them strongly.

If we take all combinations of linear models from this set of predictors which include SCA, which ones return a negative coefficient for SCA?
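
One way to fit all of those combinations programmatically (a sketch, using the same data frame as above):

```r
others  <- c("npxG", "SoT", "Att")

# Every subset of the other three predictors, from none up to all three
subsets <- c(list(character(0)),
             unlist(lapply(1:3, function(k) combn(others, k, simplify = FALSE)),
                    recursive = FALSE))

# Fit Gls ~ SCA + <subset> for each of the 8 combinations and inspect
fits <- lapply(subsets, function(s)
  lm(reformulate(c("SCA", s), response = "Gls"), data = G_Data_rm_coln))
invisible(lapply(fits, function(f) print(summary(f))))
```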

## 
## Call:
## lm(formula = Gls ~ SCA, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40148 -0.16898 -0.02726  0.13083  0.82884 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.614040   0.131119  -4.683 9.29e-06 ***
## SCA          0.102519   0.006675  15.358  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2424 on 96 degrees of freedom
## Multiple R-squared:  0.7107, Adjusted R-squared:  0.7077 
## F-statistic: 235.9 on 1 and 96 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Gls ~ SCA + npxG, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30468 -0.13501 -0.00033  0.10793  0.58274 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.26866    0.10584  -2.538   0.0128 *  
## SCA          0.01515    0.01119   1.355   0.1788    
## npxG         1.09121    0.12498   8.731 8.48e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1815 on 95 degrees of freedom
## Multiple R-squared:  0.8395, Adjusted R-squared:  0.8361 
## F-statistic: 248.5 on 2 and 95 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Gls ~ SCA + SoT, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39028 -0.10495 -0.02113  0.14207  0.65359 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.482138   0.102250  -4.715 8.27e-06 ***
## SCA         -0.009031   0.014576  -0.620    0.537    
## SoT          0.494699   0.060487   8.179 1.26e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1867 on 95 degrees of freedom
## Multiple R-squared:  0.8303, Adjusted R-squared:  0.8267 
## F-statistic: 232.3 on 2 and 95 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Gls ~ SCA + Att, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44340 -0.15745 -0.02403  0.14136  0.60831 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.9202007  0.1305297  -7.050 2.85e-10 ***
## SCA          0.0605120  0.0100423   6.026 3.18e-08 ***
## Att          0.0022942  0.0004428   5.181 1.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2152 on 95 degrees of freedom
## Multiple R-squared:  0.7745, Adjusted R-squared:  0.7697 
## F-statistic: 163.1 on 2 and 95 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Gls ~ SCA + npxG + SoT, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32851 -0.11624 -0.01412  0.09569  0.55962 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.30478    0.09671  -3.152  0.00218 ** 
## SCA         -0.02272    0.01317  -1.725  0.08782 .  
## npxG         0.72654    0.13934   5.214 1.09e-06 ***
## SoT          0.29742    0.06557   4.536 1.69e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1653 on 94 degrees of freedom
## Multiple R-squared:  0.8683, Adjusted R-squared:  0.8641 
## F-statistic: 206.6 on 3 and 94 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Gls ~ SCA + npxG + Att, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33985 -0.11474 -0.01002  0.10761  0.48866 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.4901922  0.1195878  -4.099  8.8e-05 ***
## SCA          0.0038655  0.0111126   0.348 0.728732    
## npxG         0.9364607  0.1269251   7.378  6.3e-11 ***
## Att          0.0012930  0.0003794   3.408 0.000964 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1722 on 94 degrees of freedom
## Multiple R-squared:  0.8572, Adjusted R-squared:  0.8526 
## F-statistic:   188 on 3 and 94 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Gls ~ SCA + SoT + Att, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37439 -0.09051 -0.00083  0.10044  0.47962 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.7457127  0.1006743  -7.407 5.48e-11 ***
## SCA         -0.0342793  0.0134642  -2.546   0.0125 *  
## SoT          0.4530480  0.0531734   8.520 2.56e-13 ***
## Att          0.0018919  0.0003377   5.602 2.10e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1625 on 94 degrees of freedom
## Multiple R-squared:  0.8727, Adjusted R-squared:  0.8687 
## F-statistic: 214.9 on 3 and 94 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Gls ~ SCA + npxG + SoT + Att, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31547 -0.09280 -0.00745  0.07007  0.45247 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5556737  0.1060830  -5.238 1.01e-06 ***
## SCA         -0.0382537  0.0126092  -3.034 0.003131 ** 
## npxG         0.5253122  0.1360920   3.860 0.000209 ***
## SoT          0.3201717  0.0604004   5.301 7.73e-07 ***
## Att          0.0014483  0.0003355   4.317 3.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1517 on 93 degrees of freedom
## Multiple R-squared:  0.8903, Adjusted R-squared:  0.8856 
## F-statistic: 188.7 on 4 and 93 DF,  p-value: < 2.2e-16

As we can see, on its own, SCA has a highly significant positive relationship with goals scored. However, as soon as we add other predictors (with the exception of pass attempts alone), the estimated effect of SCA becomes either negative or not significantly different from 0.

This seems to be an example of a key tension in statistical inference and prediction: sometimes we can improve our predictive ability at the expense of interpretability. We get better predictions on new data by including all 4 predictors, despite the fact that one relationship in the model, the negative effect of shot-creating actions on goals scored, makes no logical sense. Ultimately, whether to include SCA in the model depends on our goal. How important is it that we can explain the features of the model? If we only want to make the best possible predictions, then understanding the underlying relationships isn't really that important. But if we want to be able to explain the mechanics of our results, then including SCA in this model is counterproductive.

However, there is one more thing we haven’t checked. Do the effects some of our predictors have on goals scored depend on the values of the other predictors? Are there significant interaction effects we could include to improve the model? For simplicity’s sake, I’m only going to include the 4 predictors I used in the last model, as including all 7 I started off with would result in a very large number of interactions and would be difficult to interpret.
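
A sketch of the full-interaction fit and its comparison against the additive model (`fit4` from earlier):

```r
fit_int <- lm(Gls ~ npxG * SoT * SCA * Att, data = G_Data_rm_coln)
summary(fit_int)

# Partial F-test: does the extra flexibility beat the additive 4-predictor model?
anova(fit4, fit_int)
```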

## 
## Call:
## lm(formula = Gls ~ npxG * SoT * SCA * Att, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31147 -0.09264 -0.00583  0.07385  0.35656 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -6.054e-01  8.167e+00  -0.074    0.941
## npxG              6.760e-01  8.194e+00   0.082    0.934
## SoT               3.833e-01  2.733e+00   0.140    0.889
## SCA               1.772e-01  4.137e-01   0.428    0.670
## Att              -2.222e-03  1.843e-02  -0.121    0.904
## npxG:SoT          4.411e-02  1.890e+00   0.023    0.981
## npxG:SCA         -1.237e-01  4.136e-01  -0.299    0.766
## SoT:SCA          -5.042e-02  1.129e-01  -0.447    0.656
## npxG:Att          3.034e-04  1.652e-02   0.018    0.985
## SoT:Att           5.222e-04  5.840e-03   0.089    0.929
## SCA:Att          -2.005e-04  7.622e-04  -0.263    0.793
## npxG:SoT:SCA      2.233e-02  6.461e-02   0.346    0.731
## npxG:SoT:Att     -2.497e-05  4.026e-03  -0.006    0.995
## npxG:SCA:Att      1.745e-04  6.436e-04   0.271    0.787
## SoT:SCA:Att       5.740e-05  2.094e-04   0.274    0.785
## npxG:SoT:SCA:Att -3.699e-05  1.140e-04  -0.325    0.746
## 
## Residual standard error: 0.1516 on 82 degrees of freedom
## Multiple R-squared:  0.9034, Adjusted R-squared:  0.8858 
## F-statistic: 51.14 on 15 and 82 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SoT + SCA + Att
## Model 2: Gls ~ npxG * SoT * SCA * Att
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1     93 2.1395                          
## 2     82 1.8835 11   0.25595 1.013 0.4426

Fitting the full model with all of the interactions yields no significant improvement in performance (p-value 0.44), and none of the effects, not even the main effects, are significant. This is clearly a bad choice. What happens if we include only one interaction effect? The one that seems most plausible to me is between shot-creating actions and passes attempted: shot-creating actions may matter less for teams that attempt fewer passes. So I will include that interaction effect and no others.

## 
## Call:
## lm(formula = Gls ~ npxG + SoT + SCA * Att, data = G_Data_rm_coln)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31633 -0.09278 -0.00755  0.06996  0.45198 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.868e-01  4.617e-01  -1.271 0.207020    
## npxG         5.261e-01  1.373e-01   3.832 0.000232 ***
## SoT          3.199e-01  6.081e-02   5.261 9.28e-07 ***
## SCA         -3.661e-02  2.696e-02  -1.358 0.177841    
## Att          1.506e-03  8.959e-04   1.681 0.096222 .  
## SCA:Att     -2.978e-06  4.303e-05  -0.069 0.944968    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1525 on 92 degrees of freedom
## Multiple R-squared:  0.8903, Adjusted R-squared:  0.8844 
## F-statistic: 149.4 on 5 and 92 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: Gls ~ npxG + SoT + SCA + Att
## Model 2: Gls ~ npxG + SoT + SCA * Att
##   Res.Df    RSS Df  Sum of Sq      F Pr(>F)
## 1     93 2.1395                            
## 2     92 2.1394  1 0.00011141 0.0048  0.945

There is no significant interaction between SCA and attempted passes, and we get no improvement in model performance from including this interaction term.

Ultimately, it seems that the 4-predictor model with no interaction terms is the best we can do using linear regression.