### Packages
library(tidyverse)
library(corrplot)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A", "Spanish Primera Division", "German Bundesliga"))

New Model

My original model just looked at modeling goals scored (score1) with xG (xg1). Now, I will include an interaction between team strength (spi1) and xg1, as it is likely that stronger teams will produce better chances. I also want to see if more important matches lead to more goals scored, while accounting for team strength and xG. Thus, the new model is:

\[score1 = \beta_0+\beta_1(xg1) + \beta_2(spi1) + \beta_3(importance1) + \beta_4(xg1*spi1)\]

m2 <- lm(score1 ~ xg1*spi1 + importance1, data = df_top_leagues)
summary(m2)
## 
## Call:
## lm(formula = score1 ~ xg1 * spi1 + importance1, data = df_top_leagues)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2121 -0.7353 -0.0974  0.6274  4.9183 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.485e-01  1.328e-01  -1.871   0.0614 .  
## xg1          7.944e-01  7.466e-02  10.639  < 2e-16 ***
## spi1         7.917e-03  1.964e-03   4.032 5.58e-05 ***
## importance1 -5.603e-05  4.616e-04  -0.121   0.9034    
## xg1:spi1     3.366e-04  1.019e-03   0.330   0.7413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 8811 degrees of freedom
##   (314 observations deleted due to missingness)
## Multiple R-squared:  0.3472, Adjusted R-squared:  0.3469 
## F-statistic:  1171 on 4 and 8811 DF,  p-value: < 2.2e-16

We get this fitted model:

\[\widehat{score1} = -0.25 + 0.79(xg1) + 0.79(spi1) - 0.000056(importance1) + 0.00037(xg1*spi1)\]

The most important part of this model that we notice is the coefficient of the interaction term at 0.00037. This indicates that stronger teams are more efficient at converting chances into goals, as the relationship between xG and actual goals becomes stronger as team strength increases according to this coefficient. However, from the t-test associated with this model, we can see that the interaction term has a very high p-value (0.74), so this variable is likely not statistically significant in determining goals scored.

Similarly, with importance, we see a very small coefficient (almost 0), with a very small large, indicating that there is no strong relationship between match importance and actual goals scored.

From this model, we can interpret our terms for xG and SPI individually. Holding team strength and match importance constant, we would expect a team to score 0.79 more actual goals with an increase of 1 xG. For SPI, if we hold xG and match importance constant, we would expect a team to score 0.79 more goals as well if we increase SPI by 1. It is important to note, though, that since SPI is on a scale from 0 - 1, this interpretation basically says the best possible team would be expected to score 0.79 more goals than the worst possible team, holding the other variables constant.

Based on this, I would likely just keep xG and SPI in the model, and negate the interaction term and match importance.

Diagnostics

Residuals vs Fitted

plot(m2, which = 1)

Residuals vs X

model_data <- model.frame(m2)

plot(model_data$xg1, resid(m2))
abline(h = 0, col = "red")

Based on these 2 plots, we do see a slight funnel shape, but nothing too alarming. Naturally, smaller xG values will generally see less fluctuation in actual goals scored, as this is also associated with less total shots and just worse chances in general. It is much more difficult for outside noise (finishing quality, goalkeeping quality, etc.) to affect how much goals scored varies with less shots and less xG overall. The slight funnel shape does suggest minor heteroscedasticity, or non-constant variance, but it only really seems to be an issue for smaller values of xG or our model’s predicted goals.

Multicollinearity Heat Map

vars <- df_top_leagues[, c("xg1", "spi1", "importance1")]
cor_matrix <- cor(vars, use = "complete.obs")

corrplot(cor_matrix, method = "color", addCoef.col = "black")

There is a slight concern for collinearity between SPI and xG (0.38), though it is not overly strong. Match importance does not seem to be too strongly correlated with xG, but does have a slight correlation with SPI. However, we are likely to remove match importance from the model anyways, so this point is not concerning.

Residual Histogram

hist(resid(m2), breaks = 30)

QQ Plot

plot(m2, which = 2)

While the QQ-Plot does have a slightly long upper tail, the combination of this plot (with a majority of points along the line) and a generally bell-shaped histogram of residuals, the residuals do appear to be roughly normally distributed. Thus, I would say that this assumption is satisfied.

Overall, I believe our model assumptions of constant variance, normality of errors, the linear relationship between our response and explanatory variables, independent observations, and no major correlation between explanatory variables are reasonably satisfied. There are some slight concerns for collinearity and hereroscedasticity, but these are minor concerns and should not substantially impact any conclusions. The collinearity issue, in particular, feels generally unavoidable in this sport context, as team quality is naturally going to show some relationship with most performance-related variables.

The model, from these terms, that I would suggest going forward would just include xg1 and spi1, as the other 2 terms showed little statistical significance.