2025-03-15

Intro

3D Plot

For example, consider the 3D plot of Away PPG, Home PPG, and Total PPG. How can we tell whether there is a definite relationship, i.e., if a team is winning away games, is it also winning home games?

Basics of Linear Regression

We can test whether there is a relationship between these variables using linear regression.

Linear regression produces the equation of a line of the form \(y = \alpha + \beta x\), where \(\alpha\) is the intercept and \(\beta\) is the slope. The fitted line can reveal a relationship between variables, predict future results, show trends, and highlight outliers.

Least Square Estimates

To fit the regression line, the least squares estimates are often used to find the values of \(\alpha\) and \(\beta\):

\(\alpha = \overline{y} - \beta \overline{x} \\ \beta = \frac{\sum_{i=1}^{n} x_i y_i - \frac{(\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{n} }{\sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n} }\)

These values are then used as the coefficients of the regression line.
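As a minimal sketch of these formulas, the R snippet below computes \(\alpha\) and \(\beta\) by hand and compares them with the result of `lm()`. The vectors `x` and `y` are made-up illustration values, not the league data used in these slides.

```r
# Hypothetical illustration data (not taken from the league data set)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Slope: (sum(x*y) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
beta <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)

# Intercept: mean(y) - beta * mean(x)
alpha <- mean(y) - beta * mean(x)

# lm() should return the same coefficients, up to floating-point error
fit <- lm(y ~ x)
print(c(alpha = alpha, beta = beta))
print(coef(fit))
```

In practice `lm()` is the usual way to get these estimates; the hand calculation is only there to show what it computes for a single predictor.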

Away vs Home Points Per Game (PPG)

Visually, Arsenal appear to be better at home and Crystal Palace better away than the other teams. The remaining teams fall close to a line, which suggests a relationship.

P-Value

The p-value is the probability of seeing a relationship at least this strong if there were actually no linear relationship between the variables. Here the p-value is less than 0.05, so we can conclude that there is a statistically significant relationship between Away PPG and Home PPG. In other words, teams that win away games are most likely winning a similar amount at home, except for the outliers, which would require further examination.
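A minimal sketch of how such a p-value can be read off in R is shown below. The data frame `toy` and its values are hypothetical stand-ins; only the column names `points_per_game_away` and `points_per_game_home` are borrowed from the plotting code in these slides.

```r
# Hypothetical stand-in data; the real analysis would use the league data frame
toy <- data.frame(
  points_per_game_away = c(0.6, 0.9, 1.1, 1.3, 1.6, 1.9, 2.1),
  points_per_game_home = c(1.0, 1.2, 1.6, 1.5, 2.0, 2.3, 2.4)
)

# Fit Home PPG as a linear function of Away PPG
fit <- lm(points_per_game_home ~ points_per_game_away, data = toy)

# The coefficient table has one row per term; column "Pr(>|t|)" holds the p-values
coefs <- summary(fit)$coefficients
p_slope <- coefs["points_per_game_away", "Pr(>|t|)"]

# Compare against the conventional 0.05 significance level
print(p_slope)
print(p_slope < 0.05)
```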

Code for the Away vs Home PPG Plot

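library(ggplot2)

# Scatter plot of Home PPG against Away PPG, labelled by team name
g <- ggplot(df, aes(x = points_per_game_away,
                    y = points_per_game_home,
                    label = common_name))
# Fitted regression line with a 95% confidence band
g <- g + geom_smooth(method = "lm", formula = y ~ x, level = 0.95)
g <- g + xlab("Away PPG") + ylab("Home PPG")
g <- g + geom_point() + geom_text(hjust = 0.5, vjust = 0)
g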

Another Example

Suppose we want to test whether there is a relationship between possession and points per game.

Stats

## 
## Call:
## lm(formula = df$points_per_game ~ df$average_possession)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82419 -0.06974  0.01597  0.16936  0.51003 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.24924    0.42737  -2.923  0.00908 ** 
## df$average_possession  0.05305    0.00842   6.301 6.13e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3174 on 18 degrees of freedom
## Multiple R-squared:  0.688,  Adjusted R-squared:  0.6707 
## F-statistic:  39.7 on 1 and 18 DF,  p-value: 6.126e-06

Conclusion

As we can see from the summary of the linear regression between PPG and possession, the p-value (6.126e-06) is well below 0.05, so we can conclude that there is a statistically significant relationship between possession and points per game. The R-squared of 0.688 indicates that possession accounts for roughly 69% of the variation in points per game.
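The coefficients in the summary also define the fitted line, which can be used for prediction. The sketch below plugs the reported intercept and slope into \(y = \alpha + \beta x\). The helper `predict_ppg` and the possession values are hypothetical, and the sketch assumes `average_possession` is measured in percentage points.

```r
# Coefficients copied from the regression summary
alpha <- -1.24924  # (Intercept)
beta  <-  0.05305  # slope for average_possession

# Hypothetical helper: predicted points per game for a given possession value
predict_ppg <- function(possession) {
  alpha + beta * possession
}

# Hypothetical possession values, assumed to be in percentage points
print(predict_ppg(c(45, 50, 55)))
```

Under this fit, each additional point of possession is associated with about 0.053 more points per game. This describes an association in the data, not a causal effect.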