Using linear regression to detect trends in English Premier League teams' stats in the 2018 to 2019 season.
Data obtained from https://footystats.org/download-stats-csv#.
2025-03-15
For example, consider the 3D plot of Away PPG, Home PPG, and Total PPG (points per game). How can we tell if there is a definite relationship, i.e., if a team is winning away games, is it also winning home games?
We can test whether there is a relationship between these variables using linear regression.
Linear regression produces a formula for a line in the form \(y = \alpha + \beta x\), where \(\alpha\) is the intercept and \(\beta\) is the slope. The fitted line can reveal a relationship between variables, be used to predict future results or show trends, and highlight outliers.
To fit the line, the least squares estimates are often used to find the values of \(\alpha\) and \(\beta\), where
\(\beta = \frac{\sum_{i=1}^{n} x_i y_i - \frac{(\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n}}\)
\(\alpha = \overline{y} - \beta \overline{x}\)
These estimates are then used as the coefficients of the regression line.
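As a quick check on the formulas above, here is a minimal R sketch that computes \(\alpha\) and \(\beta\) by hand and compares them with the coefficients from `lm()`. The vectors `x` and `y` are made-up illustrative numbers, not values from the footystats data.

```r
# Illustrative values only (not taken from the footystats data).
x <- c(0.8, 1.0, 1.2, 1.5, 1.9, 2.3)
y <- c(1.1, 1.4, 1.3, 1.9, 2.2, 2.6)
n <- length(x)

# Slope: (sum(x*y) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
beta <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)

# Intercept: mean(y) - beta * mean(x)
alpha <- mean(y) - beta * mean(x)

# lm() fits the same least squares line, so its coefficients should agree.
fit <- lm(y ~ x)
print(c(alpha = alpha, beta = beta))
print(coef(fit))
```

If the two printed pairs match, the hand-computed estimates agree with R's built-in fit.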
Here we can see that, compared to other teams, Arsenal appear to be better at home and Crystal Palace better away, while all the other teams roughly follow a line, implying a relationship.
The p-value measures how likely it would be to see a relationship at least this strong if Away PPG and Home PPG were actually unrelated. Here the p-value is less than 0.05, so we can conclude there is a statistically significant relationship between Away PPG and Home PPG: teams that win away games tend to win a similar amount at home, except for those outliers, which would require further examination.
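library(ggplot2)

# df is assumed to already hold the footystats team table, one row per team.
g <- ggplot(df, aes(x = points_per_game_away,
                    y = points_per_game_home,
                    label = common_name))
# Fitted regression line with a 95% confidence band
g <- g + geom_smooth(method = "lm", formula = y ~ x, level = 0.95)
g <- g + xlab("Away PPG") + ylab("Home PPG")
# One point per team, labelled with the team name
g <- g + geom_point() + geom_text(hjust = 0.5, vjust = 0)
g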
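The p-value and the outliers discussed above can also be checked numerically rather than by eye. The sketch below is one possible way to do that. It uses a small hypothetical data frame `toy` with the same column names as `df` (the team names and values are invented for illustration) and flags candidate outliers with standardized residuals, which is a technique chosen here and not necessarily the one used for the original analysis.

```r
# Hypothetical stand-in for df; the real values come from the footystats CSV.
toy <- data.frame(
  common_name = paste("Team", LETTERS[1:8]),
  points_per_game_away = c(0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0),
  points_per_game_home = c(1.0, 1.3, 1.4, 2.4, 1.8, 2.1, 2.2, 2.5)
)

# Same model that geom_smooth(method = "lm") draws: home PPG against away PPG.
fit <- lm(points_per_game_home ~ points_per_game_away, data = toy)

# p-value for the slope, read from the coefficient table of the summary.
p_value <- summary(fit)$coefficients["points_per_game_away", "Pr(>|t|)"]
print(p_value)

# Standardized residuals: large absolute values mark candidate outliers.
toy$std_resid <- rstandard(fit)
print(toy[order(-abs(toy$std_resid)), c("common_name", "std_resid")])
```

Teams at the top of the printed table sit furthest from the fitted line and are the ones worth examining further.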
Now consider whether there is a relationship between possession and points per game.
## 
## Call:
## lm(formula = df$points_per_game ~ df$average_possession)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82419 -0.06974  0.01597  0.16936  0.51003 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.24924    0.42737  -2.923  0.00908 ** 
## df$average_possession  0.05305    0.00842   6.301 6.13e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3174 on 18 degrees of freedom
## Multiple R-squared:  0.688,  Adjusted R-squared:  0.6707 
## F-statistic:  39.7 on 1 and 18 DF,  p-value: 6.126e-06
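As the summary of the linear regression between PPG and possession shows, the p-value is well below 0.05 (6.126e-06), so we can conclude that there is a statistically significant relationship between possession and points per game. The multiple R-squared of 0.688 indicates that possession accounts for roughly 69% of the variation in points per game across teams.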
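To see what the fitted line implies, the coefficients from the summary can be plugged back into \(y = \alpha + \beta x\). The sketch below does this in R. The helper `predict_ppg` is a hypothetical name introduced for illustration, and it assumes `average_possession` is measured in percent.

```r
# Coefficients copied from the regression summary above.
intercept <- -1.24924
slope <- 0.05305

# Hypothetical helper: predicted points per game for a given average
# possession, assuming possession is expressed in percent.
predict_ppg <- function(possession) {
  intercept + slope * possession
}

# Predicted PPG at 40%, 50% and 60% possession.
print(predict_ppg(c(40, 50, 60)))
```

Under this model, each additional percentage point of possession is associated with about 0.053 more points per game. This describes an association in the 2018 to 2019 data and does not by itself show that possession causes better results.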