### Packages
library(tidyverse)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A",
                       "Spanish Primera Division", "German Bundesliga"))
Home-field advantage is a widely accepted phenomenon in sports: the team playing at home gets a slight “boost” in performance compared with playing away or at a neutral site. Several factors are thought to contribute, including travel fatigue, a crowd that is hostile to the away team, and familiarity with the venue. I want to test whether this home-field advantage actually shows up in goals scored. With 5 seasons of data across 5 leagues, I believe we have enough data to safely perform a hypothesis test on this.
Here, “Group A” would be score1 (Home Goals) and “Group
B” would be score2 (Away Goals). These are the Null and
Alternative hypotheses:
\[H_0: \mu_{home} = \mu_{away}\] \[H_A: \mu_{home} > \mu_{away}\]
I have chosen \(\alpha\) = 0.05. This is a standard Type I error rate in sports analytics and feels appropriate for a home-field advantage context. Many other factors obviously affect how many goals a team scores, but with 5 years and 5 leagues of data covering both excellent and poor teams, we should have enough data to support an \(\alpha\) this low. In this context, a Type I error means we claim a home-field advantage exists when none does.
Then, I have chosen a power \((1-\beta)\) of 0.8. Again, this is a fairly standard choice in sports analytics, and it is reasonable given the amount of data that we have. With this much data, reaching that power for a modest effect size should not be a problem.
Lastly, I have gone with a minimum effect size of \(\delta = 0.2\) goals. Given how much team quality factors into goal scoring, even in home vs away matches, we don’t want a very high \(\delta\) for our minimum effect size. Still, 0.2 goals per match adds up over a season. Each club plays half of its league matches at home (19 of 38, or 17 of 34 in the Bundesliga), so a minimum effect size of 0.2 means that a team would score about 3.8 (or 3.4) more goals in its home matches than in its away matches. When 1 goal can change a game and its result, a margin like this can seriously shift league standings.
The sample size these choices require can be checked below:
### Sample Size Calculation
power.t.test(
  delta = 0.20,
  sd = sd(c(df_top_leagues$score1, df_top_leagues$score2), na.rm = TRUE),
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "one.sided"
)
##
## Two-sample t test power calculation
##
## n = 487.7157
## delta = 0.2
## sd = 1.255198
## sig.level = 0.05
## power = 0.8
## alternative = one.sided
##
## NOTE: n is number in *each* group
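This calculation shows that detecting a 0.2-goal difference with our chosen parameters requires only ~488 matches per group. The filtered data contains several thousand matches with recorded scores (the Welch test below reports about 17867 degrees of freedom across the two groups combined), so we have plenty of data.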
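As a rough cross-check, the required sample size can also be approximated by hand with the usual normal-approximation formula for a one-sided, two-sample comparison, \(n \approx 2(z_{1-\alpha} + z_{1-\beta})^2\sigma^2/\delta^2\). The sketch below plugs in the values from the call and output above; the variable names are my own, and the formula is an approximation rather than the exact noncentral-t calculation that `power.t.test()` performs.

```r
# Inputs taken from the power.t.test() call and its printed output
alpha <- 0.05      # one-sided significance level
power <- 0.80      # target power (1 - beta)
delta <- 0.20      # minimum effect size in goals
sigma <- 1.255198  # sd reported in the power.t.test() output

# Normal-approximation sample size per group (an approximation, not
# the exact t-based result)
z_alpha <- qnorm(1 - alpha)
z_beta  <- qnorm(power)
n_approx <- 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
n_approx
```

The approximation should land within about one match of the 487.7 reported by `power.t.test()`, slightly below it because the normal quantiles are a little smaller than the corresponding t quantiles.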
### Testing
t.test(df_top_leagues$score1,
       df_top_leagues$score2,
       alternative = "greater",
       var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: df_top_leagues$score1 and df_top_leagues$score2
## t = 16.752, df = 17867, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.2800605 Inf
## sample estimates:
## mean of x mean of y
## 1.543582 1.233027
### Reshaping for Visualization
df_long <- df_top_leagues |>
  select(score1, score2) |>
  pivot_longer(cols = everything(),
               names_to = "team_type",
               values_to = "goals")
### Boxplot of home vs away goals
ggplot(df_long, aes(x = team_type, y = goals)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Home vs Away Goals") +
  theme_bw()
Based on the test, we can reject the null hypothesis that mean home and away goals are equal, in favor of the alternative that home teams score more. The p-value is below 2.2e-16, far lower than our \(\alpha = 0.05\) Type I error rate. The lower bound of the one-sided confidence interval is 0.28, which means that we are 95% confident the true difference between home and away goals is at least 0.28 goals per match.
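The lower confidence bound can be recovered from the numbers printed by the test, which makes it clearer where the 0.28 comes from. This is a minimal sketch that copies the rounded values from the output above (so the result only matches to a few decimals); the variable names are my own.

```r
# Values copied from the Welch t-test output
mean_home <- 1.543582
mean_away <- 1.233027
t_stat    <- 16.752
df_welch  <- 17867

diff_means <- mean_home - mean_away   # estimated home advantage per match
std_error  <- diff_means / t_stat     # since t = difference / standard error

# One-sided 95% lower bound: difference minus t quantile times standard error
lower_bound <- diff_means - qt(0.95, df_welch) * std_error
round(c(difference = diff_means, se = std_error, lower = lower_bound), 4)
```

The lower bound should come out close to the 0.28 reported by `t.test()`; small discrepancies come from the rounded t statistic.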
Our test estimates a difference between mean home and away goals of ~0.31. Since a team plays 19 of its 38 league matches at home, this implies that a team is expected to score about 5.9 more goals in its home matches than in its away matches over a season, which has real standings implications.
Our box plot also shows this. While the median is the same for both, the first quartile for home goals is 1 goal, whereas it is 0 goals for away goals. This implies that at least 75% of home teams score at least 1 goal, while we can only say at least 50% for away teams. The box for away goals is therefore wider, since its lower quartile reaches down to 0, which underscores the essence of home-field advantage in soccer. Team quality is obviously a huge factor, but simply being at home clearly provides an expected boost in performance. It doesn’t determine who wins, but it is a contributing factor.
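To make the quartile reading concrete, here is a small sketch on made-up goal counts (not the match data) chosen so that both groups share a median of 1 but differ in their first quartile, which is the pattern described above. The vectors `home_goals` and `away_goals` are hypothetical.

```r
# Hypothetical goal counts, chosen only to illustrate the quartile pattern
home_goals <- c(1, 1, 1, 1, 1, 2, 2, 3)
away_goals <- c(0, 0, 0, 1, 1, 1, 2, 3)

# First quartile and median for each group
q_home <- unname(quantile(home_goals, c(0.25, 0.5)))
q_away <- unname(quantile(away_goals, c(0.25, 0.5)))

# Share of matches with at least one goal scored
share_home <- mean(home_goals >= 1)
share_away <- mean(away_goals >= 1)

rbind(home = c(q1 = q_home[1], median = q_home[2], share_scoring = share_home),
      away = c(q1 = q_away[1], median = q_away[2], share_scoring = share_away))
```

In this toy example a first quartile of 1 goes with at least 75% of the group scoring, while a first quartile of 0 with a median of 1 only guarantees that at least half do.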
Now, I want to explore how SPI affects win probabilities. Naturally,
one would think that a team with a higher SPI (better team) would have a
higher win probability than a team with a lower SPI, but there are other
factors that go into predicting the outcome. Here, we will group teams
into a “High SPI” and a “Low SPI” group, split at the median based on
spi1. We just use spi1 since the SPI
calculation includes a small, consistent home field advantage factor, so
if the 2 teams played with the same SPI rankings, but the home and away
teams flip, the SPI values would just shift proportionally given the
home field advantage factor. Thus, we can safely just look at Home
SPI.
### Creating SPI split based on the median
median_spi <- median(df_top_leagues$spi1, na.rm = TRUE)
df_top_leagues <- df_top_leagues |>
  mutate(
    spi_group = ifelse(spi1 >= median_spi, "High SPI", "Low SPI")
  )
Here, we can use a 2 sided test. Thus, our hypotheses are:
\[H_0: \mu_{\text{win prob, High SPI}} = \mu_{\text{win prob, Low SPI}}\] \[H_A: \mu_{\text{win prob, High SPI}} \neq \mu_{\text{win prob, Low SPI}}\]
t.test(prob1 ~ spi_group, data = df_top_leagues)
##
## Welch Two Sample t-test
##
## data: prob1 by spi_group
## t = 55.389, df = 8020.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High SPI and group Low SPI is not equal to 0
## 95 percent confidence interval:
## 0.1772027 0.1902056
## sample estimates:
## mean in group High SPI mean in group Low SPI
## 0.5463729 0.3626688
### Boxplot of win prob by SPI group
ggplot(df_top_leagues, aes(x = spi_group, y = prob1)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Win Probability by SPI Group") +
  theme_bw()
From the t-test, we can reject our null hypothesis that there is no difference in mean win probability between High and Low SPI teams. Again, the p-value is below 2.2e-16, giving strong evidence against the null. We are 95% confident that the true difference in mean win probability between High and Low SPI teams is between 0.177 and 0.190. This narrow interval indicates that the difference is estimated precisely. The point estimate of the difference in means is around 0.184. The data clearly provide strong evidence that a higher SPI is associated with a higher win probability, as we would expect.
This is echoed in the box plots. Both groups have a similar spread, but the median and both quartiles are higher for the High SPI group than for the Low SPI group. In this data, a higher SPI generally goes with a higher win probability.
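The 95% interval for the SPI comparison can be reproduced approximately from the printed t-test output, which also shows why it is so narrow: the standard error is tiny relative to the difference. The worked steps below use the rounded values from the output and a critical value of about 1.96, so they are an approximation of what `t.test()` computes.

\[\bar{x}_{\text{High SPI}} - \bar{x}_{\text{Low SPI}} = 0.5463729 - 0.3626688 = 0.1837041\]
\[SE \approx \frac{0.1837041}{55.389} \approx 0.00332\]
\[0.1837 \pm 1.96 \times 0.00332 \approx (0.1772,\ 0.1902)\]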