Elia Lanza (s3918715)
Last updated: 17 October, 2021
Since time immemorial, fans, supporters and admirers of sport have watched and followed their chosen teams. Some follow via streaming devices, others receive updates from friends and family and, lastly but perhaps most importantly, a special few are able to watch their team in a stadium. Fans and sport are intrinsically linked, with supporters often called the 12th man on the field. Their support swells through songs cheering on favourite players, and when the opposition has possession of the ball the switch is flicked and raucous booing can ensue, creating a hostile environment for the opposing team. Sport and fans form a symbiotic relationship, and the tribalism that sport heralds depends on fan participation.
Over the last eighteen months, supporters have not been permitted to attend matches in stadiums due to the Covid-19 outbreak and only recently have been allowed back into the arena.
Utilising statistical methods such as distributions, hypothesis testing, regression analysis and estimating uncertainty, this paper will attempt to determine whether there has been a significant difference in a team’s performance when fans have been allowed in stadiums compared to when a home team plays without a fan base.
The factors that will be explored are:
Change in home win rate
Change in goals scored
The data was retrieved initially from the Kaggle database and then enriched and validated against Bet365, a UK-based betting agency that provided the necessary metrics and statistics for each Premier League game.
The key to the results data is shown as below for reference:
HomeTeam = The team playing at home
AwayTeam = The team playing away from home
HomeGoals = Full Time Home Team Goals
AwayGoals = Full Time Away Team Goals
Full-TimeResult = Full Time Result (H=Home Win, D=Draw, A=Away Win)
Covid = Whether the stadiums had home fans Y/N
To achieve my goal, I have installed and loaded the dplyr package to summarise my data in tables and to write code using the pipe operator %>%. I have also loaded ggplot2 as a means of plotting my data in a visual format.
A t-test will be conducted to see whether there is a statistically significant difference in team performance between matches with and without crowds at stadiums. Additionally, I will utilise various descriptive statistics and plots to visualise the difference between having crowds at stadiums and not.
The tables below report the mean, median and standard deviation (SD) of both Home Goals and Away Goals. Additionally, I have calculated the first and third quartiles and the interquartile range, as well as the maximum and minimum values in the sample. Furthermore, I have grouped the results by Covid to show whether there is any discernible difference from an initial analysis. Missing values are excluded by setting the “na.rm = TRUE” argument.
Once the summary tables were computed, I used visualisation techniques such as box plots and histograms to show the distribution of the summarised data. What is evident early on is that mean home goals are slightly higher in the Covid = Y group, which the key above describes as matches with home fans (1.54 against 1.51 for Covid = N). This is consistent with a home team being more likely to score in front of a crowd. However, the standard deviation is also higher in that group (1.42 against 1.22), so the difference is small relative to the spread. The box plots also show some outliers, which hints at a chance of scoring “many” goals; however, there is no significant evidence to back up this hypothesis.
library(readr)    # read_csv()
library(dplyr)    # %>%, filter(), group_by(), summarise()
library(ggplot2)  # plots further below
# Load the match results; View() opens the data viewer in interactive sessions
EPL_Data <- read_csv("EPL Data.csv")
View(EPL_Data)
# Split the matches by the Covid flag
N <- EPL_Data %>% filter(Covid == "N")
Y <- EPL_Data %>% filter(Covid == "Y")
# Summary statistics of home goals, grouped by Covid
table_1 <- EPL_Data %>%
  group_by(Covid) %>%
  summarise(
    Mean = mean(HomeGoals, na.rm = TRUE),
    Median = median(HomeGoals, na.rm = TRUE),
    SD = sd(HomeGoals, na.rm = TRUE),
    Q1 = quantile(HomeGoals, probs = 0.25, na.rm = TRUE),
    Q3 = quantile(HomeGoals, probs = 0.75, na.rm = TRUE),
    IQR = IQR(HomeGoals, na.rm = TRUE),
    Maximum = max(HomeGoals, na.rm = TRUE),
    Minimum = min(HomeGoals, na.rm = TRUE)
  )
# Summary statistics of away goals, grouped by Covid
table_2 <- EPL_Data %>%
  group_by(Covid) %>%
  summarise(
    Mean = mean(AwayGoals, na.rm = TRUE),
    Median = median(AwayGoals, na.rm = TRUE),
    SD = sd(AwayGoals, na.rm = TRUE),
    Q1 = quantile(AwayGoals, probs = 0.25, na.rm = TRUE),
    Q3 = quantile(AwayGoals, probs = 0.75, na.rm = TRUE),
    IQR = IQR(AwayGoals, na.rm = TRUE),
    Maximum = max(AwayGoals, na.rm = TRUE),
    Minimum = min(AwayGoals, na.rm = TRUE)
  )
knitr::kable(table_1)
| Covid | Mean | Median | SD | Q1 | Q3 | IQR | Maximum | Minimum |
|---|---|---|---|---|---|---|---|---|
| N | 1.505587 | 1 | 1.224732 | 1 | 2 | 1 | 8 | 0 |
| Y | 1.543478 | 1 | 1.417419 | 0 | 2 | 2 | 5 | 0 |
knitr::kable(table_2)
| Covid | Mean | Median | SD | Q1 | Q3 | IQR | Maximum | Minimum |
|---|---|---|---|---|---|---|---|---|
| N | 1.192737 | 1 | 1.166408 | 0 | 2 | 2 | 9 | 0 |
| Y | 1.173913 | 1 | 1.182368 | 0 | 2 | 2 | 5 | 0 |
# Box plot of home goals by Covid flag; the red point marks each group's mean
BoxPlot <- ggplot(data = EPL_Data, aes(x = Covid, y = HomeGoals)) + geom_boxplot(aes(fill = Covid))
BoxPlot + labs(title = "Covid Effect on Team Performance", x = "Covid", y = "Home Goals") + stat_summary(fun = mean, colour = "red", geom = "point")
# Same box plot for away goals
BoxPlot <- ggplot(data = EPL_Data, aes(x = Covid, y = AwayGoals)) + geom_boxplot(aes(fill = Covid))
BoxPlot + labs(title = "Covid Effect on Team Performance", x = "Covid", y = "Away Goals") + stat_summary(fun = mean, colour = "red", geom = "point")
# Histogram of home goals, one panel per Covid group; the red line marks the mean of HomeGoals
Plot1 <- ggplot(data = EPL_Data, aes(x = HomeGoals))
Plot1 + geom_histogram(fill = "Light Blue", colour = "black") + geom_vline(aes(xintercept = mean(HomeGoals)), col = 'red', size = 1) + facet_wrap(~Covid)
Below is the result of a chi-squared test of independence. The test is used to assess whether two categorical variables are associated, by comparing the observed frequencies with those expected if the variables were independent. In this case, the variables are Covid (Y/N) and the Full-Time Result (H/D/A).
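The table shows that draws were somewhat more frequent in the Covid = N group (92 of 358 matches, about 26%) than in the Covid = Y group (20 of 92 matches, about 22%). This hints that crowds may play some part in swinging the result one way or the other. However, the p-value of 0.7314 is far above 0.05, so the test gives no statistically significant evidence that the result depends on Covid.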
##
## A D H
## N 109 92 157
## Y 29 20 43
##
## Pearson's Chi-squared test
##
## data: EPL_Data$Covid and EPL_Data$`Full-TimeResult`
## X-squared = 0.62554, df = 2, p-value = 0.7314
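As a check on the output above, the test can be re-run from the printed counts alone. The sketch below is a minimal illustration rather than the original analysis code: it rebuilds the contingency table by hand (the object names `results`, `row_props` and `chi` are my own) and also prints the row proportions, which give the home win rate in each group.

```r
# Counts copied from the contingency table above
# (rows: Covid flag, columns: full-time result).
results <- matrix(
  c(109, 92, 157,
     29, 20,  43),
  nrow = 2, byrow = TRUE,
  dimnames = list(Covid = c("N", "Y"), Result = c("A", "D", "H"))
)

# Share of away wins, draws and home wins within each Covid group.
row_props <- prop.table(results, margin = 1)
print(round(row_props, 3))

# Pearson's chi-squared test of independence on the 2 x 3 table.
chi <- chisq.test(results)
print(chi)
```

The `H` column of `row_props` is the home win rate per group, one of the two factors this paper set out to explore.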
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.50558659 0.06692323 22.4972180 1.470938e-75
## CovidY 0.03789167 0.14800939 0.2560085 7.980617e-01
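The coefficient table above looks like the summary of a linear regression of HomeGoals on Covid; this is an assumption, since the call itself is not shown. With a single two-level predictor, the intercept is the mean of group N and the CovidY coefficient is the difference between the two group means, so the table can be approximately rebuilt from the summary statistics. The sketch below does this, assuming the group sizes equal the row totals of the contingency table (358 and 92) and that no HomeGoals values are missing.

```r
# Summary statistics copied from the Home Goals table above.
mean_n <- 1.505587; sd_n <- 1.224732
mean_y <- 1.543478; sd_y <- 1.417419

# Assumed group sizes: row totals of the contingency table.
n_n <- 109 + 92 + 157   # 358
n_y <- 29 + 20 + 43     # 92

# Regression on a two-level factor: intercept = baseline mean,
# slope = difference in means.
intercept <- mean_n
slope <- mean_y - mean_n

# The regression uses one pooled residual variance for both groups.
pooled_var <- ((n_n - 1) * sd_n^2 + (n_y - 1) * sd_y^2) / (n_n + n_y - 2)
se_intercept <- sqrt(pooled_var / n_n)
se_slope <- sqrt(pooled_var * (1 / n_n + 1 / n_y))

t_slope <- slope / se_slope
p_slope <- 2 * pt(-abs(t_slope), df = n_n + n_y - 2)

print(c(intercept = intercept, slope = slope, se_slope = se_slope,
        t = t_slope, p = p_slope))
```

Because the regression pools the variance across both groups, its t value differs slightly from a Welch test on the same data, which estimates each group's variance separately.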
The null hypothesis is shown as
H0: μ1 = μ2
The alternative hypothesis is shown as
HA: μ1 ≠ μ2
The two tests below determine whether we can reject the null hypothesis that fans have no effect on the mean number of goals scored, for both home and away teams. In neither test is the p-value below the 0.05 threshold, so neither shows a significant difference.
##
## Welch Two Sample t-test
##
## data: HomeGoals by Covid
## t = -0.23487, df = 128.07, p-value = 0.8147
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -0.3571108 0.2813275
## sample estimates:
## mean in group N mean in group Y
## 1.505587 1.543478
##
## Welch Two Sample t-test
##
## data: AwayGoals by Covid
## t = 0.13658, df = 139.98, p-value = 0.8916
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -0.2536645 0.2913133
## sample estimates:
## mean in group N mean in group Y
## 1.192737 1.173913
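The `data: HomeGoals by Covid` header suggests the tests were run with the formula interface, along the lines of `t.test(HomeGoals ~ Covid, data = EPL_Data)`. As a worked check that does not need the data file, the sketch below recomputes both Welch tests from the summary tables. The helper `welch_from_summary` is a hypothetical name of my own, and the group sizes (358 and 92) are assumed from the row totals of the contingency table, with no missing goal values.

```r
# Welch two-sample t-test computed from group means, SDs and sizes.
welch_from_summary <- function(m1, s1, n1, m2, s2, n2, conf = 0.95) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  se <- sqrt(v1 + v2)
  t_stat <- (m1 - m2) / se
  # Welch-Satterthwaite approximation of the degrees of freedom.
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p <- 2 * pt(-abs(t_stat), df)        # two-sided p-value
  half <- qt(1 - (1 - conf) / 2, df) * se
  list(t = t_stat, df = df, p.value = p,
       conf.int = (m1 - m2) + c(-1, 1) * half)
}

# Group N first, group Y second, matching the order in the output above.
home <- welch_from_summary(1.505587, 1.224732, 358, 1.543478, 1.417419, 92)
away <- welch_from_summary(1.192737, 1.166408, 358, 1.173913, 1.182368, 92)

str(home)
str(away)
```

Both confidence intervals contain zero, which is another way of seeing that neither difference in means is significant at the 5% level.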
A two-sample t-test was used to test whether fans in stadiums had a significant impact on a home team’s performance in the English Premier League. Even though the sample means differ slightly between the two groups, the difference was not large enough to reject the null hypothesis.
From this we can infer that fans play, at most, a very minor part in a team’s performance. When one takes into consideration other variables such as home ground, travel and weather, it can easily be interpreted that fans would not be a major factor in determining whether a team wins or loses.
I believe that with further analysis, i.e. additional variables and simple and multiple linear regression, one could determine which variable has the greatest impact on team performance.
Football Data UK (2021) https://www.football-data.co.uk/englandm.php
BetGPS (2021) https://betgps.com/betting/betting-data-excel-workbook/