Introduction

Since time immemorial, fans, supporters and admirers of sport have followed their chosen teams. Some follow via streaming devices, others receive updates from friends and family and, lastly but perhaps most importantly, a special few observe their team in the stadium. Fans and sport are intrinsically linked. Often called the 12th man on the field, the crowd champions its favourite players with songs, and when the opposition has possession of the ball the switch is flicked and raucous booing can ensue, creating a hostile environment for the opposing team. Sport and fans form a symbiotic relationship, and the tribalism that sport inspires depends on fan participation.

Over the last eighteen months, supporters have not been permitted to attend matches in stadiums due to the Covid-19 outbreak, and they have only recently been allowed back into the arena.

Problem Statement

Utilising various statistical methodologies such as distributions, probability testing, regression analysis and estimating uncertainty, this paper will attempt to determine whether there has been a significant difference in a team’s performance when fans have been allowed in stadiums compared to when the home team plays without its fan base.

The factors explored are:

Change in home win rate
Change in goals scored

Data

The data was retrieved initially from Kaggle and then enriched and validated against Bet365, a betting agency based in the UK that provided all the necessary metrics and statistics for each Premier League game.
The key to the results data is shown below for reference:

HomeTeam = The team playing at home

AwayTeam = The team playing away from home

HomeGoals = Full Time Home Team Goals

AwayGoals = Full Time Away Team Goals

Full-TimeResult = Full Time Result (H=Home Win, D=Draw, A=Away Win)

Covid = Whether the stadiums had home fans Y/N

To achieve my goal, I have installed and loaded the dplyr package to summarise my data in tables and to write code using the pipe operator %>%. I have also loaded ggplot2 to plot my data in a visual format.

A Welch two-sample t-test will be conducted to see whether there is a statistically significant difference in team performance between matches with and without crowds at stadiums. Additionally, I will utilise various descriptive statistics and plots to visualise the data and showcase the difference between having crowds at stadiums and not.

Descriptive Statistics and Visualisation

The tables below show the mean, median and SD (standard deviation) of both Home Goals and Away Goals. Additionally, I have calculated the first and third quartiles, the interquartile range, and the maximum and minimum values in the sample. Furthermore, I have grouped the results by Covid to show whether there is any discernible difference from an initial analysis. I have excluded missing values from each calculation using the “na.rm = TRUE” argument.
Once the summary tables were computed, I used visualisation techniques such as box plots and histograms to show the distribution of the summarised data. What is evident early on is that the two groups are very similar: mean home goals are 1.51 for Covid = N and 1.54 for Covid = Y, with standard deviations of 1.22 and 1.42 respectively, so the summary statistics alone give little support to the idea that crowds change a home team’s chance of scoring. An additional note is that the highest home scores (up to 8 goals) appear as outliers in the Covid = N group, which may hint at a greater chance of scoring “many” goals in that group; however, there is no significant evidence to back up this hypothesis.

# readr supplies read_csv(); dplyr supplies %>%, filter(), group_by() and summarise()
library(readr)
library(dplyr)
library(ggplot2)

EPL_Data <- read_csv("EPL Data.csv")

View(EPL_Data)
N <- EPL_Data %>% filter(Covid == "N")
Y <- EPL_Data %>% filter(Covid == "Y")

# Summary statistics for home goals, grouped by Covid
table_1 <- EPL_Data %>%
  group_by(Covid) %>%
  summarise(
    Mean = mean(HomeGoals, na.rm = TRUE),
    Median = median(HomeGoals, na.rm = TRUE),
    SD = sd(HomeGoals, na.rm = TRUE),
    Q1 = quantile(HomeGoals, probs = 0.25, na.rm = TRUE),
    Q3 = quantile(HomeGoals, probs = 0.75, na.rm = TRUE),
    IQR = IQR(HomeGoals, na.rm = TRUE),
    Maximum = max(HomeGoals, na.rm = TRUE),
    Minimum = min(HomeGoals, na.rm = TRUE)
  )

# Summary statistics for away goals, grouped by Covid
table_2 <- EPL_Data %>%
  group_by(Covid) %>%
  summarise(
    Mean = mean(AwayGoals, na.rm = TRUE),
    Median = median(AwayGoals, na.rm = TRUE),
    SD = sd(AwayGoals, na.rm = TRUE),
    Q1 = quantile(AwayGoals, probs = 0.25, na.rm = TRUE),
    Q3 = quantile(AwayGoals, probs = 0.75, na.rm = TRUE),
    IQR = IQR(AwayGoals, na.rm = TRUE),
    Maximum = max(AwayGoals, na.rm = TRUE),
    Minimum = min(AwayGoals, na.rm = TRUE)
  )


knitr::kable(table_1)

Covid	Mean	Median	SD	Q1	Q3	IQR	Maximum	Minimum
N	1.505587	1	1.224732	1	2	1	8	0
Y	1.543478	1	1.417419	0	2	2	5	0

knitr::kable(table_2)

Covid	Mean	Median	SD	Q1	Q3	IQR	Maximum	Minimum
N	1.192737	1	1.166408	0	2	2	9	0
Y	1.173913	1	1.182368	0	2	2	5	0

Descriptive Statistics and Visualisation Cont.

# Home goals by Covid; the red point marks the group mean
BoxPlot <- ggplot(data = EPL_Data, aes(x = Covid, y = HomeGoals)) +
  geom_boxplot(aes(fill = Covid))

BoxPlot +
  labs(title = "Covid Effect on Team Performance", x = "Covid", y = "Home Goals") +
  stat_summary(fun = mean, colour = "red", geom = "point")

# Away goals by Covid; the red point marks the group mean
BoxPlot <- ggplot(data = EPL_Data, aes(x = Covid, y = AwayGoals)) +
  geom_boxplot(aes(fill = Covid))

BoxPlot +
  labs(title = "Covid Effect on Team Performance", x = "Covid", y = "Away Goals") +
  stat_summary(fun = mean, colour = "red", geom = "point")

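Plot1 <- ggplot(data = EPL_Data, aes(x = HomeGoals))

# Per-group means, so each panel shows its own mean rather than the overall one
Group_Means <- EPL_Data %>% group_by(Covid) %>% summarise(MeanGoals = mean(HomeGoals, na.rm = TRUE))

Plot1 +
  geom_histogram(binwidth = 1, fill = "lightblue", colour = "black") +
  geom_vline(data = Group_Means, aes(xintercept = MeanGoals), colour = "red", linewidth = 1) +
  facet_wrap(~Covid)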

Descriptive Statistics and Visualisation Cont.

Below is the result of a Chi-Squared Test of independence. The test was used to determine whether there is an association between two categorical variables. In this case, the variables were Covid (Y/N) and the Full-Time Result (H/D/A).
The data shows that draws made up 25.7% of matches in the Covid = N group and 21.7% in the Covid = Y group, while home wins made up 43.9% and 46.7% respectively. These differences are small, and the test result (p-value = 0.7314) gives no evidence of an association, so from this table alone we cannot infer that crowds swing the result one way or the other.

table(EPL_Data$Covid, EPL_Data$`Full-TimeResult`)

##    
##       A   D   H
##   N 109  92 157
##   Y  29  20  43
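The home win rate named in the problem statement can be read directly off this table. The following is a minimal sketch in base R that re-enters the counts printed above by hand (the object names result_counts, result_share and home_win_rate are illustrative and not part of the original analysis) and converts each row to proportions.

```r
# Counts copied from the contingency table printed above
result_counts <- matrix(
  c(109, 92, 157,
     29, 20,  43),
  nrow = 2, byrow = TRUE,
  dimnames = list(Covid = c("N", "Y"), Result = c("A", "D", "H"))
)

# Share of each result within each Covid group (each row sums to 1)
result_share <- prop.table(result_counts, margin = 1)

# Home win rate per group is the "H" column of the row proportions
home_win_rate <- result_share[, "H"]

# Display the shares as percentages rounded to one decimal place
print(round(100 * result_share, 1))
```

Under these counts the home win rate is about 43.9% for Covid = N and 46.7% for Covid = Y, a gap of under three percentage points.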

chisq <- chisq.test(EPL_Data$Covid,EPL_Data$`Full-TimeResult`,correct = FALSE)

chisq

## 
##  Pearson's Chi-squared test
## 
## data:  EPL_Data$Covid and EPL_Data$`Full-TimeResult`
## X-squared = 0.62554, df = 2, p-value = 0.7314

# corrplot() comes from the corrplot package, which must be installed
corrplot::corrplot(chisq$residuals, is.cor = FALSE)
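As a cross-check, the statistic reported above can be reproduced by hand. This is a sketch in base R, assuming only the counts from the contingency table; the names observed, expected, chi_sq, deg_free and pearson_residuals are illustrative. Expected counts come from the row and column totals, and the statistic is the sum of (observed - expected)^2 / expected over all six cells.

```r
# Observed counts copied from the contingency table
observed <- matrix(
  c(109, 92, 157,
     29, 20,  43),
  nrow = 2, byrow = TRUE,
  dimnames = list(Covid = c("N", "Y"), Result = c("A", "D", "H"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic without continuity correction
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

# Pearson residuals, the quantity stored in chisq$residuals
pearson_residuals <- (observed - expected) / sqrt(expected)
```

The values of chi_sq and p_value should agree with the X-squared and p-value lines in the output above, up to rounding.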

model <- lm(`HomeGoals`~ Covid, data = EPL_Data)
summary(model)$coef

##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 1.50558659 0.06692323 22.4972180 1.470938e-75
## CovidY      0.03789167 0.14800939  0.2560085 7.980617e-01
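With a single two-level predictor, the regression above restates the comparison of group means: the intercept is the mean for Covid = N, and the CovidY coefficient is the difference between the two group means (1.50558659 + 0.03789167 = 1.54347826, the mean for Covid = Y). The sketch below illustrates this on a small made-up data frame; the values in toy are hypothetical and are not taken from the EPL data.

```r
# Hypothetical toy data, NOT taken from the EPL data set
toy <- data.frame(
  HomeGoals = c(1, 2, 0, 3, 2, 4, 1, 1),
  Covid     = factor(c("N", "N", "N", "N", "Y", "Y", "Y", "Y"))
)

# Regress goals on the two-level factor, as in the model above
fit <- lm(HomeGoals ~ Covid, data = toy)

# Group means computed directly
group_means <- tapply(toy$HomeGoals, toy$Covid, mean)

# The intercept equals the mean of the reference group (N)
intercept <- unname(coef(fit)["(Intercept)"])

# The CovidY coefficient equals mean(Y) - mean(N)
slope <- unname(coef(fit)["CovidY"])
diff_means <- unname(group_means["Y"] - group_means["N"])
```

Note that lm() assumes equal variances in the two groups, which is why its t value (0.256) differs slightly in size from the Welch statistic reported in the hypothesis tests.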

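EPL_Data$Covid <- as.factor(EPL_Data$Covid)

# Full-TimeResult holds the labels H/D/A, so as.numeric() would turn every value into NA; keep it as a factor
EPL_Data$`Full-TimeResult` <- factor(EPL_Data$`Full-TimeResult`, levels = c("H", "D", "A"))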

Hypothesis Testing

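The null hypothesis is that the mean number of goals is the same in both groups:

H0: μ1 = μ2

The alternative hypothesis is that the means differ:

HA: μ1 ≠ μ2

Here μ1 and μ2 are the mean goals per match for the Covid = N and Covid = Y groups respectively. The two Welch t-tests below, one for home goals and one for away goals, determine whether we can reject the null hypothesis that Covid status has no effect on goals scored. Neither p-value falls below the 0.05 significance level (0.8147 for home goals and 0.8916 for away goals), so neither difference is statistically significant.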

t.test(
  HomeGoals~Covid,
  data = EPL_Data,
  var.equal = FALSE,
  alternative = "two.sided"
  )

## 
##  Welch Two Sample t-test
## 
## data:  HomeGoals by Covid
## t = -0.23487, df = 128.07, p-value = 0.8147
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
##  -0.3571108  0.2813275
## sample estimates:
## mean in group N mean in group Y 
##        1.505587        1.543478

t.test(
  AwayGoals~Covid,
  data = EPL_Data,
  var.equal = FALSE,
  alternative = "two.sided"
  )

## 
##  Welch Two Sample t-test
## 
## data:  AwayGoals by Covid
## t = 0.13658, df = 139.98, p-value = 0.8916
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
##  -0.2536645  0.2913133
## sample estimates:
## mean in group N mean in group Y 
##        1.192737        1.173913
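The home-goals result can be reproduced from the summary table alone. The sketch below is a worked check in base R rather than part of the original analysis. It assumes group sizes of 358 and 92, taken from the row totals of the contingency table and assuming no missing goal values; all object names are illustrative. It applies the Welch formulas for the standard error, the test statistic and the Welch-Satterthwaite degrees of freedom.

```r
# Summary values copied from the home-goals table; group sizes are the
# row totals of the contingency table (assumes no missing goal values)
mean_n <- 1.505587; sd_n <- 1.224732; n_n <- 358
mean_y <- 1.543478; sd_y <- 1.417419; n_y <- 92

# Per-group variance of the sample mean
var_term_n <- sd_n^2 / n_n
var_term_y <- sd_y^2 / n_y

# Standard error of the difference in means (variances not pooled)
se_diff <- sqrt(var_term_n + var_term_y)

# Welch t statistic for group N minus group Y
t_stat <- (mean_n - mean_y) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (var_term_n + var_term_y)^2 /
  (var_term_n^2 / (n_n - 1) + var_term_y^2 / (n_y - 1))

# Two-sided p-value and 95 percent confidence interval
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
conf_int <- (mean_n - mean_y) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff
```

These values should match the t, df, p-value and confidence interval printed by the first t.test() call, up to rounding of the inputs. The interval comfortably contains zero, which is another way of seeing that the null hypothesis cannot be rejected.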

Discussion

A two-sample t-test was used to test whether having fans in stadiums had a significant impact on a home team’s performance in the English Premier League. Although the group means differed slightly, the difference was not large enough to reject the null hypothesis.
From this we can infer that, in this sample, fans played at most a very minor part in a team’s performance. When one takes into consideration other variables such as home ground, travel and weather, it can be interpreted that fans would not be a major factor in determining whether a team wins or loses.
I believe that with further analysis, i.e. additional variables and simple and multiple linear regression, one could determine which variables have the greatest impact on team performance.

MATH1324 Introduction to Statistics Assignment 2

Has Covid-19 had a visible effect on Team Performance in the English Premier League?
