MATH1324 Introduction to Statistics Assignment 2

Has Covid-19 had a visible effect on Team Performance in the English Premier League?

Elia Lanza (s3918715)

Last updated: 17 October, 2021

Introduction

Since time immemorial, fans, supporters and admirers of sport have watched and followed their chosen teams. Some follow via streaming devices, others receive updates from friends and family, and a special few are able to watch their team in a stadium. Fans and sport are intrinsically linked, and supporters are often called the 12th man on the field. Their support swells with songs cheering on favourite players, and when the opposition has possession of the ball the switch is flicked and raucous booing can ensue, creating a hostile environment for the opposing team. Sport and fans form a symbiotic relationship, and the tribalism that sport inspires depends on fan participation.

Over the last eighteen months, supporters have not been permitted to attend matches in stadiums due to the Covid-19 outbreak, and they have only recently been allowed back into the arena.

Problem Statement

Utilising various statistical methods such as distributions, hypothesis testing, regression analysis and estimating uncertainty, this paper will attempt to determine whether there has been a significant difference in a team’s performance when fans have been allowed in stadiums compared to when a home team plays without a fan base.

The factors that will be explored are:

Data

HomeTeam = The team playing at home

AwayTeam = The team playing away from home

HomeGoals = Full Time Home Team Goals

AwayGoals = Full Time Away Team Goals

Full-TimeResult = Full Time Result (H=Home Win, D=Draw, A=Away Win)

Covid = Whether the stadiums had home fans Y/N

To achieve my goal, I have used the dplyr package to summarise my data in tables and to chain code with the pipe operator %>%. I have also used ggplot2 to plot my data in a visual format.

A t-test will be conducted to see whether there is a statistically significant difference in team performance between matches with and without crowds at stadiums. Additionally, I will utilise various descriptive statistics and plots to visualise my data, showcasing the difference between having crowds at stadiums and not.

Descriptive Statistics and Visualisation

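library(readr)     # read_csv()
library(dplyr)     # %>%, filter(), group_by(), summarise()
library(ggplot2)   # plots
library(corrplot)  # corrplot()

EPL_Data <- read_csv("EPL Data.csv")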

View(EPL_Data)
N <- EPL_Data%>%filter(Covid == "N")
Y <- EPL_Data%>%filter(Covid == "Y")

# Summary statistics for home goals, split by the Covid indicator
table_1 <- EPL_Data %>%
  group_by(Covid) %>%
  summarise(
    Mean    = mean(HomeGoals, na.rm = TRUE),
    Median  = median(HomeGoals, na.rm = TRUE),
    SD      = sd(HomeGoals, na.rm = TRUE),
    Q1      = quantile(HomeGoals, probs = 0.25, na.rm = TRUE),
    Q3      = quantile(HomeGoals, probs = 0.75, na.rm = TRUE),
    IQR     = IQR(HomeGoals, na.rm = TRUE),
    Maximum = max(HomeGoals, na.rm = TRUE),
    Minimum = min(HomeGoals, na.rm = TRUE)
  )



# Summary statistics for away goals, split by the Covid indicator
table_2 <- EPL_Data %>%
  group_by(Covid) %>%
  summarise(
    Mean    = mean(AwayGoals, na.rm = TRUE),
    Median  = median(AwayGoals, na.rm = TRUE),
    SD      = sd(AwayGoals, na.rm = TRUE),
    Q1      = quantile(AwayGoals, probs = 0.25, na.rm = TRUE),
    Q3      = quantile(AwayGoals, probs = 0.75, na.rm = TRUE),
    IQR     = IQR(AwayGoals, na.rm = TRUE),
    Maximum = max(AwayGoals, na.rm = TRUE),
    Minimum = min(AwayGoals, na.rm = TRUE)
  )


knitr::kable(table_1)
Covid  Mean      Median  SD        Q1  Q3  IQR  Maximum  Minimum
N      1.505587  1       1.224732  1   2   1    8        0
Y      1.543478  1       1.417419  0   2   2    5        0
knitr::kable(table_2)
Covid  Mean      Median  SD        Q1  Q3  IQR  Maximum  Minimum
N      1.192737  1       1.166408  0   2   2    9        0
Y      1.173913  1       1.182368  0   2   2    5        0

Descriptive Statistics and Visualisation Cont.

BoxPlot <-ggplot(data = EPL_Data, aes(x=Covid, y =HomeGoals)) + geom_boxplot(aes(fill=Covid))

BoxPlot+labs(title = "Covid Effect on Team Performance", x = "Covid", y= "Home Goals")+stat_summary(fun = mean, colour = "red",geom = "point")

BoxPlot <-ggplot(data = EPL_Data, aes(x=Covid, y =AwayGoals)) + geom_boxplot(aes(fill=Covid))

BoxPlot+labs(title = "Covid Effect on Team Performance", x = "Covid", y= "Away Goals")+stat_summary(fun = mean, colour = "red",geom = "point")

Plot1 <- ggplot(data = EPL_Data, aes(x = HomeGoals))

Plot1 + geom_histogram(fill = "Light Blue", colour = "black")  + geom_vline(aes(xintercept = mean(HomeGoals)),col='red',size=1)+facet_wrap(~Covid)

Descriptive Statistics and Visualisation Cont.

table(EPL_Data$Covid, EPL_Data$`Full-TimeResult`)
##    
##       A   D   H
##   N 109  92 157
##   Y  29  20  43
chisq <- chisq.test(EPL_Data$Covid,EPL_Data$`Full-TimeResult`,correct = FALSE)

chisq
## 
##  Pearson's Chi-squared test
## 
## data:  EPL_Data$Covid and EPL_Data$`Full-TimeResult`
## X-squared = 0.62554, df = 2, p-value = 0.7314
corrplot(chisq$residuals, is.cor = FALSE)
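
As a check on the chi-squared result above, the statistic can be rebuilt by hand from the printed contingency table. This is a minimal sketch that retypes the six observed counts instead of reading the CSV; the object names (observed, expected, chi_sq, p_value) are illustrative and are not part of the original analysis.

```r
# Observed counts copied from the contingency table printed above
observed <- matrix(
  c(109, 92, 157,
     29, 20,  43),
  nrow = 2, byrow = TRUE,
  dimnames = list(Covid = c("N", "Y"), Result = c("A", "D", "H"))
)

# Expected count per cell under independence:
# row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic, without continuity correction
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(c(chi_sq = chi_sq, df = df, p_value = p_value), 4)
```

The values should agree with the chisq.test() output (X-squared = 0.62554, df = 2, p-value = 0.7314) up to rounding, which shows that the test only depends on how far the observed counts sit from the counts expected if result and Covid status were unrelated.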

model <- lm(`HomeGoals`~ Covid, data = EPL_Data)
summary(model)$coef
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 1.50558659 0.06692323 22.4972180 1.470938e-75
## CovidY      0.03789167 0.14800939  0.2560085 7.980617e-01
EPL_Data$Covid <- as.factor(EPL_Data$Covid)

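# Full-TimeResult holds the letters H, D and A, so as.numeric() would turn
# every value into NA; store it as a factor instead
EPL_Data$`Full-TimeResult` <- as.factor(EPL_Data$`Full-TimeResult`)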
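
With a single two-level predictor, the CovidY coefficient in the regression above is the difference between the two group means, and its t value is the equal-variance (pooled) two-sample t statistic. The sketch below rebuilds both from the summary statistics in table_1. It assumes no missing goal values, so that the group sizes equal the row totals of the results table (358 for N and 92 for Y); all object names are illustrative.

```r
# Summary statistics for HomeGoals, copied from table_1
m_N <- 1.505587; s_N <- 1.224732; n_N <- 358  # n assumed from table row total
m_Y <- 1.543478; s_Y <- 1.417419; n_Y <- 92   # n assumed from table row total

# Slope for CovidY = mean of group Y minus mean of group N
slope <- m_Y - m_N

# Pooled variance across both groups (equal-variance assumption)
pooled_var <- ((n_N - 1) * s_N^2 + (n_Y - 1) * s_Y^2) / (n_N + n_Y - 2)

# Standard error of the difference in means under the pooled variance
se_slope <- sqrt(pooled_var * (1 / n_N + 1 / n_Y))

# t statistic and two-sided p-value on n_N + n_Y - 2 degrees of freedom
t_value <- slope / se_slope
p_value_lm <- 2 * pt(-abs(t_value), df = n_N + n_Y - 2)

round(c(slope = slope, se = se_slope, t = t_value, p = p_value_lm), 4)
```

These figures should match the CovidY row of the coefficient table (estimate 0.0379, standard error 0.1480, t value 0.2560, p-value 0.7981) up to rounding of the inputs.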

Hypothesis Testing

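Let μ1 be the mean number of goals per match in group N and μ2 the mean in group Y. The null hypothesis is shown as

H0: μ1 = μ2

The alternate hypothesis is shown as

HA: μ1 ≠ μ2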

The two tests below determine whether we can reject the null hypothesis that mean goals are the same with and without fans in stadiums, for home and away teams separately. Neither p-value (0.8147 for home goals and 0.8916 for away goals) is below the 0.05 significance level, so we fail to reject the null hypothesis in both cases.

t.test(
  HomeGoals~Covid,
  data = EPL_Data,
  var.equal = FALSE,
  alternative = "two.sided"
  ) 
## 
##  Welch Two Sample t-test
## 
## data:  HomeGoals by Covid
## t = -0.23487, df = 128.07, p-value = 0.8147
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
##  -0.3571108  0.2813275
## sample estimates:
## mean in group N mean in group Y 
##        1.505587        1.543478
t.test(
  AwayGoals~Covid,
  data = EPL_Data,
  var.equal = FALSE,
  alternative = "two.sided"
  ) 
## 
##  Welch Two Sample t-test
## 
## data:  AwayGoals by Covid
## t = 0.13658, df = 139.98, p-value = 0.8916
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
##  -0.2536645  0.2913133
## sample estimates:
## mean in group N mean in group Y 
##        1.192737        1.173913
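
To make the Welch test for home goals less of a black box, its t statistic and degrees of freedom can be recomputed from the group means and standard deviations in table_1. This is a rough sketch that assumes no missing goal values, so that the group sizes equal the row totals of the results table (358 for N and 92 for Y); the object names are illustrative.

```r
# Summary statistics for HomeGoals, copied from table_1
m_N <- 1.505587; s_N <- 1.224732; n_N <- 358  # n assumed from table row total
m_Y <- 1.543478; s_Y <- 1.417419; n_Y <- 92   # n assumed from table row total

# Squared standard error of each group mean
v_N <- s_N^2 / n_N
v_Y <- s_Y^2 / n_Y

# Welch t statistic: difference in means over its unpooled standard error
t_stat <- (m_N - m_Y) / sqrt(v_N + v_Y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_N + v_Y)^2 / (v_N^2 / (n_N - 1) + v_Y^2 / (n_Y - 1))

# Two-sided p-value from the t distribution
p_welch <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_welch), 4)
```

The result should agree with the first t.test() output (t = -0.23487, df = 128.07, p-value = 0.8147) up to rounding. Swapping in the AwayGoals values from table_2 should reproduce the second test in the same way.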

Discussion

References