
Problem Statement

In a 2001 football match between Australia and American Samoa, Australia scored 31 goals and American Samoa scored none. Although the total in that match was extraordinarily high, most football matches produce far fewer goals. This raises the question: what distribution do total goals follow across a ninety-minute football match? To address it, a theoretical Poisson distribution was formed using the mean number of goals per match as its parameter. It was then compared with the empirical distribution of total goals per game from a sample of football matches in the top four English leagues. In addition, the probabilities of zero, two, three, four and more than five goals were compared between the empirical and Poisson distributions.

The Poisson distribution was chosen over other theoretical distributions because it suits the study context. Goals are counted over a fixed interval (one football match of ninety minutes), goals cannot happen simultaneously, and goals in one football match are independent of goals in another. One assumption to treat with caution is the independence of goals within a single ninety-minute match. For the purpose of this study, goals within one match will be considered independent.

\(P(k) = e^{-\lambda} \frac{\lambda^k} {k!}\)
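
As a quick sanity check of the formula, the sketch below evaluates it directly in base R and compares the result with the built-in `dpois()`. The rate of 2.88 is the sample mean reported later in this study; the variable names are illustrative only.

```r
# Evaluate the Poisson PMF by hand and compare with R's built-in dpois().
lambda <- 2.88   # rate: the sample mean reported in this study
k <- 0:5         # goal counts to evaluate

manual_pmf  <- exp(-lambda) * lambda^k / factorial(k)
builtin_pmf <- dpois(k, lambda)

# TRUE when the two calculations agree up to floating-point error
pmf_matches <- isTRUE(all.equal(manual_pmf, builtin_pmf))
print(pmf_matches)
print(round(manual_pmf, 3))
```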

Load Packages

#install.packages("devtools")
knitr::opts_chunk$set(echo = TRUE)
library(devtools)
#install_github("jalapic/engsoccerdata")
library(engsoccerdata)
library(dplyr)

Data

The dataset was obtained from GitHub. The ‘devtools’ package was installed so that the ‘engsoccerdata’ package could be installed from its repository. Engsoccerdata contains numerous sets of data. The set chosen for this study was ‘england’ (James P. Curley (2016). engsoccerdata: English Soccer Data 1871-2016. R package version 0.1.5). The football leagues observed in the ‘england’ dataset are the English Premier League, Football League Championship, Football League One and Football League Two. The data span 192004 football matches from 1888 to 2016. The dataset records numerous variables, including date, stadium, home team goals, away team goals and total goals. This study is interested in the number of goals per match, so only the sum of goals scored per match is analysed, using the ‘totgoal’ variable. No manipulation of the data was required, and a search for missing values in ‘totgoal’ did not return any results.

new <- subset(england, is.na(totgoal)) # Collect rows where totgoal is missing; is.na() is required because comparing with the string "NA" never matches a missing value
nrow(new) # Number of rows with a missing totgoal; no missing observations were found
[1] 0

The number of rows in the created variable new is zero, meaning there are no missing observations in the totgoal variable.
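
A caution on this kind of check: in R, comparing a value with the string "NA" never matches a genuinely missing value, whereas `is.na()` tests for missingness directly. The toy vector below is hypothetical and is not taken from the england data; it only illustrates the difference.

```r
# Toy vector with one genuinely missing value (hypothetical data)
goals <- c(2, NA, 0, 3)

# Comparing with the string "NA" yields FALSE or NA, never TRUE,
# so subset() keeps no rows even though a value is missing
n_string_match <- nrow(subset(data.frame(goals), goals == "NA"))

# is.na() tests for missingness directly
n_missing <- sum(is.na(goals))

print(c(string_match = n_string_match, is_na = n_missing))
```

A count of zero from the string comparison therefore says nothing about missing data, while a count of zero from `is.na()` does.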

Distribution Fitting

The following calculates the mean of the empirical distribution, which is used as the parameter of the Poisson distribution.

TG <- england$totgoal #Assign variable of interest to a more accessible name
mean(TG) %>% round(3) #Calculate empirical mean
[1] 2.877

The sample mean is 2.88 goals per game.
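
Using the sample mean as the Poisson rate is the standard choice: for independent Poisson counts \(x_1, \dots, x_n\), the sample mean is the maximum-likelihood estimate of \(\lambda\). A short derivation from the PMF given in the problem statement, where \(\ell\) denotes the log-likelihood:

\(\ell(\lambda) = \sum_{i=1}^{n} \log\left(e^{-\lambda} \frac{\lambda^{x_i}}{x_i!}\right) = -n\lambda + \log(\lambda) \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \log(x_i!)\)

\(\frac{d\ell}{d\lambda} = -n + \frac{1}{\lambda} \sum_{i=1}^{n} x_i = 0 \quad \Rightarrow \quad \hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}\)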

The following code overlays a Poisson distribution on the histogram of total goals, using the mean number of goals per match as the rate: \(\lambda\) = 2.88

# Set Poisson parameters
lambda <- 2.88
mu <- lambda
# Set sequence of x values to plot: mean +/- 4 standard deviations, clipped at zero
Events <- seq(max(0, round(mu - 4 * sqrt(mu))), round(mu + 4 * sqrt(mu)))
# Calculate PMF
PMF <- dpois(x = Events, lambda = mu)
# Plot the empirical histogram on the density scale, then overlay the Poisson PMF
hist(TG, freq = FALSE, right = FALSE, ylim = c(0, 0.25), xlab = "Goals", ylab = "PMF", main = "Distribution of Goals \n Poisson Distribution, Mean = 2.88")
points(Events, PMF, type = "p", col = "blue")
lines(Events, PMF, type = "l", col = "blue")

The histogram is skewed to the right, with the majority of observations between zero and five goals. As shown in the plot above, the theoretical distribution predicts the empirical distribution very closely. The Poisson distribution underpredicts games with zero goals and overpredicts games with three and four goals. However, it is a near-perfect prediction of games with one and two goals.
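
The x values in the plotting code cover the mean plus or minus four standard deviations, clipped at zero because goal counts cannot be negative. A minimal sketch of that rule as a reusable helper follows; the function name `poisson_plot_range` and its `n_sd` argument are hypothetical and not part of the original analysis.

```r
# Support to plot: mean +/- n_sd standard deviations, clipped at zero.
# For a Poisson distribution the standard deviation is sqrt(mu).
poisson_plot_range <- function(mu, n_sd = 4) {
  lower <- max(0, round(mu - n_sd * sqrt(mu)))
  upper <- round(mu + n_sd * sqrt(mu))
  seq(lower, upper)
}

events <- poisson_plot_range(2.88)
print(events)   # whole numbers from 0 to 10
```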

In the engsoccerdata sample, the majority of matches have between zero and five goals. Out of interest, the probability of six or more goals happening in a match was examined.

ppois(5,2.88,lower.tail = FALSE) %>% round(3)
[1] 0.072

Using the Poisson distribution with lambda 2.88, the probability of six or more goals happening in a match is 0.072.
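
The upper-tail call is the complement of the cumulative probability of five or fewer goals. Written out with the PMF from the problem statement and \(\lambda\) = 2.88 (intermediate value rounded to three decimals):

\(P(X \geq 6) = 1 - P(X \leq 5) = 1 - \sum_{k=0}^{5} e^{-2.88} \frac{2.88^k}{k!} \approx 1 - 0.928 = 0.072\)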

The following chart visualises this.

# Set Poisson parameters
lambda <- 2.88
mu <- lambda
# Define the range to highlight: Pr(a <= X <= b)
a <- 6
b <- 12
# Set sequence of x values to plot: mean +/- 4 standard deviations, clipped at zero
Events <- seq(max(0, round(mu - 4 * sqrt(mu))), round(mu + 4 * sqrt(mu)))
# Calculate PMF
PMF <- dpois(x = Events, lambda = mu)
# Colour points inside the highlighted range red and all others blue
highlight <- ifelse(Events >= a & Events <= b, "red", "blue")
# Plot PMF
plot(Events, PMF, type = "p", main = paste("Poisson Distribution, Mean = ", mu), col = highlight)
lines(Events, PMF, type = "h", col = highlight)

The sum of the probabilities marked in red gives the probability of a match having more than five goals (approximately 0.072; the plot stops at ten goals, beyond which the remaining probability is negligible).
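
Strictly, the plot covers only 0 to 10 goals, so the red lines sum to the probability of 6 to 10 goals rather than the whole upper tail. The sketch below, using only base R and illustrative variable names, suggests the two quantities agree to three decimals for \(\lambda\) = 2.88.

```r
lambda <- 2.88

# Probability mass on the highlighted points actually plotted (6 to 10 goals)
p_plotted <- sum(dpois(6:10, lambda))

# Full upper tail, P(X >= 6), including counts beyond the plotted range
p_tail <- ppois(5, lambda, lower.tail = FALSE)

print(round(c(plotted = p_plotted, tail = p_tail), 3))
```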

Next, a table presents the empirical probabilities, the Poisson probabilities, and the difference between them for matches with zero, two, three, four and more than five goals.

# Calculate empirical probabilities as the proportion of matches with each total
zeroG <- mean(TG == 0)
twoG <- mean(TG == 2)
threeG <- mean(TG == 3)
fourG <- mean(TG == 4)
Grt5G <- mean(TG > 5)
# Poisson probabilities for the same totals (lambda = 2.88)
Poisson_probability <- c(dpois(0, 2.88), dpois(2, 2.88), dpois(3, 2.88), dpois(4, 2.88), ppois(5, 2.88, lower.tail = FALSE)) %>% round(3)
Empirical_probability <- c(zeroG, twoG, threeG, fourG, Grt5G) %>% round(3)
Number_of_Goals <- c(0, 2, 3, 4, "5<")
# Difference between empirical and Poisson probabilities
Empirical_Poisson_Difference <- (Empirical_probability - Poisson_probability) %>% round(3)
# Combine into a data frame
df <- data.frame(Number_of_Goals, Empirical_probability, Poisson_probability, Empirical_Poisson_Difference)
# Render the table
knitr::kable(df, caption = "Empirical, Poisson & Difference")
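| Number_of_Goals | Empirical_probability | Poisson_probability | Empirical_Poisson_Difference |
|:----------------|----------------------:|--------------------:|-----------------------------:|
| 0               |                 0.072 |               0.056 |                        0.016 |
| 2               |                 0.232 |               0.233 |                       -0.001 |
| 3               |                 0.208 |               0.223 |                       -0.015 |
| 4               |                 0.153 |               0.161 |                       -0.008 |
| 5<              |                 0.085 |               0.072 |                        0.013 |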
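
The Poisson and difference columns can be checked without the full dataset. The sketch below copies the empirical probabilities from the table above and recomputes the other two columns in base R; the variable names are illustrative.

```r
lambda <- 2.88

# Empirical probabilities copied from the table (0, 2, 3, 4 and more than 5 goals)
empirical <- c(0.072, 0.232, 0.208, 0.153, 0.085)

# Poisson probabilities for the same totals
poisson <- round(c(dpois(c(0, 2, 3, 4), lambda),
                   ppois(5, lambda, lower.tail = FALSE)), 3)

difference <- round(empirical - poisson, 3)

# Position of the largest absolute gap (1 corresponds to zero goals)
largest_gap <- which.max(abs(difference))

print(data.frame(empirical, poisson, difference))
```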

Interpretation

From the output, the theoretical Poisson distribution with \(\lambda\) = 2.88 appears to fit the empirical distribution very well. The final table, stored in the df data frame, compares probabilities between the empirical and Poisson distributions at selected values of k: zero, two, three, four and more than five goals. These points indicate how accurately the Poisson distribution predicts the observed frequencies. For games with a total of two goals, the absolute difference is only 0.001. The largest absolute differences occur for zero, three and more than five goals, at 0.016, 0.015 and 0.013 respectively, followed by four goals at 0.008. Despite these differences, the Poisson distribution predicts the distribution of total goals per match very closely.

One limitation of this analysis is that the variance of the empirical distribution was not examined. A Poisson distribution has variance equal to its mean, so comparing the sample variance with the sample mean could help explain the differences between the empirical and theoretical distributions. In addition, the Poisson model assumes that goals within one ninety-minute match occur independently of one another, which may not hold in practice.
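
One simple way to examine the variance is the index of dispersion, the sample variance divided by the sample mean, which is close to 1 for Poisson data. The sketch below is only an illustration on simulated counts: `dispersion_index` is a hypothetical helper, and the simulated vector stands in for the real `totgoal` values, which are not reproduced here.

```r
# Index of dispersion: sample variance divided by sample mean.
# Values near 1 are consistent with a Poisson model; values above 1 suggest overdispersion.
dispersion_index <- function(x) var(x) / mean(x)

set.seed(1)                                     # reproducible simulation
simulated_goals <- rpois(10000, lambda = 2.88)  # hypothetical stand-in for totgoal
di <- dispersion_index(simulated_goals)

print(round(di, 2))
```

With the package data loaded, the analogous call would be `dispersion_index(england$totgoal)`.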

