Objective of This Research

This research analyzes stadium utilization for all 20 clubs in the English Premier League across every match of the 2018/19 season. In particular, the project tests how well different statistical models predict stadium utilization from home and away points per game and total goals scored. Linear regression, logistic regression, and several generalized additive models were fitted and produced interesting results, but what can be concluded from them is limited by the small number of observations.

Data

The data used in this research come from <whoscored.com>. We focus on three variables: home points per game, away points per game, and total goals scored.

data <- read.csv(file = "~/Desktop/612 data.csv")
attach(data)
summary(data)
##    attendance     capacity      utilisation                 home_team_name
##  10792  :  2   Min.   :11329   0.99   : 86   AFC Bournemouth       : 19   
##  24263  :  2   1st Qu.:25960   0.98   : 52   Arsenal               : 19   
##  40491  :  2   Median :33280   1      : 35   Brighton & Hove Albion: 19   
##  52908  :  2   Mean   :42086   0.97   : 32   Burnley               : 19   
##  74519  :  2   3rd Qu.:55556   0.95   : 29   Cardiff City          : 19   
##  74523  :  2   Max.   :90000   0.92   : 19   Chelsea               : 19   
##  (Other):368                   (Other):127   (Other)               :266   
##                 away_team_name    home_ID         Away_ID        home_ppg    
##  AFC Bournemouth       : 19    Min.   : 1.00   Min.   : 1.0   Min.   :0.470  
##  Arsenal               : 19    1st Qu.: 5.75   1st Qu.: 6.5   1st Qu.:1.210  
##  Brighton & Hove Albion: 19    Median :10.50   Median :11.5   Median :1.475  
##  Burnley               : 19    Mean   :10.50   Mean   :11.0   Mean   :1.615  
##  Cardiff City          : 19    3rd Qu.:15.25   3rd Qu.:16.0   3rd Qu.:1.917  
##  Chelsea               : 19    Max.   :20.00   Max.   :20.0   Max.   :2.840  
##  (Other)               :266                                                  
##     away_ppg     home_team_goal_count away_team_goal_count total_goal_count
##  Min.   :0.260   Min.   :0.000        Min.   :0.000        Min.   :0.000   
##  1st Qu.:0.840   1st Qu.:1.000        1st Qu.:0.000        1st Qu.:2.000   
##  Median :1.160   Median :1.000        Median :1.000        Median :3.000   
##  Mean   :1.198   Mean   :1.568        Mean   :1.253        Mean   :2.821   
##  3rd Qu.:1.542   3rd Qu.:2.000        3rd Qu.:2.000        3rd Qu.:4.000   
##  Max.   :2.320   Max.   :6.000        Max.   :6.000        Max.   :8.000   
##                                                                            
##  home_team_possession away_team_possession                 stadium   
##  Min.   :23.00        Min.   :22.00        Anfield             : 19  
##  1st Qu.:43.00        1st Qu.:41.00        Cardiff City Stadium: 19  
##  Median :52.00        Median :48.00        Craven Cottage      : 19  
##  Mean   :51.51        Mean   :48.49        Emirates Stadium    : 19  
##  3rd Qu.:59.00        3rd Qu.:57.00        Etihad Stadium      : 19  
##  Max.   :78.00        Max.   :77.00        Goodison Park       : 19  
##                                            (Other)             :266
str(data)
## 'data.frame':    380 obs. of  15 variables:
##  $ attendance          : Factor w/ 373 levels "10199","10227",..: 354 261 4 30 84 75 164 155 278 336 ...
##  $ capacity            : int  74879 52338 11329 21577 25700 24500 32050 32384 54074 60704 ...
##  $ utilisation         : Factor w/ 36 levels "#VALUE!","0.32",..: 34 34 26 28 32 33 32 30 33 34 ...
##  $ home_team_name      : Factor w/ 20 levels "AFC Bournemouth",..: 14 15 1 18 9 10 20 16 12 2 ...
##  $ away_team_name      : Factor w/ 20 levels "AFC Bournemouth",..: 11 17 5 3 7 6 8 4 19 13 ...
##  $ home_ID             : int  2 15 14 12 16 19 17 11 3 5 ...
##  $ Away_ID             : int  8 16 18 20 9 4 7 13 10 1 ...
##  $ home_ppg            : num  1.89 1.32 1.53 1.42 1.11 0.47 1.79 1.21 2.79 2.37 ...
##  $ away_ppg            : num  1.32 1.74 0.74 0.68 1.53 1.58 1.05 0.89 1.11 2.32 ...
##  $ home_team_goal_count: int  2 1 2 2 0 0 2 0 4 0 ...
##  $ away_team_goal_count: int  1 2 0 0 2 3 2 0 0 2 ...
##  $ total_goal_count    : int  3 3 2 2 2 3 4 0 4 2 ...
##  $ home_team_possession: int  44 44 62 56 60 40 57 48 61 40 ...
##  $ away_team_possession: int  56 56 38 44 40 60 43 52 39 60 ...
##  $ stadium             : Factor w/ 23 levels "Anfield","Cardiff City Stadium",..: 11 13 22 21 3 17 10 14 1 4 ...
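# Build the modelling data frame by column name (works with or without attach())
df1 <- data[, c("utilisation", "home_ppg", "away_ppg", "total_goal_count")]
df1$total_goal_count <- as.numeric(df1$total_goal_count)
# Caution: utilisation is a factor (see str() above), so as.numeric() returns
# its level codes (1 to 36), not the printed rates; the results below are on that scale.
df1$utilisation <- as.numeric(df1$utilisation)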
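
If the printed utilisation rates themselves are wanted as numbers, the factor has to pass through as.character() first. A minimal sketch on a small hypothetical vector `u` (not the real column), which also shows what happens to the "#VALUE!" spreadsheet token:

```r
# Hypothetical stand-in for the utilisation column, including the
# spreadsheet error token "#VALUE!" seen in the str() output.
u <- factor(c("0.99", "0.32", "#VALUE!", "1", "0.99"))

# as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(u)

# Going through as.character() recovers the printed values; the
# non-numeric token becomes NA (coercion warning suppressed here).
values <- suppressWarnings(as.numeric(as.character(u)))

print(codes)
print(values)
```

Rows that come out as NA would then need to be dropped or corrected before modelling.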

Objective

With the given data, our research will address the following questions:

Exploratory data analysis

Correlation Charts

library(sjlabelled)
library(sjmisc)
library(sjPlot)
library(sjstats)

q = cor(df1)
sjp.corr(df1, show.legend = TRUE)
## Warning: Removed 10 rows containing missing values (geom_text).

Looking at the correlations plotted with sjPlot, we can see that Home_ppg and attendance are correlated: the sign before the number gives the direction, the magnitude is above 0.3 (medium strength), and the coefficient is flagged with **. Interestingly, Away_ppg and utilisation are also correlated, which means we can fit an lm model on these variables to learn more about the relationship.

Cowplots

The graphs for home ppg, away ppg, and total goal count look good. The utilisation graph is less clean, but it is not useless and can be worked with by centring the data.

Splitting the data 70:30

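# Split the data into training (about 70%) and test (about 30%) sets
library(caTools)
set.seed(100)
# Each row is assigned to group 1 (train) or 2 (test) with probability 0.7 / 0.3
ind <- sample(2, nrow(df1), replace = TRUE, prob = c(0.7, 0.3))
traindf <- df1[ind == 1, ]
testdf <- df1[ind == 2, ]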
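
Because sample() with prob assigns each row independently, the split above is only approximately 70:30. If an exact split is wanted, one option is to sample row indices directly. A minimal sketch on a placeholder data frame (`toy` is hypothetical and only mimics the 380 rows of the real data):

```r
set.seed(100)
toy <- data.frame(x = seq_len(380))            # placeholder with 380 rows

n_train   <- round(0.7 * nrow(toy))            # exact training-set size
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

train_toy <- toy[train_idx, , drop = FALSE]    # rows drawn for training
test_toy  <- toy[-train_idx, , drop = FALSE]   # all remaining rows

print(c(train = nrow(train_toy), test = nrow(test_toy)))
```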

Model 1: Linear Regression

library(snakecase)
fit1 <- lm(utilisation~home_ppg+away_ppg+total_goal_count, data=traindf)
sjPlot::tab_model(fit1)
                    utilisation
Predictors          Estimates   CI               p
(Intercept)         27.62       24.60 – 30.65    <0.001
home_ppg            -0.86       -2.17 – 0.45     0.198
away_ppg            2.21        0.80 – 3.62      0.002
total_goal_count    0.29        -0.17 – 0.76     0.214
Observations        267
R2 / R2 adjusted    0.043 / 0.032

Model 1 has an R squared of 0.043 (adjusted 0.032), so it explains only about 4% of the variance in utilisation, which could be much better. Of the three predictors, only away_ppg is significant (p = 0.002); home_ppg (p = 0.198) and total_goal_count (p = 0.214) are not. The intercept of 27.62 is also not acceptable for a utilisation rate, which should lie between 0 and 1; this shows that more work has to be done on the modelling and perhaps on cleaning the data.
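
The test set created earlier can be used to check how well a fitted model predicts unseen matches. The sketch below illustrates the idea with root mean squared error on held-out rows; it runs on synthetic data (`sim` and its coefficients are invented for illustration and are not the real data), reusing only the column names from df1.

```r
set.seed(1)
# Synthetic stand-in with the same column names as df1 (NOT the real data).
sim <- data.frame(
  home_ppg         = runif(200, 0.5, 2.8),
  away_ppg         = runif(200, 0.3, 2.3),
  total_goal_count = rpois(200, 2.8)
)
sim$utilisation <- 0.9 + 0.02 * sim$away_ppg + rnorm(200, sd = 0.05)

# Hold out 60 of the 200 rows for testing.
idx       <- sample(seq_len(nrow(sim)), size = 140)
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

fit_sim <- lm(utilisation ~ home_ppg + away_ppg + total_goal_count,
              data = sim_train)

# Root mean squared error on the held-out rows.
pred <- predict(fit_sim, newdata = sim_test)
rmse <- sqrt(mean((sim_test$utilisation - pred)^2))
print(rmse)
```

With the real data, the same two lines at the end would be applied to fit1 and testdf.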

Model 2: Logistic Regression

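# Convert the response to a factor for glm(). Note: with a factor response,
# family = binomial treats the first level as "failure" and every other level
# as "success", so a multi-level factor is collapsed to a two-class outcome.
traindf$utilisation <- as.factor(traindf$utilisation)
loglm <- glm(utilisation ~ home_ppg + away_ppg + total_goal_count, data = traindf, family = binomial)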
summary(loglm)
## 
## Call:
## glm(formula = utilisation ~ home_ppg + away_ppg + total_goal_count, 
##     family = binomial, data = traindf)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2722   0.0601   0.0758   0.0997   0.1705  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)        4.3212     3.6857   1.172    0.241
## home_ppg          -0.2933     1.8100  -0.162    0.871
## away_ppg           0.6281     2.0053   0.313    0.754
## total_goal_count   0.4230     0.7388   0.573    0.567
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13.171  on 266  degrees of freedom
## Residual deviance: 12.703  on 263  degrees of freedom
## AIC: 20.703
## 
## Number of Fisher Scoring iterations: 9

We see that away points per game (0.63) and total goals scored (0.42) have positive coefficients, while home points per game has a slightly negative one (-0.29). Surprisingly, the coefficients of all three variables are non-significant (p > 0.05). On the odds scale, a one-unit increase in home points per game multiplies the odds by about 0.75 (exp(-0.29)), a one-unit increase in away points per game by about 1.87, and each additional goal by about 1.53. Since the p-values are non-significant, and the evidence is weaker than in the linear regression, this interpretation should be read with caution, even where it falls in line with the history of football.
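
Logistic regression coefficients are on the log-odds scale, so they are usually exponentiated before being read as effects on the odds. A minimal sketch using the estimates printed in the summary above (with the fitted object available, exp(coef(loglm)) would give the same numbers without copying):

```r
# Coefficients copied from the summary(loglm) output (log-odds scale).
log_odds <- c(home_ppg = -0.2933, away_ppg = 0.6281, total_goal_count = 0.4230)

# Exponentiating gives odds ratios: the multiplicative change in the
# odds for a one-unit increase in each predictor.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))
```

An odds ratio below 1 indicates lower odds as the predictor increases, and one above 1 indicates higher odds.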

Model 3: Generalized Additive Model

library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.16.1
gam1<- gam(utilisation~ns(home_ppg, 4) + away_ppg+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
summary(gam1)
## 
## Call: gam(formula = utilisation ~ ns(home_ppg, 4) + away_ppg + total_goal_count, 
##     data = df1)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.015  -2.219   2.023   4.365   9.242 
## 
## (Dispersion Parameter for gaussian family taken to be 46.0598)
## 
##     Null Deviance: 18193.43 on 379 degrees of freedom
## Residual Deviance: 17180.31 on 373 degrees of freedom
## AIC: 2542.706 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                   Df  Sum Sq Mean Sq F value   Pr(>F)   
## ns(home_ppg, 4)    4   676.9  169.22  3.6739 0.005984 **
## away_ppg           1   333.5  333.46  7.2396 0.007452 **
## total_goal_count   1     2.8    2.78  0.0604 0.806055   
## Residuals        373 17180.3   46.06                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plot the prediction

par(mfrow=c(1,3)) # Show 3 panels, one for each predictor: home_ppg, away_ppg, total_goal_count
plot(gam1, se=TRUE, col="blue")  # not an impressive representation

ANOVA in GAM

gam.V1 = gam(utilisation~s(home_ppg,5)+away_ppg+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
gam.V2 = gam(utilisation~home_ppg+s(away_ppg,5)+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
gam.V3 = gam(utilisation~ns(home_ppg,4)+s(away_ppg,5)+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
anova(gam.V1, gam.V2, gam.V3, test="F")
## Analysis of Deviance Table
## 
## Model 1: utilisation ~ s(home_ppg, 5) + away_ppg + total_goal_count
## Model 2: utilisation ~ home_ppg + s(away_ppg, 5) + total_goal_count
## Model 3: utilisation ~ ns(home_ppg, 4) + s(away_ppg, 5) + total_goal_count
##   Resid. Df Resid. Dev          Df Deviance          F Pr(>F)    
## 1       372      14659                                           
## 2       372      17417 -2.6242e-05 -2757.54 2.2636e+06 <2e-16 ***
## 3       369      17130  3.0000e+00   286.37 2.0562e+00 0.1056    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam.V2)
## 
## Call: gam(formula = utilisation ~ home_ppg + s(away_ppg, 5) + total_goal_count, 
##     data = df1)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.611  -2.464   1.699   4.661   7.679 
## 
## (Dispersion Parameter for gaussian family taken to be 46.8189)
## 
##     Null Deviance: 18193.43 on 379 degrees of freedom
## Residual Deviance: 17416.62 on 372 degrees of freedom
## AIC: 2549.897 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                   Df  Sum Sq Mean Sq F value   Pr(>F)   
## home_ppg           1   390.2  390.21  8.3345 0.004117 **
## s(away_ppg, 5)     1   337.1  337.11  7.2003 0.007614 **
## total_goal_count   1     0.3    0.30  0.0064 0.936145   
## Residuals        372 17416.6   46.82                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                  Npar Df Npar F Pr(F)
## (Intercept)                          
## home_ppg                             
## s(away_ppg, 5)         4 0.2624 0.902
## total_goal_count
plot(gam.V2, se=TRUE, col="blue")

plot(gam.V3, se=TRUE, col="red")

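# We carry forward the second model, which uses a linear term for home_ppg, in preference to the other two.
# Caveat: gam.V1 and gam.V2 are not nested, and gam.V1 has the lowest residual deviance (14659),
# so the F-tests above should be treated with caution.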

Make predictions with the best model

PredictGam = predict(gam.V2, newdata=df1)
PredictGam
##        1        2        3        4        5        6        7        8 
## 28.91086 31.13420 28.46681 28.61063 30.41122 31.66897 28.67151 29.16664 
##        9       10       11       12       13       14       15       16 
## 27.13346 30.05562 29.82422 27.12093 29.49080 28.17217 28.43679 28.41720 
##       17       18       19       20       21       22       23       24 
## 26.19959 29.87368 30.48069 32.23872 31.01486 30.18433 27.82808 30.03549 
##       25       26       27       28       29       30       31       32 
## 29.10151 26.32706 29.91631 30.26319 29.43875 30.19150 31.64458 29.37824 
##       33       34       35       36       37       38       39       40 
## 27.84721 27.45977 29.12570 28.42748 26.91717 30.33568 30.42733 30.96881 
##       41       42       43       44       45       46       47       48 
## 30.68534 25.71391 29.55961 27.39555 29.85356 31.45190 30.09780 28.22519 
##       49       50       51       52       53       54       55       56 
## 28.78732 28.99352 30.00349 28.49471 29.82422 29.14920 28.71348 26.51831 
##       57       58       59       60       61       62       63       64 
## 32.29208 31.31612 29.69713 27.67669 29.76828 27.91963 26.26215 27.45046 
##       65       66       67       68       69       70       71       72 
## 29.83578 32.52219 28.15439 30.32024 29.48462 29.73438 29.69320 27.67171 
##       73       74       75       76       77       78       79       80 
## 29.26565 30.08494 28.80645 28.80189 28.52391 30.25423 30.44511 29.32543 
##       81       82       83       84       85       86       87       88 
## 28.80904 26.55978 28.54882 30.58593 28.75823 28.72768 28.87887 33.18017 
##       89       90       91       92       93       94       95       96 
## 29.28659 28.13480 29.55960 26.43631 28.47692 29.29680 29.82032 29.36368 
##       97       98       99      100      101      102      103      104 
## 30.31789 30.46290 28.48833 30.64976 29.91588 28.03428 28.57874 29.63839 
##      105      106      107      108      109      110      111      112 
## 30.26453 30.05562 30.39246 26.50677 28.62754 29.59798 29.24035 28.81933 
##      113      114      115      116      117      118      119      120 
## 29.83811 28.94949 30.93484 31.54517 25.77881 27.90574 27.91963 27.76711 
##      121      122      123      124      125      126      127      128 
## 28.01902 29.08563 31.31505 31.64458 29.33237 30.01770 29.15635 29.50625 
##      129      130      131      132      133      134      135      136 
## 27.84721 29.61296 30.12051 26.45341 29.49080 29.46683 29.54685 30.19959 
##      137      138      139      140      141      142      143      144 
## 30.46290 26.73805 29.45100 26.96429 28.29500 28.33700 30.28140 31.64458 
##      145      146      147      148      149      150      151      152 
## 28.63593 32.00967 30.18309 29.48587 27.84265 28.92865 31.48044 26.87018 
##      153      154      155      156      157      158      159      160 
## 28.94016 27.32065 29.60457 29.36046 30.32024 30.95102 29.67397 28.91444 
##      161      162      163      164      165      166      167      168 
## 26.93495 27.87788 30.26453 28.70210 28.15439 30.80124 29.87637 30.07106 
##      169      170      171      172      173      174      175      176 
## 30.44511 27.84980 31.01486 27.31931 27.60340 28.42871 28.34605 29.14348 
##      177      178      179      180      181      182      183      184 
## 28.17441 30.37306 30.76309 30.44582 30.00349 27.86044 31.64458 29.22509 
##      185      186      187      188      189      190      191      192 
## 29.66632 27.71739 27.01765 30.01770 30.09780 29.72877 28.56713 28.63095 
##      193      194      195      196      197      198      199      200 
## 29.24786 29.57738 28.95405 27.47575 30.67416 29.71099 32.00967 28.04236 
##      201      202      203      204      205      206      207      208 
## 29.04068 26.52679 31.58074 27.42419 28.29890 29.28659 30.44386 29.38002 
##      209      210      211      212      213      214      215      216 
## 30.24540 29.29610 29.30529 28.78410 30.12051 28.40970 29.03549 31.95631 
##      217      218      219      220      221      222      223      224 
## 27.95910 28.15439 29.10299 27.16010 29.14740 29.61296 29.18175 27.85111 
##      225      226      227      228      229      230      231      232 
## 27.72166 28.81933 28.83191 28.50885 33.21574 31.48151 27.09536 29.19448 
##      233      234      235      236      237      238      239      240 
## 30.80124 28.76953 28.11316 31.80996 30.24583 29.93366 28.54934 27.40460 
##      241      242      243      244      245      246      247      248 
## 28.27084 28.65653 28.91444 29.11363 27.20595 29.80253 29.37824 30.06223 
##      249      250      251      252      253      254      255      256 
## 27.35748 31.27948 31.01486 30.61050 29.97560 29.01384 26.51831 29.23007 
##      257      258      259      260      261      262      263      264 
## 31.25934 29.23779 28.74673 27.80268 28.63593 27.73286 30.17387 31.31612 
##      265      266      267      268      269      270      271      272 
## 29.30887 28.62452 29.95188 27.19515 30.81390 28.62842 29.87758 31.04418 
##      273      274      275      276      277      278      279      280 
## 29.02029 27.26630 28.39191 26.99741 30.72752 27.27837 29.64447 28.71115 
##      281      282      283      284      285      286      287      288 
## 30.28140 31.42708 28.04236 28.78866 28.03681 28.90055 29.68818 30.61050 
##      289      290      291      292      293      294      295      296 
## 30.97929 29.24035 28.08017 31.31612 29.46661 29.97560 30.33748 27.17789 
##      297      298      299      300      301      302      303      304 
## 26.66026 28.18425 28.50885 30.03549 29.10151 28.20076 32.15727 29.46809 
##      305      306      307      308      309      310      311      312 
## 32.13949 28.76632 29.07106 29.83811 28.73127 29.09584 28.90055 30.70973 
##      313      314      315      316      317      318      319      320 
## 28.70302 27.67669 28.09796 29.48587 28.93928 26.30026 27.32187 32.00967 
##      321      322      323      324      325      326      327      328 
## 28.70855 30.04612 31.29492 29.04068 28.05713 29.23007 27.53547 28.99605 
##      329      330      331      332      333      334      335      336 
## 29.87368 29.76056 29.16699 28.60415 32.27429 27.81423 29.65260 28.99605 
##      337      338      339      340      341      342      343      344 
## 28.58476 27.84489 29.35865 27.96313 31.07975 28.96728 29.50366 28.38071 
##      345      346      347      348      349      350      351      352 
## 32.23872 27.58393 27.63361 28.76632 29.09404 30.84948 26.24671 28.38665 
##      353      354      355      356      357      358      359      360 
## 29.82422 29.18477 29.50858 29.14365 29.59517 29.68818 31.95631 29.30270 
##      361      362      363      364      365      366      367      368 
## 28.24298 30.75131 28.43679 27.41488 30.56381 31.84554 28.20203 31.65118 
##      369      370      371      372      373      374      375      376 
## 27.03947 27.30412 28.32420 30.04444 29.48497 30.05327 28.80645 27.87142 
##      377      378      379      380 
## 27.22501 29.41704 29.79613 32.02746

Adding Local Regression Fits with lo()

gam.lo = gam(utilisation~s(home_ppg,df=4)+lo(away_ppg,span=0.5)+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
summary(gam.lo)
## 
## Call: gam(formula = utilisation ~ s(home_ppg, df = 4) + lo(away_ppg, 
##     span = 0.5) + total_goal_count, data = df1)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.239  -2.266   2.046   4.163   9.302 
## 
## (Dispersion Parameter for gaussian family taken to be 43.3167)
## 
##     Null Deviance: 18193.43 on 379 degrees of freedom
## Residual Deviance: 16009.92 on 369.6013 degrees of freedom
## AIC: 2522.692 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                             Df  Sum Sq Mean Sq F value   Pr(>F)   
## s(home_ppg, df = 4)        1.0   390.3  390.34  9.0112 0.002866 **
## lo(away_ppg, span = 0.5)   1.0   329.1  329.14  7.5984 0.006132 **
## total_goal_count           1.0     3.0    3.01  0.0694 0.792380   
## Residuals                369.6 16009.9   43.32                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                          Npar Df Npar F     Pr(F)    
## (Intercept)                                          
## s(home_ppg, df = 4)          3.0 11.200 4.759e-07 ***
## lo(away_ppg, span = 0.5)     3.4 -0.002         1    
## total_goal_count                                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
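
One simple way to compare the fitted GAMs is by AIC, where lower values indicate a better trade-off between fit and complexity. A minimal sketch using only the AIC values printed in the summaries above (gam.V1 and gam.V3 are left out because their summaries are not shown):

```r
# AIC values copied from the summary() outputs above.
aic <- c(gam1 = 2542.706, gam.V2 = 2549.897, gam.lo = 2522.692)

# Order the models from lowest (preferred) to highest AIC.
print(sort(aic))

# Name of the model with the lowest AIC among those listed.
best <- names(which.min(aic))
print(best)
```

With the fitted objects in the session, AIC(gam1, gam.V2, gam.lo) should produce the same comparison directly.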

Discussion of Results from the GAM Models

The Generalized Additive Model (GAM) is a strong statistical tool that combines the interpretability of linear models with the flexibility of nonparametric smoothers. The method allows flexible nonlinearities in our independent variables without giving up the additive structure of linear models. Here, three different GAMs are fitted, regressing stadium utilization on the independent variables (home points per game, away points per game, and total goals).
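
The additive structure can be written out explicitly. As a sketch, for a model in which both points-per-game variables enter through smooth functions and total goals enters linearly (the symbols f_1, f_2, beta_0, beta_3 and epsilon_i are notation introduced here, not taken from the analysis above):

```latex
\text{utilisation}_i = \beta_0
  + f_1(\text{home\_ppg}_i)
  + f_2(\text{away\_ppg}_i)
  + \beta_3 \,\text{total\_goal\_count}_i
  + \varepsilon_i
```

Each fitted model above is a special case: replacing f_1 or f_2 with a linear term gives the variants that treat home_ppg or away_ppg linearly.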

Conclusion

Many studies have examined the causes of stadium attendance throughout the world, yet few have used statistical models other than linear regression. This research takes a step toward using different models to understand stadium utilization rates with respect to points per game. Moreover, there is evidence suggesting that home points per game is the most important factor in stadium utilization, holding every other factor constant. The methods used in the study have the potential to improve the prediction of football stadium utilization rates in future projects. Finally, because statistical predictions depend heavily on the number of observations used in a model, a larger sample is one area where this research could improve.