The data used in this research come from <whoscored.com>. We focus on three variables: home points per game, away points per game, and total goals scored.
data <- read.csv(file = "~/Desktop/612 data.csv")
attach(data)
summary(data)
## attendance capacity utilisation home_team_name
## 10792 : 2 Min. :11329 0.99 : 86 AFC Bournemouth : 19
## 24263 : 2 1st Qu.:25960 0.98 : 52 Arsenal : 19
## 40491 : 2 Median :33280 1 : 35 Brighton & Hove Albion: 19
## 52908 : 2 Mean :42086 0.97 : 32 Burnley : 19
## 74519 : 2 3rd Qu.:55556 0.95 : 29 Cardiff City : 19
## 74523 : 2 Max. :90000 0.92 : 19 Chelsea : 19
## (Other):368 (Other):127 (Other) :266
## away_team_name home_ID Away_ID home_ppg
## AFC Bournemouth : 19 Min. : 1.00 Min. : 1.0 Min. :0.470
## Arsenal : 19 1st Qu.: 5.75 1st Qu.: 6.5 1st Qu.:1.210
## Brighton & Hove Albion: 19 Median :10.50 Median :11.5 Median :1.475
## Burnley : 19 Mean :10.50 Mean :11.0 Mean :1.615
## Cardiff City : 19 3rd Qu.:15.25 3rd Qu.:16.0 3rd Qu.:1.917
## Chelsea : 19 Max. :20.00 Max. :20.0 Max. :2.840
## (Other) :266
## away_ppg home_team_goal_count away_team_goal_count total_goal_count
## Min. :0.260 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.840 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:2.000
## Median :1.160 Median :1.000 Median :1.000 Median :3.000
## Mean :1.198 Mean :1.568 Mean :1.253 Mean :2.821
## 3rd Qu.:1.542 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :2.320 Max. :6.000 Max. :6.000 Max. :8.000
##
## home_team_possession away_team_possession stadium
## Min. :23.00 Min. :22.00 Anfield : 19
## 1st Qu.:43.00 1st Qu.:41.00 Cardiff City Stadium: 19
## Median :52.00 Median :48.00 Craven Cottage : 19
## Mean :51.51 Mean :48.49 Emirates Stadium : 19
## 3rd Qu.:59.00 3rd Qu.:57.00 Etihad Stadium : 19
## Max. :78.00 Max. :77.00 Goodison Park : 19
## (Other) :266
str(data)
## 'data.frame': 380 obs. of 15 variables:
## $ attendance : Factor w/ 373 levels "10199","10227",..: 354 261 4 30 84 75 164 155 278 336 ...
## $ capacity : int 74879 52338 11329 21577 25700 24500 32050 32384 54074 60704 ...
## $ utilisation : Factor w/ 36 levels "#VALUE!","0.32",..: 34 34 26 28 32 33 32 30 33 34 ...
## $ home_team_name : Factor w/ 20 levels "AFC Bournemouth",..: 14 15 1 18 9 10 20 16 12 2 ...
## $ away_team_name : Factor w/ 20 levels "AFC Bournemouth",..: 11 17 5 3 7 6 8 4 19 13 ...
## $ home_ID : int 2 15 14 12 16 19 17 11 3 5 ...
## $ Away_ID : int 8 16 18 20 9 4 7 13 10 1 ...
## $ home_ppg : num 1.89 1.32 1.53 1.42 1.11 0.47 1.79 1.21 2.79 2.37 ...
## $ away_ppg : num 1.32 1.74 0.74 0.68 1.53 1.58 1.05 0.89 1.11 2.32 ...
## $ home_team_goal_count: int 2 1 2 2 0 0 2 0 4 0 ...
## $ away_team_goal_count: int 1 2 0 0 2 3 2 0 0 2 ...
## $ total_goal_count : int 3 3 2 2 2 3 4 0 4 2 ...
## $ home_team_possession: int 44 44 62 56 60 40 57 48 61 40 ...
## $ away_team_possession: int 56 56 38 44 40 60 43 52 39 60 ...
## $ stadium : Factor w/ 23 levels "Anfield","Cardiff City Stadium",..: 11 13 22 21 3 17 10 14 1 4 ...
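df1 <- data.frame(utilisation, home_ppg, away_ppg, total_goal_count)
df1$total_goal_count <- as.numeric(df1$total_goal_count)
# Note: utilisation is a factor, so as.numeric() returns its integer level codes (1 to 36),
# not the underlying proportions. as.numeric(as.character(...)) would recover the values,
# with the "#VALUE!" level becoming NA.
df1$utilisation <- as.numeric(df1$utilisation)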
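Since `str(data)` reports `utilisation` as a factor, it may help to see what `as.numeric()` does with a factor. The following is a minimal sketch on a made-up vector (the values in `u` are illustrative assumptions, not rows from the data set). Direct conversion yields the integer level codes, whereas converting through `as.character()` recovers the printed numbers and turns non-numeric entries such as `#VALUE!` into `NA`.

```r
# Hypothetical toy factor mimicking the utilisation column (values are made up)
u <- factor(c("0.99", "0.32", "#VALUE!", "1"))

# Direct conversion returns the position of each value among the sorted levels
codes <- as.numeric(u)

# Going through character recovers the numbers; "#VALUE!" cannot be parsed and becomes NA
vals <- suppressWarnings(as.numeric(as.character(u)))

print(codes)
print(vals)
```

Which of the two conversions is appropriate depends on whether the analysis is meant to use the proportions themselves or only their rank order among the observed levels.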
With the given data, our research will address the following questions -
library(sjlabelled)
library(sjmisc)
library(sjPlot)
library(sjstats)
q = cor(df1)
sjp.corr(df1, show.legend = TRUE)
## Warning: Removed 10 rows containing missing values (geom_text).
## Looking at the correlations plotted with sjPlot, Home_ppg and attendance are correlated: the sign in front of the coefficient gives the direction, the strength is medium (greater than 0.3), and the estimate is flagged with **. Interestingly, Away_ppg and utilisation are also correlated, which means we can fit an lm model on these variables to learn more about the relationship.
# split the data (70%)
library(caTools)
set.seed(100)
ind<- sample(2, nrow(df1), replace = TRUE, prob = c(0.7,0.3))
traindf <- df1[ind==1,]
testdf <- df1[ind==2,]
library(snakecase)
fit1 <- lm(utilisation~home_ppg+away_ppg+total_goal_count, data=traindf)
sjPlot::tab_model(fit1)
| Predictors | Estimates (utilisation) | CI | p |
|---|---|---|---|
| (Intercept) | 27.62 | 24.60 – 30.65 | <0.001 |
| home_ppg | -0.86 | -2.17 – 0.45 | 0.198 |
| away_ppg | 2.21 | 0.80 – 3.62 | 0.002 |
| total_goal_count | 0.29 | -0.17 – 0.76 | 0.214 |
| Observations | 267 | | |
| R² / R² adjusted | 0.043 / 0.032 | | |
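The 70/30 split also produces a held-out set, which can be used to check how a fitted linear model performs on unseen rows. The following is a minimal sketch on simulated data: the data frame `sim`, its coefficients, and the noise level are assumptions made for illustration, and only the column names follow the text. It computes the root mean squared error on the held-out rows and compares it with a baseline that always predicts the training mean.

```r
set.seed(100)

# Simulated stand-in for df1 (column names follow the text; the values are made up)
n <- 200
sim <- data.frame(
  home_ppg = runif(n, 0.5, 2.8),
  away_ppg = runif(n, 0.3, 2.3),
  total_goal_count = rpois(n, 2.8)
)
sim$utilisation <- 0.9 + 0.03 * sim$away_ppg + rnorm(n, sd = 0.05)

# Same random 70/30 assignment scheme as in the text
ind <- sample(2, nrow(sim), replace = TRUE, prob = c(0.7, 0.3))
train <- sim[ind == 1, ]
test <- sim[ind == 2, ]

fit <- lm(utilisation ~ home_ppg + away_ppg + total_goal_count, data = train)

# Root mean squared error on the held-out rows
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$utilisation - pred)^2))

# Baseline: always predict the mean of the training response
rmse_baseline <- sqrt(mean((test$utilisation - mean(train$utilisation))^2))

print(c(model = rmse, baseline = rmse_baseline))
```

If the model RMSE is not clearly below the baseline, the predictors add little out-of-sample information, which would be consistent with a low R² on the training data.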
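# Note: with a factor response, a binomial glm treats the first factor level as "failure"
# and every other level as "success"
traindf$utilisation <- as.factor(traindf$utilisation)
loglm <- glm(utilisation ~ home_ppg + away_ppg + total_goal_count, data = traindf, family = binomial)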
summary(loglm)
##
## Call:
## glm(formula = utilisation ~ home_ppg + away_ppg + total_goal_count,
## family = binomial, data = traindf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2722 0.0601 0.0758 0.0997 0.1705
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.3212 3.6857 1.172 0.241
## home_ppg -0.2933 1.8100 -0.162 0.871
## away_ppg 0.6281 2.0053 0.313 0.754
## total_goal_count 0.4230 0.7388 0.573 0.567
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13.171 on 266 degrees of freedom
## Residual deviance: 12.703 on 263 degrees of freedom
## AIC: 20.703
##
## Number of Fisher Scoring iterations: 9
falls in line with this model.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.16.1
gam1<- gam(utilisation~ns(home_ppg, 4) + away_ppg+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
summary(gam1)
##
## Call: gam(formula = utilisation ~ ns(home_ppg, 4) + away_ppg + total_goal_count,
## data = df1)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -29.015 -2.219 2.023 4.365 9.242
##
## (Dispersion Parameter for gaussian family taken to be 46.0598)
##
## Null Deviance: 18193.43 on 379 degrees of freedom
## Residual Deviance: 17180.31 on 373 degrees of freedom
## AIC: 2542.706
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## ns(home_ppg, 4) 4 676.9 169.22 3.6739 0.005984 **
## away_ppg 1 333.5 333.46 7.2396 0.007452 **
## total_goal_count 1 2.8 2.78 0.0604 0.806055
## Residuals 373 17180.3 46.06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(gam1, se=TRUE, col="blue") # not an impressive representation
par(mfrow=c(1,3)) # Show 3 panels, 1 for each predictor: home_ppg, away_ppg, total_goal_count
gam.V1 = gam(utilisation~s(home_ppg,5)+away_ppg+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
gam.V2 = gam(utilisation~home_ppg+s(away_ppg,5)+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
gam.V3 = gam(utilisation~ns(home_ppg,4)+s(away_ppg,5)+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
anova(gam.V1, gam.V2, gam.V3, test="F")
## Analysis of Deviance Table
##
## Model 1: utilisation ~ s(home_ppg, 5) + away_ppg + total_goal_count
## Model 2: utilisation ~ home_ppg + s(away_ppg, 5) + total_goal_count
## Model 3: utilisation ~ ns(home_ppg, 4) + s(away_ppg, 5) + total_goal_count
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 372 14659
## 2 372 17417 -2.6242e-05 -2757.54 2.2636e+06 <2e-16 ***
## 3 369 17130 3.0000e+00 286.37 2.0562e+00 0.1056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
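The sequential F-test printed by `anova()` is usually interpreted for nested models, where each model contains the previous one as a special case. In the table above, Models 1 and 2 have the same residual degrees of freedom, which is why the `Df` entry between them is essentially zero. The following is a minimal sketch of a nested comparison using base R only (`lm` with `splines::ns`) on simulated data. The variables `x1`, `x2`, and `y` are made up, and the sketch swaps in `lm` for `gam` so that it runs without extra packages.

```r
library(splines)  # ships with base R; provides ns()

set.seed(1)

# Simulated data (made up): y depends nonlinearly on x1 and linearly on x2
n <- 300
x1 <- runif(n, 0, 3)
x2 <- runif(n, 0, 3)
y <- sin(2 * x1) + 0.5 * x2 + rnorm(n, sd = 0.3)
d <- data.frame(y, x1, x2)

# Nested sequence: each model contains the previous one as a special case
m1 <- lm(y ~ x1 + x2, data = d)                # linear in both predictors
m2 <- lm(y ~ ns(x1, 4) + x2, data = d)         # natural spline in x1 only
m3 <- lm(y ~ ns(x1, 4) + ns(x2, 4), data = d)  # natural splines in both

# Each row tests whether the extra flexibility reduces the residual sum of squares
cmp <- anova(m1, m2, m3, test = "F")
print(cmp)
```

In a nested sequence the residual degrees of freedom fall from row to row, so each F statistic has a clear interpretation as the gain from the added terms.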
summary(gam.V2)
##
## Call: gam(formula = utilisation ~ home_ppg + s(away_ppg, 5) + total_goal_count,
## data = df1)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -29.611 -2.464 1.699 4.661 7.679
##
## (Dispersion Parameter for gaussian family taken to be 46.8189)
##
## Null Deviance: 18193.43 on 379 degrees of freedom
## Residual Deviance: 17416.62 on 372 degrees of freedom
## AIC: 2549.897
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## home_ppg 1 390.2 390.21 8.3345 0.004117 **
## s(away_ppg, 5) 1 337.1 337.11 7.2003 0.007614 **
## total_goal_count 1 0.3 0.30 0.0064 0.936145
## Residuals 372 17416.6 46.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## home_ppg
## s(away_ppg, 5) 4 0.2624 0.902
## total_goal_count
plot(gam.V2, se=TRUE, col="blue")
plot(gam.V3, se=TRUE, col="red")
# The ANOVA is read as favouring the second model, which uses a linear term for home_ppg: Model 3 does not improve on it significantly (p = 0.1056)
PredictGam = predict(gam.V2, newdata=df1)
PredictGam
## 1 2 3 4 5 6 7 8
## 28.91086 31.13420 28.46681 28.61063 30.41122 31.66897 28.67151 29.16664
## 9 10 11 12 13 14 15 16
## 27.13346 30.05562 29.82422 27.12093 29.49080 28.17217 28.43679 28.41720
## 17 18 19 20 21 22 23 24
## 26.19959 29.87368 30.48069 32.23872 31.01486 30.18433 27.82808 30.03549
## 25 26 27 28 29 30 31 32
## 29.10151 26.32706 29.91631 30.26319 29.43875 30.19150 31.64458 29.37824
## 33 34 35 36 37 38 39 40
## 27.84721 27.45977 29.12570 28.42748 26.91717 30.33568 30.42733 30.96881
## 41 42 43 44 45 46 47 48
## 30.68534 25.71391 29.55961 27.39555 29.85356 31.45190 30.09780 28.22519
## 49 50 51 52 53 54 55 56
## 28.78732 28.99352 30.00349 28.49471 29.82422 29.14920 28.71348 26.51831
## 57 58 59 60 61 62 63 64
## 32.29208 31.31612 29.69713 27.67669 29.76828 27.91963 26.26215 27.45046
## 65 66 67 68 69 70 71 72
## 29.83578 32.52219 28.15439 30.32024 29.48462 29.73438 29.69320 27.67171
## 73 74 75 76 77 78 79 80
## 29.26565 30.08494 28.80645 28.80189 28.52391 30.25423 30.44511 29.32543
## 81 82 83 84 85 86 87 88
## 28.80904 26.55978 28.54882 30.58593 28.75823 28.72768 28.87887 33.18017
## 89 90 91 92 93 94 95 96
## 29.28659 28.13480 29.55960 26.43631 28.47692 29.29680 29.82032 29.36368
## 97 98 99 100 101 102 103 104
## 30.31789 30.46290 28.48833 30.64976 29.91588 28.03428 28.57874 29.63839
## 105 106 107 108 109 110 111 112
## 30.26453 30.05562 30.39246 26.50677 28.62754 29.59798 29.24035 28.81933
## 113 114 115 116 117 118 119 120
## 29.83811 28.94949 30.93484 31.54517 25.77881 27.90574 27.91963 27.76711
## 121 122 123 124 125 126 127 128
## 28.01902 29.08563 31.31505 31.64458 29.33237 30.01770 29.15635 29.50625
## 129 130 131 132 133 134 135 136
## 27.84721 29.61296 30.12051 26.45341 29.49080 29.46683 29.54685 30.19959
## 137 138 139 140 141 142 143 144
## 30.46290 26.73805 29.45100 26.96429 28.29500 28.33700 30.28140 31.64458
## 145 146 147 148 149 150 151 152
## 28.63593 32.00967 30.18309 29.48587 27.84265 28.92865 31.48044 26.87018
## 153 154 155 156 157 158 159 160
## 28.94016 27.32065 29.60457 29.36046 30.32024 30.95102 29.67397 28.91444
## 161 162 163 164 165 166 167 168
## 26.93495 27.87788 30.26453 28.70210 28.15439 30.80124 29.87637 30.07106
## 169 170 171 172 173 174 175 176
## 30.44511 27.84980 31.01486 27.31931 27.60340 28.42871 28.34605 29.14348
## 177 178 179 180 181 182 183 184
## 28.17441 30.37306 30.76309 30.44582 30.00349 27.86044 31.64458 29.22509
## 185 186 187 188 189 190 191 192
## 29.66632 27.71739 27.01765 30.01770 30.09780 29.72877 28.56713 28.63095
## 193 194 195 196 197 198 199 200
## 29.24786 29.57738 28.95405 27.47575 30.67416 29.71099 32.00967 28.04236
## 201 202 203 204 205 206 207 208
## 29.04068 26.52679 31.58074 27.42419 28.29890 29.28659 30.44386 29.38002
## 209 210 211 212 213 214 215 216
## 30.24540 29.29610 29.30529 28.78410 30.12051 28.40970 29.03549 31.95631
## 217 218 219 220 221 222 223 224
## 27.95910 28.15439 29.10299 27.16010 29.14740 29.61296 29.18175 27.85111
## 225 226 227 228 229 230 231 232
## 27.72166 28.81933 28.83191 28.50885 33.21574 31.48151 27.09536 29.19448
## 233 234 235 236 237 238 239 240
## 30.80124 28.76953 28.11316 31.80996 30.24583 29.93366 28.54934 27.40460
## 241 242 243 244 245 246 247 248
## 28.27084 28.65653 28.91444 29.11363 27.20595 29.80253 29.37824 30.06223
## 249 250 251 252 253 254 255 256
## 27.35748 31.27948 31.01486 30.61050 29.97560 29.01384 26.51831 29.23007
## 257 258 259 260 261 262 263 264
## 31.25934 29.23779 28.74673 27.80268 28.63593 27.73286 30.17387 31.31612
## 265 266 267 268 269 270 271 272
## 29.30887 28.62452 29.95188 27.19515 30.81390 28.62842 29.87758 31.04418
## 273 274 275 276 277 278 279 280
## 29.02029 27.26630 28.39191 26.99741 30.72752 27.27837 29.64447 28.71115
## 281 282 283 284 285 286 287 288
## 30.28140 31.42708 28.04236 28.78866 28.03681 28.90055 29.68818 30.61050
## 289 290 291 292 293 294 295 296
## 30.97929 29.24035 28.08017 31.31612 29.46661 29.97560 30.33748 27.17789
## 297 298 299 300 301 302 303 304
## 26.66026 28.18425 28.50885 30.03549 29.10151 28.20076 32.15727 29.46809
## 305 306 307 308 309 310 311 312
## 32.13949 28.76632 29.07106 29.83811 28.73127 29.09584 28.90055 30.70973
## 313 314 315 316 317 318 319 320
## 28.70302 27.67669 28.09796 29.48587 28.93928 26.30026 27.32187 32.00967
## 321 322 323 324 325 326 327 328
## 28.70855 30.04612 31.29492 29.04068 28.05713 29.23007 27.53547 28.99605
## 329 330 331 332 333 334 335 336
## 29.87368 29.76056 29.16699 28.60415 32.27429 27.81423 29.65260 28.99605
## 337 338 339 340 341 342 343 344
## 28.58476 27.84489 29.35865 27.96313 31.07975 28.96728 29.50366 28.38071
## 345 346 347 348 349 350 351 352
## 32.23872 27.58393 27.63361 28.76632 29.09404 30.84948 26.24671 28.38665
## 353 354 355 356 357 358 359 360
## 29.82422 29.18477 29.50858 29.14365 29.59517 29.68818 31.95631 29.30270
## 361 362 363 364 365 366 367 368
## 28.24298 30.75131 28.43679 27.41488 30.56381 31.84554 28.20203 31.65118
## 369 370 371 372 373 374 375 376
## 27.03947 27.30412 28.32420 30.04444 29.48497 30.05327 28.80645 27.87142
## 377 378 379 380
## 27.22501 29.41704 29.79613 32.02746
gam.lo = gam(utilisation~s(home_ppg,df=4)+lo(away_ppg,span=0.5)+total_goal_count, data=df1)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
summary(gam.lo)
##
## Call: gam(formula = utilisation ~ s(home_ppg, df = 4) + lo(away_ppg,
## span = 0.5) + total_goal_count, data = df1)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -29.239 -2.266 2.046 4.163 9.302
##
## (Dispersion Parameter for gaussian family taken to be 43.3167)
##
## Null Deviance: 18193.43 on 379 degrees of freedom
## Residual Deviance: 16009.92 on 369.6013 degrees of freedom
## AIC: 2522.692
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(home_ppg, df = 4) 1.0 390.3 390.34 9.0112 0.002866 **
## lo(away_ppg, span = 0.5) 1.0 329.1 329.14 7.5984 0.006132 **
## total_goal_count 1.0 3.0 3.01 0.0694 0.792380
## Residuals 369.6 16009.9 43.32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(home_ppg, df = 4) 3.0 11.200 4.759e-07 ***
## lo(away_ppg, span = 0.5) 3.4 -0.002 1
## total_goal_count
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
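The models fitted above share an additive form, which can be sketched in symbols as follows. The notation ($\beta_0$, $\beta_3$, $f_1$, $f_2$, $\varepsilon_i$) is introduced here for illustration and is an assumption about the model form rather than a formula taken from the source. It corresponds to a fit such as `gam.V3` or `gam.lo`, with flexible terms for the two points-per-game variables and a linear term for total goals.

```latex
\[
\text{utilisation}_i
  = \beta_0
  + f_1(\text{home\_ppg}_i)
  + f_2(\text{away\_ppg}_i)
  + \beta_3\,\text{total\_goal\_count}_i
  + \varepsilon_i
\]
```

Here $f_1$ and $f_2$ stand for the spline or local-regression terms (`ns`, `s`, `lo`), and replacing either with a straight line recovers the simpler models, for example `gam.V2` when $f_1$ is linear.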
The Generalized Additive Model (GAM) is a flexible statistical tool that combines useful aspects of linear and nonparametric models with the aim of lowering the error rate. This method allows flexible nonlinearities for our independent variables while keeping the additive structure of linear models. For this method, three different GAMs are used to regress utilisation on the independent variables (home points per game, away points per game, and total goals) based on-
Many studies have analysed stadium attendance and its causes throughout the world, yet few have used statistical models other than linear regression. This research extends the use of different models to understand stadium utilization rates with respect to points per game. Moreover, there is evidence that home points per game is the most important factor in stadium utilization, holding every other factor constant. The methods used in the study have the potential to improve the prediction of football stadium utilization rates in future projects. Finally, because statistical predictions depend heavily on the number of observations used in a model, a larger sample is one area where this research could improve.