
Introduction

This is hopefully the first in a weekly series of forecasts.

Using data from the very first recorded match in 1877 through to matches this week, collected from http://www.soccerbase.com, I plan to develop a forecasting model to predict match outcomes. This first week’s attempt will undoubtedly be rudimentary and contain many errors, but it is important to start making forecasts in order to improve them.

First the data are loaded (they can be accessed here), then the forecasts are constructed (those matches can be found here). The forecasts are then reported. After this, an exercise in ex post forecast accuracy is carried out: for each week of the previous year, the model is re-estimated and used to forecast matches in the subsequent week. Finally, a table is presented with details on all the matches forecast, and the forecast outcomes.

Loading the Data

The forecasts are now loaded up (the code that creates them involves calculating Elo scores and league tables and hence takes some time to run):

library(knitr)
wd <- "/home/readejj/Dropbox/Teaching/Reading/ec313/2015/Football-forecasts/"
forecast.matches <- read.csv(paste(wd,"forecasts_",Sys.Date(),".csv",sep=""))
forecast.matches <- forecast.matches[!is.na(forecast.matches$outcome),]
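
The Elo scores mentioned above are calculated elsewhere and that code is not shown. As a rough illustration only, the standard Elo expected score and rating update can be sketched as follows; the function names, the ratings and the K-factor of 20 are assumptions for the example, not the values used to build these forecasts.

```r
# Standard Elo expected score for the home team (ratings are hypothetical).
elo_expected <- function(rating_home, rating_away) {
  1 / (1 + 10^((rating_away - rating_home) / 400))
}

# Update both ratings after a match; score_home is 1 (home win), 0.5 (draw) or 0 (away win).
# k = 20 is an assumed K-factor, not the one used for these forecasts.
elo_update <- function(rating_home, rating_away, score_home, k = 20) {
  expected_home <- elo_expected(rating_home, rating_away)
  change <- k * (score_home - expected_home)
  c(home = rating_home + change, away = rating_away - change)
}

elo_expected(1500, 1500)    # equal ratings give an expected score of 0.5
elo_update(1500, 1500, 1)   # home win: home moves to 1510, away to 1490
```

Note that this textbook version has no home-advantage adjustment; whether the Elo calculation behind these forecasts includes one is not stated here.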

The Forecast Model

The forecasts were generated elsewhere using a model based on team Elo scores, league positions (pos), points amassed (pts), goals scored (gs), goal differences (gd), matches played in a year (pld), recent form (form), and divisional tier (tier). The variables are entered as the level for the home team (suffix 1), the difference between the home team and away team (.D, in order to reduce potential collinearity), and the difference squared (.D.2). A time trend is also added in order to pick up any pattern between years in terms of home advantage. A simple improvement over this would be to include dummies for each year in order to allow non-linear variation in home advantage.
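
To make the naming convention concrete, here is a minimal sketch of how the level, difference and squared-difference terms could be built for one variable. The data frame, the team names and the points totals are invented for illustration, and taking the difference as home minus away is an assumption.

```r
# Hypothetical pre-match standings for two fixtures; columns ending in 1 and 2
# hold the home and away team values respectively (an assumed layout).
fixtures <- data.frame(
  team1 = c("Team A", "Team C"),
  team2 = c("Team B", "Team D"),
  pts1  = c(52, 38),
  pts2  = c(47, 42)
)

# The level for the home team is pts1 itself; add the difference and its square.
fixtures$pts.D   <- fixtures$pts1 - fixtures$pts2   # assumed to be home minus away
fixtures$pts.D.2 <- fixtures$pts.D^2

fixtures[, c("team1", "team2", "pts1", "pts.D", "pts.D.2")]
```

The same three-column pattern would then be repeated for pld, gs, gd, pos, form and tier.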

The model is estimated via Ordinary Least Squares on the discrete dependent variable:

$$y_{it} = \begin{cases} 0 & \text{if away win,} \\ 0.5 & \text{if draw,} \\ 1 & \text{if home win.} \end{cases}$$

The resulting forecasts thus may fall outside the unit interval, and the errors will be heteroskedastic. Nonetheless, it is a reasonable first model; the task of automating data collection so that the estimation period is fully up to date has been an important first step.
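
As a small sketch of the coding described above, the dependent variable can be derived from the goals scored by each side. The function names and the goals arguments are hypothetical, and the truncation helper is one possible way of handling forecasts outside the unit interval, not something the model here does.

```r
# Code a match result as 1 (home win), 0.5 (draw) or 0 (away win).
code_outcome <- function(goals_home, goals_away) {
  ifelse(goals_home > goals_away, 1,
         ifelse(goals_home == goals_away, 0.5, 0))
}

# Optional: truncate linear-model forecasts to the unit interval.
clamp_unit <- function(forecast) {
  pmin(pmax(forecast, 0), 1)
}

code_outcome(c(2, 1, 0), c(1, 1, 3))   # home win, draw, away win: 1.0 0.5 0.0
clamp_unit(c(-0.05, 0.40, 1.10))       # 0.0 0.4 1.0
```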

The model is estimated here and reported:

res.eng <- read.csv(paste(wd,"historical_",Sys.Date(),".csv",sep=""))
model <- lm(outcome ~ E.1 + pts1 + pts.D + pts.D.2 + pld1 + pld.D + pld.D.2 + gs1 + gs.D + gs.D.2 
            + gd1 + gd.D + gd.D.2 
            + pos1 + pos.D + pos.D.2 + form1 + form.D + form.D.2 + tier1 + tier.D + tier.D.2 + season.d,
            data=res.eng)
summary(model)
## 
## Call:
## lm(formula = outcome ~ E.1 + pts1 + pts.D + pts.D.2 + pld1 + 
##     pld.D + pld.D.2 + gs1 + gs.D + gs.D.2 + gd1 + gd.D + gd.D.2 + 
##     pos1 + pos.D + pos.D.2 + form1 + form.D + form.D.2 + tier1 + 
##     tier.D + tier.D.2 + season.d, data = res.eng)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0175 -0.2930  0.1394  0.3497  0.8446 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.016e-01  7.143e-03  56.222  < 2e-16 ***
## E.1          4.047e-01  1.098e-02  36.864  < 2e-16 ***
## pts1         1.034e-03  4.309e-04   2.400  0.01639 *  
## pts.D       -2.852e-03  3.154e-04  -9.043  < 2e-16 ***
## pts.D.2     -1.375e-05  6.576e-06  -2.091  0.03653 *  
## pld1        -1.714e-03  6.194e-04  -2.767  0.00565 ** 
## pld.D        3.385e-03  7.240e-04   4.676 2.93e-06 ***
## pld.D.2     -4.603e-05  3.132e-05  -1.470  0.14158    
## gs1          5.092e-04  1.737e-04   2.932  0.00337 ** 
## gs.D        -3.284e-05  1.552e-04  -0.212  0.83238    
## gs.D.2      -1.610e-06  4.766e-06  -0.338  0.73556    
## gd1         -6.829e-04  2.446e-04  -2.792  0.00524 ** 
## gd.D         3.427e-03  1.785e-04  19.201  < 2e-16 ***
## gd.D.2      -5.695e-06  2.381e-06  -2.392  0.01674 *  
## pos1         7.893e-04  3.053e-04   2.585  0.00973 ** 
## pos.D       -4.206e-04  2.584e-04  -1.628  0.10355    
## pos.D.2      3.573e-05  1.189e-05   3.004  0.00267 ** 
## form1        7.773e-04  3.574e-04   2.175  0.02966 *  
## form.D      -2.158e-03  3.345e-04  -6.451 1.12e-10 ***
## form.D.2    -7.858e-05  3.045e-05  -2.581  0.00985 ** 
## tier1        1.978e-03  7.784e-04   2.541  0.01104 *  
## tier.D      -5.403e-02  3.173e-03 -17.027  < 2e-16 ***
## tier.D.2    -5.891e-03  1.278e-03  -4.612 4.00e-06 ***
## season.d    -1.103e-03  3.134e-05 -35.199  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4008 on 215583 degrees of freedom
##   (37937 observations deleted due to missingness)
## Multiple R-squared:  0.05675,    Adjusted R-squared:  0.05665 
## F-statistic:   564 on 23 and 215583 DF,  p-value: < 2.2e-16

This week’s forecasts all lie between zero and one, and can be read loosely as the probability of the home team winning. Thus, a forecast well below 50% suggests that the away team is more likely to win, while a forecast around 50% implies that a draw is quite a likely result. The disappointing aspect of these forecasts is that there is not a huge amount of variation: the gap between the largest and the smallest forecast is only 0.4655109.
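
The spread quoted above can be checked from the largest and smallest forecasts in the table at the end of this post. The sketch below does so and also attaches a rough verbal label to each forecast; the 0.45 and 0.55 cut-offs are an arbitrary assumption for illustration.

```r
# Three forecasts taken from the table of forecasts in this post.
forecasts <- c(ManUtd_Cambridge     = 0.8570219,
               Brentford_Middlesbro = 0.4910053,
               Bolton_Liverpool     = 0.3915110)

# Gap between the largest and smallest forecast.
forecast_spread <- diff(range(forecasts))
forecast_spread   # 0.4655109

# Rough reading of each forecast; the cut-offs are assumed, not estimated.
reading <- cut(forecasts, breaks = c(-Inf, 0.45, 0.55, Inf),
               labels = c("away win favoured", "close to a draw", "home win favoured"))
data.frame(forecast = forecasts, reading = reading)
```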

The Forecasts

First, our Premier League forecasts:

prem.matches <- forecast.matches[forecast.matches$division=="English Premier",]
prem.matches$id <- 1:NROW(prem.matches)
par(mar=c(9,4,4,5)+.1)
plot(prem.matches$id,prem.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend Premier League Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=prem.matches$id,labels=paste(prem.matches$team1,prem.matches$team2,sep=" v "),las=2,cex.axis=0.65)

Hence Man United, Arsenal, Southampton and Stoke are all expected to gain home wins, with forecasts of around 0.7, while Chelsea have a 63% probability of overcoming Man City (note that this model does not factor in team news such as the suspension of Diego Costa, or the injury to Stoke’s Bojan).

Next, our Championship forecasts:

champ.matches <- forecast.matches[forecast.matches$division=="English Championship",]
champ.matches$id <- 1:NROW(champ.matches)
par(mar=c(9,4,4,5)+.1)
plot(champ.matches$id,champ.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend Championship Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=champ.matches$id,labels=paste(champ.matches$team1,champ.matches$team2,sep=" v "),las=2,cex.axis=0.65)

There is a greater range of probabilistic forecasts for the Championship than for the Premier League, with Blackpool and Cardiff only just above 40% to beat Brighton and Derby, respectively.

Next, our League One forecasts:

lg1.matches <- forecast.matches[forecast.matches$division=="English League One",]
lg1.matches$id <- 1:NROW(lg1.matches)
par(mar=c(9,4,4,5)+.1)
plot(lg1.matches$id,lg1.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend League One Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=lg1.matches$id,labels=paste(lg1.matches$team1,lg1.matches$team2,sep=" v "),las=2,cex.axis=0.65)

Next, our League Two forecasts:

lg2.matches <- forecast.matches[forecast.matches$division=="English League Two",]
lg2.matches$id <- 1:NROW(lg2.matches)
par(mar=c(9,4,4,5)+.1)
plot(lg2.matches$id,lg2.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend League Two Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=lg2.matches$id,labels=paste(lg2.matches$team1,lg2.matches$team2,sep=" v "),las=2,cex.axis=0.65)

Next, our Football Conference forecasts:

conf.matches <- forecast.matches[forecast.matches$division=="Football Conference",]
conf.matches$id <- 1:NROW(conf.matches)
par(mar=c(9,4,4,5)+.1)
plot(conf.matches$id,conf.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend Football Conference Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=conf.matches$id,labels=paste(conf.matches$team1,conf.matches$team2,sep=" v "),las=2,cex.axis=0.65)

Finally, there are a number of FA Cup replays still to be played:

facup.matches <- forecast.matches[forecast.matches$division=="English FA Cup",]
facup.matches$id <- 1:NROW(facup.matches)
par(mar=c(9,4,4,5)+.1)
plot(facup.matches$id,facup.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend English FA Cup Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=facup.matches$id,labels=paste(facup.matches$team1,facup.matches$team2,sep=" v "),las=2,cex.axis=0.65)

At about 86%, the likelihood of Man United beating Cambridge in their replay at Old Trafford is the largest probability of a home win in all the forecasts produced for this week. Similarly, Bolton’s 39% chance of beating Liverpool in their replay is the smallest probability of a home victory, showing the impact of the divisional difference variable in the regression model.

Training forecasts and Mincer-Zarnowitz Testing

In this section, we run regressions over the previous calendar year, week by week, and consider the quality of these forecasts against actual outcomes. We use a Mincer-Zarnowitz regression to do so, namely $y_{it} = \alpha + \beta \hat{y}_{it} + e_{it}$, so we regress outcomes on forecasts. The test of forecast accuracy is that $\alpha = 0$ and $\beta = 1$, namely that on average, or in expectation, our forecasts are equal to outcomes, and there is no bias.
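
Because the hypothesis involves both coefficients at once, one option is a joint F-test rather than two separate t-tests. The following is a minimal sketch on simulated data: the forecasts and outcomes are generated purely for illustration, and the restricted model simply imposes α = 0 and β = 1, so that its residuals are outcome minus forecast.

```r
set.seed(1)
n <- 500
forecast <- runif(n, 0.3, 0.8)       # simulated forecasts (illustration only)
outcome  <- rbinom(n, 1, forecast)   # simulated results, draws ignored for simplicity

mz_fit <- lm(outcome ~ forecast)     # unrestricted: alpha and beta estimated freely
rss_u  <- sum(residuals(mz_fit)^2)   # unrestricted residual sum of squares
rss_r  <- sum((outcome - forecast)^2)  # restricted: alpha = 0, beta = 1 imposed

# F-statistic for the two restrictions, with n - 2 residual degrees of freedom.
f_stat  <- ((rss_r - rss_u) / 2) / (rss_u / (n - 2))
p_value <- pf(f_stat, df1 = 2, df2 = n - 2, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

A small p-value would indicate that the forecasts are biased in level, in slope, or in both. Since the errors here are heteroskedastic, this classical F-test should be read as indicative only.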

# Weekly test windows covering the previous 365 days
test.start <- seq(Sys.Date()-365,Sys.Date(),by="weeks")
test.end <- test.start+6
test.outcomes <- data.frame()
for(i in 1:NROW(test.start)) {
  # Estimate on all matches before the window, then forecast the matches inside it
  training.data <- res.eng[res.eng$date<test.start[i],]
  test.data <- res.eng[res.eng$date>=test.start[i] & res.eng$date<=test.end[i],]
  if(NROW(test.data)>0){
    model <- lm(outcome ~ E.1 + pts1 + pts.D + pts.D.2 + pld1 + pld.D + pld.D.2 + gs1 + gs.D + gs.D.2 
                + gd1 + gd.D + gd.D.2 
                + pos1 + pos.D + pos.D.2 + form1 + form.D + form.D.2 + tier1 + tier.D + tier.D.2 + season.d,
                data=training.data)
    test.data$"(Intercept)" <- 1
    test.data$forecast <- as.matrix(test.data[,variable.names(model)]) %*% as.numeric(model$coefficients)
    test.outcomes <- rbind(test.outcomes,test.data[,c("match_id","team1","outcome","team2","forecast")])  
  }
}
#mincer-zarnowitz regression
mz <- lm(outcome ~ forecast,data=test.outcomes)
summary(mz)
## 
## Call:
## lm(formula = outcome ~ forecast, data = test.outcomes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86331 -0.44177  0.00247  0.38784  0.86931 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.08919    0.03815  -2.338   0.0194 *  
## forecast     1.12728    0.06460  17.450   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4144 on 4784 degrees of freedom
##   (4093 observations deleted due to missingness)
## Multiple R-squared:  0.05984,    Adjusted R-squared:  0.05964 
## F-statistic: 304.5 on 1 and 4784 DF,  p-value: < 2.2e-16
calib <- aggregate(test.outcomes$outcome,by=list(round(test.outcomes$forecast,2)),FUN=mean,na.rm=T)
plot(calib$Group.1,calib$x,xlim=range(0,1),ylim=range(0,1),main="Calibration of Forecasts, Graphically",
     ylab="% of time match forecast turned out as home win",xlab="Forecast probability of home win")
abline(0,1)

The regression summary suggests that the model is not particularly bad: the α coefficient is significantly different from zero at the 5% level but not at the 1% level (p = 0.0194), and a t-test finds the β coefficient only marginally different from 1 (t = (1.12728 − 1)/0.06460 ≈ 1.97). Similarly, the calibration plot, which shows how often matches forecast in particular intervals turn out as predicted, suggests that the model is reasonably accurate. In the plot, we would hope to find the points scattered around the 45-degree line, since on that line matches forecast to end as a home win with a probability of x% turn out as home wins x% of the time. What we actually find is that the model exhibits favourite-longshot bias: it under-predicts favourites to win (hence points above the 45-degree line nearer to 1) and over-predicts outsiders (points below the line nearer 0). This bias is commonly found amongst bookmaker prices.
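
The t-test on β referred to above can be reproduced from the estimate and standard error reported in the regression output. This is a small worked sketch; the variable names are made up for the example.

```r
beta_hat <- 1.12728   # forecast coefficient from the Mincer-Zarnowitz output
se_beta  <- 0.06460   # its reported standard error
df_resid <- 4784      # residual degrees of freedom from the same output

# Two-sided t-test of the hypothesis beta = 1.
t_beta1 <- (beta_hat - 1) / se_beta
p_beta1 <- 2 * pt(-abs(t_beta1), df = df_resid)
round(c(t = t_beta1, p = p_beta1), 3)   # t is about 1.97
```

The statistic sits very close to the conventional 5% critical value of roughly 1.96, so the conclusion about β is a borderline one.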

List of all forecasts

For transparency, all forecasts are also listed as a table:

kable(forecast.matches[order(forecast.matches$date,forecast.matches$division),
                       c("date","division","team1","outcome","team2")])
| |date|division|team1|outcome|team2|
|:--|:--|:--|:--|--:|:--|
|1|2015-01-30|English Championship|Bournemouth|0.6279495|Watford|
|66|2015-01-31|Conference North|Boston Utd|0.5840113|Stockport|
|72|2015-01-31|Conference North|Stalybridge|0.6576880|Bradford PA|
|58|2015-01-31|Conference South|Sutton Utd|0.7275668|Farnborough|
|64|2015-01-31|Conference South|Eastbourne|0.5530349|Bath City|
|10|2015-01-31|English Championship|Blackpool|0.4433442|Brighton|
|11|2015-01-31|English Championship|Huddersfield|0.5968631|Leeds|
|12|2015-01-31|English Championship|Nottm Forest|0.6344475|Millwall|
|13|2015-01-31|English Championship|Cardiff|0.4264369|Derby|
|14|2015-01-31|English Championship|Blackburn|0.5970344|Fulham|
|15|2015-01-31|English Championship|Reading|0.5624859|Sheff Wed|
|16|2015-01-31|English Championship|Brentford|0.4910053|Middlesbro|
|17|2015-01-31|English Championship|Charlton|0.5651828|Rotherham|
|18|2015-01-31|English Championship|Ipswich|0.7287550|Wigan|
|19|2015-01-31|English Championship|Birmingham|0.5086522|Norwich|
|20|2015-01-31|English Championship|Bolton|0.5816187|Wolves|
|21|2015-01-31|English League One|Bradford|0.6847951|Colchester|
|22|2015-01-31|English League One|Coventry|0.4840699|Rochdale|
|23|2015-01-31|English League One|Crewe|0.4144929|MK Dons|
|24|2015-01-31|English League One|Sheff Utd|0.4984990|Swindon|
|25|2015-01-31|English League One|Crawley|0.4106645|Preston|
|26|2015-01-31|English League One|Oldham|0.5990283|Notts Co|
|27|2015-01-31|English League One|Chesterfield|0.6334792|Doncaster|
|28|2015-01-31|English League One|Leyton Orient|0.5330470|Scunthorpe|
|29|2015-01-31|English League One|Barnsley|0.5891384|Port Vale|
|30|2015-01-31|English League One|Peterborough|0.6091395|Yeovil|
|31|2015-01-31|English League Two|Southend|0.6375144|York|
|32|2015-01-31|English League Two|Wycombe|0.6968401|Portsmouth|
|33|2015-01-31|English League Two|Dag & Red|0.6054083|Cheltenham|
|34|2015-01-31|English League Two|Exeter|0.5762143|Tranmere|
|35|2015-01-31|English League Two|Burton|0.6433355|Bury|
|36|2015-01-31|English League Two|Carlisle|0.5772215|Mansfield|
|37|2015-01-31|English League Two|Stevenage|0.6457155|Oxford|
|38|2015-01-31|English League Two|Luton|0.5846713|Cambridge U|
|39|2015-01-31|English League Two|Newport Co|0.5012429|Shrewsbury|
|40|2015-01-31|English League Two|Accrington|0.5412007|Northampton|
|41|2015-01-31|English League Two|Morecambe|0.5694802|AFC W’bledon|
|42|2015-01-31|English League Two|Hartlepool|0.4174981|Plymouth|
|2|2015-01-31|English Premier|Chelsea|0.6274118|Man City|
|3|2015-01-31|English Premier|Liverpool|0.5916587|West Ham|
|4|2015-01-31|English Premier|Hull|0.5251198|Newcastle|
|5|2015-01-31|English Premier|C Palace|0.6035861|Everton|
|6|2015-01-31|English Premier|Man Utd|0.7089077|Leicester|
|7|2015-01-31|English Premier|Stoke|0.6933286|QPR|
|8|2015-01-31|English Premier|Sunderland|0.5842300|Burnley|
|9|2015-01-31|English Premier|West Brom|0.5017794|Tottenham|
|107|2015-01-31|Evo-Stik S Premier|Weymouth|0.6813063|Histon|
|43|2015-01-31|Football Conference|Altrincham|0.6633193|Aldershot|
|44|2015-01-31|Football Conference|Dartford|0.4038861|Bristol R|
|45|2015-01-31|Football Conference|Braintree|0.5103006|Macclesfield|
|46|2015-01-31|Football Conference|Halifax|0.4962238|Barnet|
|47|2015-01-31|Football Conference|Wrexham|0.5488996|Torquay|
|48|2015-01-31|Football Conference|Forest Green|0.7835021|Nuneaton|
|49|2015-01-31|Football Conference|Lincoln|0.4943357|Dover|
|50|2015-01-31|Football Conference|Grimsby|0.7575580|Telford|
|51|2015-01-31|Football Conference|Woking|0.7206525|Alfreton|
|52|2015-01-31|Football Conference|Welling|0.5550018|Chester|
|53|2015-01-31|Football Conference|Kidderminster|0.5429467|Eastleigh|
|54|2015-01-31|Football Conference|Southport|0.5113787|Gateshead|
|97|2015-01-31|Ryman Premier|Kingstonian|0.5111839|Maidstone|
|113|2015-02-01|English League One|Walsall|0.6254755|Gillingham|
|114|2015-02-01|English League One|Bristol C|0.7055175|Fleetwood|
|111|2015-02-01|English Premier|Arsenal|0.7391717|Aston Villa|
|112|2015-02-01|English Premier|Southampton|0.7055737|Swansea|
|124|2015-02-03|Conference North|Stockport|0.5920790|Barrow|
|140|2015-02-03|English FA Cup|Fulham|0.4669152|Sunderland|
|141|2015-02-03|English FA Cup|Sheff Utd|0.5746091|Preston|
|142|2015-02-03|English FA Cup|Man Utd|0.8570219|Cambridge U|
|138|2015-02-03|English League One|Barnsley|0.5743270|Oldham|
|143|2015-02-03|FA Trophy|Halifax|0.7236927|Dartford|
|144|2015-02-03|FA Trophy|Gateshead|0.6454331|Wrexham|
|145|2015-02-03|FA Trophy|Ebbsfleet|0.4182868|Braintree|
|118|2015-02-03|Football Conference|Dover|0.5826340|Grimsby|
|119|2015-02-03|Football Conference|Alfreton|0.4951449|Lincoln|
|139|2015-02-03|Football Conference|Wrexham|0.5128968|Forest Green|
|149|2015-02-04|English FA Cup|Bolton|0.3915110|Liverpool|
|147|2015-02-04|Ryman Premier|Lewes|0.5321890|Canvey Isl.|