This is hopefully the first in a weekly series of forecasts.
Using data from the very first recorded match in 1877 through to matches this week, collected from http://www.soccerbase.com, I plan to develop a forecasting model to predict match outcomes. This first week’s attempt will undoubtedly be rudimentary and contain many errors, but it is nonetheless important to start making forecasts in order to improve them.
First the data are loaded (they can be accessed here) and the forecasts are constructed (those matches can be found here). The forecasts are then reported. After this, an exercise in ex post forecast accuracy is carried out: for each week of the previous year, the model is re-estimated on all matches played before that week and then used to forecast the matches in it. Finally, a table lists all the matches forecast, together with the forecast outcomes.
The forecasts are now loaded up (the code that creates them involves calculating Elo scores and league tables and hence takes some time to run):
library(knitr)
wd <- "/home/readejj/Dropbox/Teaching/Reading/ec313/2015/Football-forecasts/"
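# Read this week's forecasts (the file name carries today's date)
forecast.matches <- read.csv(paste0(wd,"forecasts_",Sys.Date(),".csv"))
# Keep only the matches for which a forecast was produced
forecast.matches <- forecast.matches[!is.na(forecast.matches$outcome),]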
The forecasts were generated elsewhere using a model based on team Elo scores, league positions (pos), points amassed (pts), goals scored (gs), goal differences (gd), matches played in a year (pld), and recent form (form). All variables are entered with the level for the home team (1), and the difference between the home team and away team (.D, in order to reduce potential collinearity), plus the difference squared (.D.2). A time trend is also added in order to pick up any pattern between years in terms of home advantage. A simple improvement over this would be to include dummies for each year in order to allow non-linear variation in home advantage.
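To make the naming convention concrete, the sketch below builds the three versions of one variable (points) for a few made-up matches. The input columns pts.home and pts.away and the numbers in them are assumptions for illustration only, as is taking the difference as home minus away; the output names pts1, pts.D and pts.D.2 follow the names used in the regression.

```r
# Hypothetical input: points amassed by the home and away team before each match
# (column names and values are illustrative, not taken from the real data set)
matches <- data.frame(pts.home = c(40, 22, 31),
                      pts.away = c(35, 30, 31))

# Level for the home team
matches$pts1 <- matches$pts.home
# Difference between home and away team (assumed here to be home minus away)
matches$pts.D <- matches$pts.home - matches$pts.away
# Squared difference, allowing a non-linear effect of the gap
matches$pts.D.2 <- matches$pts.D^2

print(matches[, c("pts1", "pts.D", "pts.D.2")])
```

The same three steps would be repeated for each of the other variables.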
The model is estimated via Ordinary Least Squares on the discrete dependent variable:

$$y_{it} = \begin{cases} 0 & \text{if away win,} \\ 0.5 & \text{if draw,} \\ 1 & \text{if home win.} \end{cases}$$

The resulting forecasts thus may fall outside the unit interval, and the errors will be heteroskedastic. Nonetheless, it is a reasonable first model; the task of automating data collection so that the estimation period is fully up to date has been an important first step.
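A small simulated example illustrates the first of these caveats. The data below are artificial (a single made-up strength gap x, with outcomes assigned by a simple rule), so only the qualitative point carries over: a straight line fitted to a 0/0.5/1 outcome overshoots the unit interval when the regressor takes extreme values.

```r
# Artificial data: x is a made-up "strength gap" between home and away team
x <- seq(-3, 3, by = 0.1)
# Outcome coded as in the text: 0 = away win, 0.5 = draw, 1 = home win
y <- ifelse(x < -0.5, 0, ifelse(x > 0.5, 1, 0.5))

# Ordinary Least Squares on the discrete outcome
fit <- lm(y ~ x)

# Smallest and largest fitted values; they fall below 0 and above 1
fitted.range <- range(fitted(fit))
print(fitted.range)
```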
The model is estimated here and reported:
res.eng <- read.csv(paste(wd,"historical_",Sys.Date(),".csv",sep=""))
model <- lm(outcome ~ E.1 + pts1 + pts.D + pts.D.2 + pld1 + pld.D + pld.D.2 + gs1 + gs.D + gs.D.2
+ gd1 + gd.D + gd.D.2
+ pos1 + pos.D + pos.D.2 + form1 + form.D + form.D.2 + tier1 + tier.D + tier.D.2 + season.d,
data=res.eng)
summary(model)
##
## Call:
## lm(formula = outcome ~ E.1 + pts1 + pts.D + pts.D.2 + pld1 +
## pld.D + pld.D.2 + gs1 + gs.D + gs.D.2 + gd1 + gd.D + gd.D.2 +
## pos1 + pos.D + pos.D.2 + form1 + form.D + form.D.2 + tier1 +
## tier.D + tier.D.2 + season.d, data = res.eng)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0175 -0.2930 0.1394 0.3497 0.8446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.016e-01 7.143e-03 56.222 < 2e-16 ***
## E.1 4.047e-01 1.098e-02 36.864 < 2e-16 ***
## pts1 1.034e-03 4.309e-04 2.400 0.01639 *
## pts.D -2.852e-03 3.154e-04 -9.043 < 2e-16 ***
## pts.D.2 -1.375e-05 6.576e-06 -2.091 0.03653 *
## pld1 -1.714e-03 6.194e-04 -2.767 0.00565 **
## pld.D 3.385e-03 7.240e-04 4.676 2.93e-06 ***
## pld.D.2 -4.603e-05 3.132e-05 -1.470 0.14158
## gs1 5.092e-04 1.737e-04 2.932 0.00337 **
## gs.D -3.284e-05 1.552e-04 -0.212 0.83238
## gs.D.2 -1.610e-06 4.766e-06 -0.338 0.73556
## gd1 -6.829e-04 2.446e-04 -2.792 0.00524 **
## gd.D 3.427e-03 1.785e-04 19.201 < 2e-16 ***
## gd.D.2 -5.695e-06 2.381e-06 -2.392 0.01674 *
## pos1 7.893e-04 3.053e-04 2.585 0.00973 **
## pos.D -4.206e-04 2.584e-04 -1.628 0.10355
## pos.D.2 3.573e-05 1.189e-05 3.004 0.00267 **
## form1 7.773e-04 3.574e-04 2.175 0.02966 *
## form.D -2.158e-03 3.345e-04 -6.451 1.12e-10 ***
## form.D.2 -7.858e-05 3.045e-05 -2.581 0.00985 **
## tier1 1.978e-03 7.784e-04 2.541 0.01104 *
## tier.D -5.403e-02 3.173e-03 -17.027 < 2e-16 ***
## tier.D.2 -5.891e-03 1.278e-03 -4.612 4.00e-06 ***
## season.d -1.103e-03 3.134e-05 -35.199 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4008 on 215583 degrees of freedom
## (37937 observations deleted due to missingness)
## Multiple R-squared: 0.05675, Adjusted R-squared: 0.05665
## F-statistic: 564 on 23 and 215583 DF, p-value: < 2.2e-16
This week’s forecasts all lie between zero and one, and are effectively a probability of the home team winning. Thus, a forecast below 50% suggests that the away team is more likely to win, while a forecast around 50% implies that a draw is quite a likely result. The disappointing aspect of these forecasts is that there is not a huge amount of variation: the gap between the largest and the smallest forecast is just 0.4655109.
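The range quoted above can be reproduced from the largest and smallest forecasts in the table at the end of this post. With the full forecast.matches data frame loaded, diff(range(forecast.matches$outcome)) should give the same figure; the sketch below simply copies the two values from the table.

```r
# Largest and smallest forecasts, copied from the table of forecasts
largest  <- 0.8570219  # Man Utd v Cambridge U (FA Cup replay)
smallest <- 0.3915110  # Bolton v Liverpool (FA Cup replay)

# Width of the interval spanned by this week's forecasts
forecast.range <- largest - smallest
print(forecast.range)
```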
First, our Premier League forecasts:
prem.matches <- forecast.matches[forecast.matches$division=="English Premier",]
prem.matches$id <- 1:NROW(prem.matches)
par(mar=c(9,4,4,5)+.1)
plot(prem.matches$id,prem.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
main="Forecasts of Weekend Premier League Matches",
ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=prem.matches$id,labels=paste(prem.matches$team1,prem.matches$team2,sep=" v "),las=2,cex.axis=0.65)
Hence Man United, Arsenal, Southampton and Stoke are all expected to gain home wins, with forecasts at or around 0.7, while Chelsea have a 63% probability of overcoming Man City (note that this model does not factor in team news such as the suspension of Diego Costa, or Bojan’s injury for Stoke).
Next, our Championship forecasts:
champ.matches <- forecast.matches[forecast.matches$division=="English Championship",]
champ.matches$id <- 1:NROW(champ.matches)
par(mar=c(9,4,4,5)+.1)
plot(champ.matches$id,champ.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
main="Forecasts of Weekend Championship Matches",
ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=champ.matches$id,labels=paste(champ.matches$team1,champ.matches$team2,sep=" v "),las=2,cex.axis=0.65)
There is a greater range of probabilistic forecasts for the Championship relative to the Premier League, with Blackpool and Cardiff only just above 40% to beat Brighton and Derby, respectively.
Next, our League One forecasts:
lg1.matches <- forecast.matches[forecast.matches$division=="English League One",]
lg1.matches$id <- 1:NROW(lg1.matches)
par(mar=c(9,4,4,5)+.1)
plot(lg1.matches$id,lg1.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
main="Forecasts of Weekend League One Matches",
ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=lg1.matches$id,labels=paste(lg1.matches$team1,lg1.matches$team2,sep=" v "),las=2,cex.axis=0.65)
Next, our League Two forecasts:
lg2.matches <- forecast.matches[forecast.matches$division=="English League Two",]
lg2.matches$id <- 1:NROW(lg2.matches)
par(mar=c(9,4,4,5)+.1)
plot(lg2.matches$id,lg2.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
main="Forecasts of Weekend League Two Matches",
ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=lg2.matches$id,labels=paste(lg2.matches$team1,lg2.matches$team2,sep=" v "),las=2,cex.axis=0.65)
Next, our Football Conference forecasts:
conf.matches <- forecast.matches[forecast.matches$division=="Football Conference",]
conf.matches$id <- 1:NROW(conf.matches)
par(mar=c(9,4,4,5)+.1)
plot(conf.matches$id,conf.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
main="Forecasts of Weekend Football Conference Matches",
ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=conf.matches$id,labels=paste(conf.matches$team1,conf.matches$team2,sep=" v "),las=2,cex.axis=0.65)
Finally, there are a number of FA Cup replays still to be played:
facup.matches <- forecast.matches[forecast.matches$division=="English FA Cup",]
facup.matches$id <- 1:NROW(facup.matches)
par(mar=c(9,4,4,5)+.1)
plot(facup.matches$id,facup.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
main="Forecasts of Weekend English FA Cup Matches",
ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=facup.matches$id,labels=paste(facup.matches$team1,facup.matches$team2,sep=" v "),las=2,cex.axis=0.65)
The likelihood of Man United beating Cambridge in their replay at Old Trafford, at 86%, is the largest probability of a home win in all the forecasts produced for this week. Similarly, Bolton’s 39% chance of beating Liverpool in their replay is the smallest probability of a home victory, showing the impact of the divisional difference variable in the regression model.
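The six plotting chunks above differ only in the division name and the plot title, so they could be collapsed into one helper. The function below is a sketch of that refactoring rather than the code used to produce the figures; plot_division is a hypothetical name, and it assumes the data frame has the columns division, team1, team2 and outcome used above.

```r
# Hypothetical helper: plot the forecasts for one division.
# Returns (invisibly) the rows that were plotted.
plot_division <- function(matches, division, label = division) {
  div.matches <- matches[matches$division == division, ]
  if (NROW(div.matches) == 0) return(invisible(div.matches))  # nothing to plot
  div.matches$id <- seq_len(NROW(div.matches))
  # Wide bottom margin for the rotated match labels; restored on exit
  old.par <- par(mar = c(9, 4, 4, 5) + .1)
  on.exit(par(old.par))
  plot(div.matches$id, div.matches$outcome, xaxt = "n", xlab = "", ylim = c(0, 1),
       main = paste("Forecasts of Weekend", label, "Matches"),
       ylab = "Probability of Outcome")
  # Reference lines at 0.5, 0.6 and 0.7, as in the plots above
  abline(h = 0.5, lty = 2)
  abline(h = 0.6, lty = 3)
  abline(h = 0.7, lty = 2)
  axis(1, at = div.matches$id,
       labels = paste(div.matches$team1, div.matches$team2, sep = " v "),
       las = 2, cex.axis = 0.65)
  invisible(div.matches)
}

# Three rows copied from the table of forecasts, for trying the helper out
example.matches <- data.frame(
  division = c("English Premier", "English Premier", "English FA Cup"),
  team1 = c("Chelsea", "Man Utd", "Bolton"),
  team2 = c("Man City", "Leicester", "Liverpool"),
  outcome = c(0.6274118, 0.7089077, 0.3915110))
```

Calling plot_division(example.matches, "English Premier", label = "Premier League") would then draw the Premier League plot for these rows, and the same call with forecast.matches in place of example.matches would cover the full fixture list.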
In this section, we run regressions over the previous calendar year, week-by-week, and consider the quality of these forecasts against actual outcomes. We use a Mincer-Zarnowitz regression to do so, namely $y_{it} = \alpha + \beta \hat{y}_{it} + e_{it}$, so we regress outcomes on forecasts. The test of forecast accuracy is that $\alpha = 0$ and $\beta = 1$, namely that on average, or in expectation, our forecasts are equal to outcomes, and there is no bias.
# Make sure dates are Date objects so that the comparisons below work as intended
res.eng$date <- as.Date(res.eng$date)
# Weekly test windows covering the previous year; each window runs for seven days
test.start <- seq(Sys.Date()-365,Sys.Date(),by="weeks")
test.end <- test.start+6
test.outcomes <- data.frame()
for(i in seq_along(test.start)) {
  # Estimate on all matches before the window, then forecast the matches inside it
  training.data <- res.eng[res.eng$date<test.start[i],]
  test.data <- res.eng[res.eng$date>=test.start[i] & res.eng$date<=test.end[i],]
  if(NROW(test.data)>0){
    model <- lm(outcome ~ E.1 + pts1 + pts.D + pts.D.2 + pld1 + pld.D + pld.D.2 + gs1 + gs.D + gs.D.2
                + gd1 + gd.D + gd.D.2
                + pos1 + pos.D + pos.D.2 + form1 + form.D + form.D.2 + tier1 + tier.D + tier.D.2 + season.d,
                data=training.data)
    # Forecast by hand: regressors (plus a column of ones for the intercept) times coefficients
    test.data$"(Intercept)" <- 1
    test.data$forecast <- as.matrix(test.data[,variable.names(model)]) %*% as.numeric(model$coefficients)
    test.outcomes <- rbind(test.outcomes,test.data[,c("match_id","team1","outcome","team2","forecast")])
  }
}
#mincer-zarnowitz regression
mz <- lm(outcome ~ forecast,data=test.outcomes)
summary(mz)
##
## Call:
## lm(formula = outcome ~ forecast, data = test.outcomes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.86331 -0.44177 0.00247 0.38784 0.86931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08919 0.03815 -2.338 0.0194 *
## forecast 1.12728 0.06460 17.450 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4144 on 4784 degrees of freedom
## (4093 observations deleted due to missingness)
## Multiple R-squared: 0.05984, Adjusted R-squared: 0.05964
## F-statistic: 304.5 on 1 and 4784 DF, p-value: < 2.2e-16
calib <- aggregate(test.outcomes$outcome,by=list(round(test.outcomes$forecast,2)),FUN=mean,na.rm=T)
plot(calib$Group.1,calib$x,xlim=range(0,1),ylim=range(0,1),main="Calibration of Forecasts, Graphically",
ylab="% of time match forecast turned out as home win",xlab="Forecast probability of home win")
abline(0,1)
The regression summary suggests that the model is not particularly bad. The α coefficient is significantly different from zero at the 5% level but not at the 1% level (p = 0.019), and a t-test of β = 1 gives (1.12728 − 1)/0.06460 ≈ 1.97, which sits right at the 5% critical value, so β is at most marginally different from 1. Similarly, the calibration plot, which plots the frequency with which matches forecast in particular intervals turn out as predicted, suggests that the model is reasonably accurate. In the plot, we would hope to find the scatter points around the 45-degree line, since that line represents matches forecast to end as a home win with a probability of x% turning out as home wins x% of the time. What we actually find is that our model exhibits favourite-longshot bias: it under-predicts favourites to win (hence points above the 45-degree line nearer to 1) and over-predicts outsiders (points below the line nearer 0). This bias is commonly found amongst bookmaker prices.
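The t-test of β = 1 mentioned above is not printed by summary(), which only tests each coefficient against zero. It can be computed from the reported estimate and standard error. The sketch below does this with the numbers from the Mincer-Zarnowitz output, and adds a hypothetical helper, mz_joint_test, that tests α = 0 and β = 1 jointly by comparing the fitted regression with the restricted model in which the forecast is taken at face value. The helper is an illustration and not part of the original analysis.

```r
# t-test of beta = 1, using the estimate, standard error and residual
# degrees of freedom reported in the Mincer-Zarnowitz summary
beta.hat <- 1.12728
beta.se  <- 0.06460
df.resid <- 4784

t.beta <- (beta.hat - 1) / beta.se
p.beta <- 2 * pt(-abs(t.beta), df = df.resid)  # two-sided p-value
print(c(t = t.beta, p = p.beta))

# Hypothetical helper: joint F-test of alpha = 0 and beta = 1.
# Under the null the residual is simply outcome - forecast, so the
# restricted residual sum of squares needs no estimation.
mz_joint_test <- function(outcome, forecast) {
  ok <- !is.na(outcome) & !is.na(forecast)  # drop matches with missing values
  outcome <- outcome[ok]
  forecast <- forecast[ok]
  fit <- lm(outcome ~ forecast)
  rss.unrestricted <- sum(residuals(fit)^2)
  rss.restricted <- sum((outcome - forecast)^2)
  q <- 2  # number of restrictions
  f.stat <- ((rss.restricted - rss.unrestricted) / q) /
    (rss.unrestricted / df.residual(fit))
  p.value <- pf(f.stat, q, df.residual(fit), lower.tail = FALSE)
  c(F = f.stat, p = p.value)
}
```

With the objects created above, mz_joint_test(test.outcomes$outcome, test.outcomes$forecast) would run the joint test on the out-of-sample forecasts.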
For transparency, all forecasts are also listed as a table:
kable(forecast.matches[order(forecast.matches$date,forecast.matches$division),
c("date","division","team1","outcome","team2")])
row | date | division | team1 | outcome | team2 |
---|---|---|---|---|---|
1 | 2015-01-30 | English Championship | Bournemouth | 0.6279495 | Watford |
66 | 2015-01-31 | Conference North | Boston Utd | 0.5840113 | Stockport |
72 | 2015-01-31 | Conference North | Stalybridge | 0.6576880 | Bradford PA |
58 | 2015-01-31 | Conference South | Sutton Utd | 0.7275668 | Farnborough |
64 | 2015-01-31 | Conference South | Eastbourne | 0.5530349 | Bath City |
10 | 2015-01-31 | English Championship | Blackpool | 0.4433442 | Brighton |
11 | 2015-01-31 | English Championship | Huddersfield | 0.5968631 | Leeds |
12 | 2015-01-31 | English Championship | Nottm Forest | 0.6344475 | Millwall |
13 | 2015-01-31 | English Championship | Cardiff | 0.4264369 | Derby |
14 | 2015-01-31 | English Championship | Blackburn | 0.5970344 | Fulham |
15 | 2015-01-31 | English Championship | Reading | 0.5624859 | Sheff Wed |
16 | 2015-01-31 | English Championship | Brentford | 0.4910053 | Middlesbro |
17 | 2015-01-31 | English Championship | Charlton | 0.5651828 | Rotherham |
18 | 2015-01-31 | English Championship | Ipswich | 0.7287550 | Wigan |
19 | 2015-01-31 | English Championship | Birmingham | 0.5086522 | Norwich |
20 | 2015-01-31 | English Championship | Bolton | 0.5816187 | Wolves |
21 | 2015-01-31 | English League One | Bradford | 0.6847951 | Colchester |
22 | 2015-01-31 | English League One | Coventry | 0.4840699 | Rochdale |
23 | 2015-01-31 | English League One | Crewe | 0.4144929 | MK Dons |
24 | 2015-01-31 | English League One | Sheff Utd | 0.4984990 | Swindon |
25 | 2015-01-31 | English League One | Crawley | 0.4106645 | Preston |
26 | 2015-01-31 | English League One | Oldham | 0.5990283 | Notts Co |
27 | 2015-01-31 | English League One | Chesterfield | 0.6334792 | Doncaster |
28 | 2015-01-31 | English League One | Leyton Orient | 0.5330470 | Scunthorpe |
29 | 2015-01-31 | English League One | Barnsley | 0.5891384 | Port Vale |
30 | 2015-01-31 | English League One | Peterborough | 0.6091395 | Yeovil |
31 | 2015-01-31 | English League Two | Southend | 0.6375144 | York |
32 | 2015-01-31 | English League Two | Wycombe | 0.6968401 | Portsmouth |
33 | 2015-01-31 | English League Two | Dag & Red | 0.6054083 | Cheltenham |
34 | 2015-01-31 | English League Two | Exeter | 0.5762143 | Tranmere |
35 | 2015-01-31 | English League Two | Burton | 0.6433355 | Bury |
36 | 2015-01-31 | English League Two | Carlisle | 0.5772215 | Mansfield |
37 | 2015-01-31 | English League Two | Stevenage | 0.6457155 | Oxford |
38 | 2015-01-31 | English League Two | Luton | 0.5846713 | Cambridge U |
39 | 2015-01-31 | English League Two | Newport Co | 0.5012429 | Shrewsbury |
40 | 2015-01-31 | English League Two | Accrington | 0.5412007 | Northampton |
41 | 2015-01-31 | English League Two | Morecambe | 0.5694802 | AFC W’bledon |
42 | 2015-01-31 | English League Two | Hartlepool | 0.4174981 | Plymouth |
2 | 2015-01-31 | English Premier | Chelsea | 0.6274118 | Man City |
3 | 2015-01-31 | English Premier | Liverpool | 0.5916587 | West Ham |
4 | 2015-01-31 | English Premier | Hull | 0.5251198 | Newcastle |
5 | 2015-01-31 | English Premier | C Palace | 0.6035861 | Everton |
6 | 2015-01-31 | English Premier | Man Utd | 0.7089077 | Leicester |
7 | 2015-01-31 | English Premier | Stoke | 0.6933286 | QPR |
8 | 2015-01-31 | English Premier | Sunderland | 0.5842300 | Burnley |
9 | 2015-01-31 | English Premier | West Brom | 0.5017794 | Tottenham |
107 | 2015-01-31 | Evo-Stik S Premier | Weymouth | 0.6813063 | Histon |
43 | 2015-01-31 | Football Conference | Altrincham | 0.6633193 | Aldershot |
44 | 2015-01-31 | Football Conference | Dartford | 0.4038861 | Bristol R |
45 | 2015-01-31 | Football Conference | Braintree | 0.5103006 | Macclesfield |
46 | 2015-01-31 | Football Conference | Halifax | 0.4962238 | Barnet |
47 | 2015-01-31 | Football Conference | Wrexham | 0.5488996 | Torquay |
48 | 2015-01-31 | Football Conference | Forest Green | 0.7835021 | Nuneaton |
49 | 2015-01-31 | Football Conference | Lincoln | 0.4943357 | Dover |
50 | 2015-01-31 | Football Conference | Grimsby | 0.7575580 | Telford |
51 | 2015-01-31 | Football Conference | Woking | 0.7206525 | Alfreton |
52 | 2015-01-31 | Football Conference | Welling | 0.5550018 | Chester |
53 | 2015-01-31 | Football Conference | Kidderminster | 0.5429467 | Eastleigh |
54 | 2015-01-31 | Football Conference | Southport | 0.5113787 | Gateshead |
97 | 2015-01-31 | Ryman Premier | Kingstonian | 0.5111839 | Maidstone |
113 | 2015-02-01 | English League One | Walsall | 0.6254755 | Gillingham |
114 | 2015-02-01 | English League One | Bristol C | 0.7055175 | Fleetwood |
111 | 2015-02-01 | English Premier | Arsenal | 0.7391717 | Aston Villa |
112 | 2015-02-01 | English Premier | Southampton | 0.7055737 | Swansea |
124 | 2015-02-03 | Conference North | Stockport | 0.5920790 | Barrow |
140 | 2015-02-03 | English FA Cup | Fulham | 0.4669152 | Sunderland |
141 | 2015-02-03 | English FA Cup | Sheff Utd | 0.5746091 | Preston |
142 | 2015-02-03 | English FA Cup | Man Utd | 0.8570219 | Cambridge U |
138 | 2015-02-03 | English League One | Barnsley | 0.5743270 | Oldham |
143 | 2015-02-03 | FA Trophy | Halifax | 0.7236927 | Dartford |
144 | 2015-02-03 | FA Trophy | Gateshead | 0.6454331 | Wrexham |
145 | 2015-02-03 | FA Trophy | Ebbsfleet | 0.4182868 | Braintree |
118 | 2015-02-03 | Football Conference | Dover | 0.5826340 | Grimsby |
119 | 2015-02-03 | Football Conference | Alfreton | 0.4951449 | Lincoln |
139 | 2015-02-03 | Football Conference | Wrexham | 0.5128968 | Forest Green |
149 | 2015-02-04 | English FA Cup | Bolton | 0.3915110 | Liverpool |
147 | 2015-02-04 | Ryman Premier | Lewes | 0.5321890 | Canvey Isl. |