Introduction

This is hopefully the first in a weekly series of forecasts.

Using data from the very first recorded match in 1877 through to matches this week, collected from http://www.socerbase.com, I plan to develop a forecasting model to predict match outcomes. This first week’s attempt will undoubtedly be rudimentary, and contain many errors, but nonetheless it is important to start making forecasts, in order to improve them.

First the data is loaded (it can be accessed here), then the forecasts are constructed (those matches can be found here). The forecasts are then reported. After this, an exercise in ex post forecast accuracy is carried out: the model is re-estimated each week over the previous year and used to forecast matches in the subsequent week. Finally, a table is presented with details on all the matches forecast and the forecast outcomes.

Loading the Data

The forecasts are now loaded up (the code that creates them involves calculating Elo scores and league tables and hence takes some time to run):

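library(knitr)
# Directory holding the forecast and historical CSV files
wd <- "/home/readejj/Dropbox/Teaching/Reading/ec313/2015/Football-forecasts/"
# Load today's forecast file and keep only the matches that have a forecast
forecast.matches <- read.csv(paste0(wd,"forecasts_",Sys.Date(),".csv"))
forecast.matches <- forecast.matches[!is.na(forecast.matches$outcome),]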

The Forecast Model

The forecasts were generated elsewhere using a model based on team Elo scores (E), league positions (pos), points amassed (pts), goals scored (gs), goal differences (gd), matches played in a year (pld), recent form (form) and division tier (tier). Apart from the Elo term, which enters only as E.1, each variable is entered as the level for the home team (1), the difference between the home team and away team (.D, in order to reduce potential collinearity), and that difference squared (.D.2). A time trend (season.d) is also added in order to pick up any pattern between years in terms of home advantage. A simple improvement over this would be to include dummies for each year in order to allow non-linear variation in home advantage.
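As a rough illustration of how the level, difference and squared-difference terms might be built for one variable, consider the sketch below. The data frame `toy` and its columns `pts.home` and `pts.away` are hypothetical, and the home-minus-away sign convention is an assumption; the actual data set is constructed elsewhere and may differ.

```r
# Hypothetical toy data: points amassed by the home and away teams before each match
toy <- data.frame(pts.home = c(40, 25, 31),
                  pts.away = c(22, 25, 45))

# Level for the home team, the home-minus-away difference (assumed sign
# convention), and its square, following the pts1 / pts.D / pts.D.2 naming
toy$pts1    <- toy$pts.home
toy$pts.D   <- toy$pts.home - toy$pts.away
toy$pts.D.2 <- toy$pts.D^2

print(toy[, c("pts1", "pts.D", "pts.D.2")])
```

The same three-column pattern would then be repeated for the other variables that enter the regression.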

The model is estimated via Ordinary Least Squares on the discrete dependent variable: $y_{it} = \left\{\begin{array}{lll}0 && \text{if away win,}\\ 0.5 && \text{if draw,}\\ 1 && \text{if home win.}\end{array}\right.$ The resulting forecasts thus may fall outside the unit interval, and the errors will be heteroskedastic. Nonetheless, it is a reasonable first model; the task of automating data collection so that the estimation period is fully up to date has been an important first step.
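One way to read the fitted values under this coding is through a standard expectation calculation, which is not specific to this model. Writing $p^{H}_{it}$, $p^{D}_{it}$ and $p^{A}_{it}$ for the probabilities of a home win, a draw and an away win (notation introduced here for illustration), $E[y_{it}] = 1 \cdot p^{H}_{it} + 0.5 \cdot p^{D}_{it} + 0 \cdot p^{A}_{it} = p^{H}_{it} + 0.5\, p^{D}_{it},$ so a fitted value is the expected home share of the result and sits above the home-win probability whenever a draw is possible. For example, $p^{H} = 0.45$, $p^{D} = 0.25$ and $p^{A} = 0.30$ give $E[y] = 0.45 + 0.125 = 0.575$.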

The model is estimated here and reported:

res.eng <- read.csv(paste(wd,"historical_",Sys.Date(),".csv",sep=""))
model <- lm(outcome ~ E.1 + pts1 + pts.D + pts.D.2 + pld1 + pld.D + pld.D.2 + gs1 + gs.D + gs.D.2 
            + gd1 + gd.D + gd.D.2 
            + pos1 + pos.D + pos.D.2 + form1 + form.D + form.D.2 + tier1 + tier.D + tier.D.2 + season.d,
            data=res.eng)
summary(model)

## 
## Call:
## lm(formula = outcome ~ E.1 + pts1 + pts.D + pts.D.2 + pld1 + 
##     pld.D + pld.D.2 + gs1 + gs.D + gs.D.2 + gd1 + gd.D + gd.D.2 + 
##     pos1 + pos.D + pos.D.2 + form1 + form.D + form.D.2 + tier1 + 
##     tier.D + tier.D.2 + season.d, data = res.eng)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0175 -0.2930  0.1394  0.3497  0.8446 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.016e-01  7.143e-03  56.222  < 2e-16 ***
## E.1          4.047e-01  1.098e-02  36.864  < 2e-16 ***
## pts1         1.034e-03  4.309e-04   2.400  0.01639 *  
## pts.D       -2.852e-03  3.154e-04  -9.043  < 2e-16 ***
## pts.D.2     -1.375e-05  6.576e-06  -2.091  0.03653 *  
## pld1        -1.714e-03  6.194e-04  -2.767  0.00565 ** 
## pld.D        3.385e-03  7.240e-04   4.676 2.93e-06 ***
## pld.D.2     -4.603e-05  3.132e-05  -1.470  0.14158    
## gs1          5.092e-04  1.737e-04   2.932  0.00337 ** 
## gs.D        -3.284e-05  1.552e-04  -0.212  0.83238    
## gs.D.2      -1.610e-06  4.766e-06  -0.338  0.73556    
## gd1         -6.829e-04  2.446e-04  -2.792  0.00524 ** 
## gd.D         3.427e-03  1.785e-04  19.201  < 2e-16 ***
## gd.D.2      -5.695e-06  2.381e-06  -2.392  0.01674 *  
## pos1         7.893e-04  3.053e-04   2.585  0.00973 ** 
## pos.D       -4.206e-04  2.584e-04  -1.628  0.10355    
## pos.D.2      3.573e-05  1.189e-05   3.004  0.00267 ** 
## form1        7.773e-04  3.574e-04   2.175  0.02966 *  
## form.D      -2.158e-03  3.345e-04  -6.451 1.12e-10 ***
## form.D.2    -7.858e-05  3.045e-05  -2.581  0.00985 ** 
## tier1        1.978e-03  7.784e-04   2.541  0.01104 *  
## tier.D      -5.403e-02  3.173e-03 -17.027  < 2e-16 ***
## tier.D.2    -5.891e-03  1.278e-03  -4.612 4.00e-06 ***
## season.d    -1.103e-03  3.134e-05 -35.199  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4008 on 215583 degrees of freedom
##   (37937 observations deleted due to missingness)
## Multiple R-squared:  0.05675,    Adjusted R-squared:  0.05665 
## F-statistic:   564 on 23 and 215583 DF,  p-value: < 2.2e-16

This week's forecasts all lie between zero and one, and are effectively a probability of the home team winning. Thus, a forecast below 50% suggests that the away team is more likely to win, while a forecast around 50% implies that a draw is quite a likely result. The disappointing aspect of these forecasts is that there is not a huge amount of variation: the entire range, from the smallest forecast to the largest, is 0.4655109.
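The range quoted above can be checked with a one-liner. The sketch below uses only the largest and smallest forecasts, copied from the table at the end of this document, in a stand-in vector `fc` (a hypothetical name).

```r
# Largest and smallest forecasts this week, copied from the forecast table
# (Man Utd v Cambridge U and Bolton v Liverpool); illustrative vector only
fc <- c(0.8570219, 0.3915110)

# Spread between the most and least confident home-win forecasts
spread <- diff(range(fc))
print(spread)
```

With the full data frame loaded, the equivalent call would be `diff(range(forecast.matches$outcome))`.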

The Forecasts

First, our Premier League forecasts:

prem.matches <- forecast.matches[forecast.matches$division=="English Premier",]
prem.matches$id <- 1:NROW(prem.matches)
par(mar=c(9,4,4,5)+.1)
plot(prem.matches$id,prem.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend Premier League Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=prem.matches$id,labels=paste(prem.matches$team1,prem.matches$team2,sep=" v "),las=2,cex.axis=0.65)

Hence Man United, Arsenal, Southampton and Stoke are all expected to gain home wins, with forecasts at or around 0.7, while Chelsea have a 63% probability of overcoming Man City (note that this model does not factor in team news such as the suspension of Diego Costa, or Stoke's loss of Bojan to injury).

Next, our Championship forecasts:

champ.matches <- forecast.matches[forecast.matches$division=="English Championship",]
champ.matches$id <- 1:NROW(champ.matches)
par(mar=c(9,4,4,5)+.1)
plot(champ.matches$id,champ.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend Championship Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=champ.matches$id,labels=paste(champ.matches$team1,champ.matches$team2,sep=" v "),las=2,cex.axis=0.65)

There is a greater range of probabilistic forecasts for the Championship than for the Premier League, with Blackpool and Cardiff only just above 40% to beat Brighton and Derby, respectively.

Next, our League One forecasts:

lg1.matches <- forecast.matches[forecast.matches$division=="English League One",]
lg1.matches$id <- 1:NROW(lg1.matches)
par(mar=c(9,4,4,5)+.1)
plot(lg1.matches$id,lg1.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend League One Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=lg1.matches$id,labels=paste(lg1.matches$team1,lg1.matches$team2,sep=" v "),las=2,cex.axis=0.65)

Next, our League Two forecasts:

lg2.matches <- forecast.matches[forecast.matches$division=="English League Two",]
lg2.matches$id <- 1:NROW(lg2.matches)
par(mar=c(9,4,4,5)+.1)
plot(lg2.matches$id,lg2.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend League Two Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=lg2.matches$id,labels=paste(lg2.matches$team1,lg2.matches$team2,sep=" v "),las=2,cex.axis=0.65)

Next, our Football Conference forecasts:

conf.matches <- forecast.matches[forecast.matches$division=="Football Conference",]
conf.matches$id <- 1:NROW(conf.matches)
par(mar=c(9,4,4,5)+.1)
plot(conf.matches$id,conf.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend Football Conference Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=conf.matches$id,labels=paste(conf.matches$team1,conf.matches$team2,sep=" v "),las=2,cex.axis=0.65)

Finally, there are a number of FA Cup replays still to be played:

facup.matches <- forecast.matches[forecast.matches$division=="English FA Cup",]
facup.matches$id <- 1:NROW(facup.matches)
par(mar=c(9,4,4,5)+.1)
plot(facup.matches$id,facup.matches$outcome,xaxt="n",xlab="",ylim=range(0,1),
     main="Forecasts of Weekend English FA Cup Matches",
     ylab="Probability of Outcome")
abline(h=0.5,lty=2)
abline(h=0.6,lty=3)
abline(h=0.7,lty=2)
axis(1,at=facup.matches$id,labels=paste(facup.matches$team1,facup.matches$team2,sep=" v "),las=2,cex.axis=0.65)

The likelihood of Man United beating Cambridge in their replay at Old Trafford, at 86%, is the largest probability of a home win among all the forecasts produced for this week. At the other extreme, Bolton's chance of beating Liverpool in their replay, at 39%, is the smallest probability of a home victory, showing the impact of the divisional difference variable in the regression model.
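The plotting chunks above differ only in the division name and the title, so they could be wrapped in a small helper. The following is a sketch only: the function name `plot_division` and the toy data frame `toy.matches` are hypothetical, with the three fixtures and their forecasts copied from the table at the end of this document.

```r
# Sketch of a helper that draws the forecast plot for one division.
# Assumes a data frame with columns division, team1, team2 and outcome,
# as in forecast.matches; the name plot_division is hypothetical.
plot_division <- function(matches, division, title) {
  sub <- matches[matches$division == division, ]
  if (NROW(sub) == 0) return(invisible(0))    # nothing to plot
  sub$id <- seq_len(NROW(sub))
  old.par <- par(mar = c(9, 4, 4, 5) + 0.1)   # room for rotated fixture labels
  on.exit(par(old.par))                       # restore margins afterwards
  plot(sub$id, sub$outcome, xaxt = "n", xlab = "", ylim = c(0, 1),
       main = title, ylab = "Probability of Outcome")
  abline(h = c(0.5, 0.6, 0.7), lty = c(2, 3, 2))
  axis(1, at = sub$id, labels = paste(sub$team1, sub$team2, sep = " v "),
       las = 2, cex.axis = 0.65)
  invisible(NROW(sub))                        # number of matches plotted
}

# Toy stand-in for forecast.matches, using three rows from the forecast table
toy.matches <- data.frame(
  division = c("English FA Cup", "English FA Cup", "English Premier"),
  team1    = c("Man Utd", "Bolton", "Chelsea"),
  team2    = c("Cambridge U", "Liverpool", "Man City"),
  outcome  = c(0.8570219, 0.3915110, 0.6274118)
)
n.plotted <- plot_division(toy.matches, "English FA Cup",
                           "Forecasts of English FA Cup Matches")
```

Returning the number of matches plotted invisibly makes it easy to notice a division name that matches no rows.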

Training forecasts and Mincer-Zarnowitz Testing

In this section, we step through the previous calendar year week by week: for each week, the model is re-estimated on all matches played before that week and used to forecast that week's matches, and the quality of these forecasts is then assessed against actual outcomes. We use a Mincer-Zarnowitz regression to do so, namely: $y_{it} = \alpha + \beta \widehat{y}_{it} + e_{it},$ so we regress outcomes on forecasts. The test of forecast accuracy is that $\alpha=0$ and $\beta=1$, namely that on average, or in expectation, our forecasts are equal to outcomes, and there is no bias.

# Make sure dates are Date objects so the comparisons below are chronological
res.eng$date <- as.Date(res.eng$date)
# Weekly test windows covering the previous 365 days
test.start <- seq(Sys.Date()-365,Sys.Date(),by="weeks")
test.end <- test.start+6
test.outcomes <- data.frame()
for(i in seq_along(test.start)) {
  # Train on every match before the window, test on the matches inside it
  training.data <- res.eng[res.eng$date<test.start[i],]
  test.data <- res.eng[res.eng$date>=test.start[i] & res.eng$date<=test.end[i],]
  if(NROW(test.data)>0){
    model <- lm(outcome ~ E.1 + pts1 + pts.D + pts.D.2 + pld1 + pld.D + pld.D.2 + gs1 + gs.D + gs.D.2 
                + gd1 + gd.D + gd.D.2 
                + pos1 + pos.D + pos.D.2 + form1 + form.D + form.D.2 + tier1 + tier.D + tier.D.2 + season.d,
                data=training.data)
    # Out-of-sample forecasts; rows with missing regressors give NA
    test.data$forecast <- predict(model,newdata=test.data)
    test.outcomes <- rbind(test.outcomes,test.data[,c("match_id","team1","outcome","team2","forecast")])  
  }
}
#mincer-zarnowitz regression
mz <- lm(outcome ~ forecast,data=test.outcomes)
summary(mz)

## 
## Call:
## lm(formula = outcome ~ forecast, data = test.outcomes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86331 -0.44177  0.00247  0.38784  0.86931 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.08919    0.03815  -2.338   0.0194 *  
## forecast     1.12728    0.06460  17.450   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4144 on 4784 degrees of freedom
##   (4093 observations deleted due to missingness)
## Multiple R-squared:  0.05984,    Adjusted R-squared:  0.05964 
## F-statistic: 304.5 on 1 and 4784 DF,  p-value: < 2.2e-16

calib <- aggregate(test.outcomes$outcome,by=list(round(test.outcomes$forecast,2)),FUN=mean,na.rm=T)
plot(calib$Group.1,calib$x,xlim=range(0,1),ylim=range(0,1),main="Calibration of Forecasts, Graphically",
     ylab="% of time match forecast turned out as home win",xlab="Forecast probability of home win")
abline(0,1)

The regression summary suggests that the model is not particularly bad. The $\alpha$ coefficient is only marginally significantly different from zero (p-value of about 0.02), and a t-test of $\beta=1$ gives $t = (1.127 - 1)/0.065 \approx 1.97$, which sits right at the 5% critical value, so the evidence against $\beta=1$ is borderline at best. Similarly, the calibration plot, which plots how often matches forecast at a particular probability turn out as predicted, suggests that the model is reasonably accurate. In the plot, we would hope to find the scatter points around the 45-degree line, since that line represents matches forecast to end as a home win with a probability of $x$% turning out as home wins $x$% of the time. What we find is that our model actually exhibits favourite-longshot bias: it under-predicts favourites to win (hence points above the 45-degree line nearer to 1) and over-predicts outsiders (points below the line nearer 0). This bias is commonly found amongst bookmaker prices.
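The `summary()` output tests $\beta=0$ rather than $\beta=1$, so the relevant statistics have to be computed separately. The sketch below shows one way to do this, including a joint Wald test of $\alpha=0$ and $\beta=1$. It runs on simulated data (the data frame `sim` is a hypothetical stand-in for `test.outcomes`), and it uses the ordinary OLS covariance matrix, which is an assumption given the heteroskedasticity noted earlier.

```r
set.seed(1)
# Simulated stand-in for test.outcomes (hypothetical data, not football results):
# outcomes are drawn so that the forecasts are unbiased by construction
n <- 500
sim <- data.frame(forecast = runif(n, 0.3, 0.8))
sim$outcome <- rbinom(n, 1, sim$forecast)

mz.sim <- lm(outcome ~ forecast, data = sim)
est <- coef(summary(mz.sim))

# t-test of beta = 1 (summary() reports the test of beta = 0 instead)
t.beta <- (est["forecast", "Estimate"] - 1) / est["forecast", "Std. Error"]
p.beta <- 2 * pt(-abs(t.beta), df.residual(mz.sim))

# Joint Wald test of alpha = 0 and beta = 1, expressed as an F statistic
r <- coef(mz.sim) - c(0, 1)
F.joint <- as.numeric(t(r) %*% solve(vcov(mz.sim)) %*% r) / 2
p.joint <- pf(F.joint, 2, df.residual(mz.sim), lower.tail = FALSE)

# The same t statistic from the coefficients reported in this document
t.reported <- (1.12728 - 1) / 0.06460   # about 1.97

print(c(t.beta = t.beta, p.beta = p.beta, F.joint = F.joint, p.joint = p.joint))
```

Replacing `sim` with `test.outcomes` would apply the same tests to the actual forecasts. A heteroskedasticity-robust covariance matrix could be swapped in for `vcov(mz.sim)` if preferred.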

List of all forecasts

For transparency, all forecasts are also listed as a table:

kable(forecast.matches[order(forecast.matches$date,forecast.matches$division),
                       c("date","division","team1","outcome","team2")])

	date	division	team1	outcome	team2
1	2015-01-30	English Championship	Bournemouth	0.6279495	Watford
66	2015-01-31	Conference North	Boston Utd	0.5840113	Stockport
72	2015-01-31	Conference North	Stalybridge	0.6576880	Bradford PA
58	2015-01-31	Conference South	Sutton Utd	0.7275668	Farnborough
64	2015-01-31	Conference South	Eastbourne	0.5530349	Bath City
10	2015-01-31	English Championship	Blackpool	0.4433442	Brighton
11	2015-01-31	English Championship	Huddersfield	0.5968631	Leeds
12	2015-01-31	English Championship	Nottm Forest	0.6344475	Millwall
13	2015-01-31	English Championship	Cardiff	0.4264369	Derby
14	2015-01-31	English Championship	Blackburn	0.5970344	Fulham
15	2015-01-31	English Championship	Reading	0.5624859	Sheff Wed
16	2015-01-31	English Championship	Brentford	0.4910053	Middlesbro
17	2015-01-31	English Championship	Charlton	0.5651828	Rotherham
18	2015-01-31	English Championship	Ipswich	0.7287550	Wigan
19	2015-01-31	English Championship	Birmingham	0.5086522	Norwich
20	2015-01-31	English Championship	Bolton	0.5816187	Wolves
21	2015-01-31	English League One	Bradford	0.6847951	Colchester
22	2015-01-31	English League One	Coventry	0.4840699	Rochdale
23	2015-01-31	English League One	Crewe	0.4144929	MK Dons
24	2015-01-31	English League One	Sheff Utd	0.4984990	Swindon
25	2015-01-31	English League One	Crawley	0.4106645	Preston
26	2015-01-31	English League One	Oldham	0.5990283	Notts Co
27	2015-01-31	English League One	Chesterfield	0.6334792	Doncaster
28	2015-01-31	English League One	Leyton Orient	0.5330470	Scunthorpe
29	2015-01-31	English League One	Barnsley	0.5891384	Port Vale
30	2015-01-31	English League One	Peterborough	0.6091395	Yeovil
31	2015-01-31	English League Two	Southend	0.6375144	York
32	2015-01-31	English League Two	Wycombe	0.6968401	Portsmouth
33	2015-01-31	English League Two	Dag & Red	0.6054083	Cheltenham
34	2015-01-31	English League Two	Exeter	0.5762143	Tranmere
35	2015-01-31	English League Two	Burton	0.6433355	Bury
36	2015-01-31	English League Two	Carlisle	0.5772215	Mansfield
37	2015-01-31	English League Two	Stevenage	0.6457155	Oxford
38	2015-01-31	English League Two	Luton	0.5846713	Cambridge U
39	2015-01-31	English League Two	Newport Co	0.5012429	Shrewsbury
40	2015-01-31	English League Two	Accrington	0.5412007	Northampton
41	2015-01-31	English League Two	Morecambe	0.5694802	AFC W’bledon
42	2015-01-31	English League Two	Hartlepool	0.4174981	Plymouth
2	2015-01-31	English Premier	Chelsea	0.6274118	Man City
3	2015-01-31	English Premier	Liverpool	0.5916587	West Ham
4	2015-01-31	English Premier	Hull	0.5251198	Newcastle
5	2015-01-31	English Premier	C Palace	0.6035861	Everton
6	2015-01-31	English Premier	Man Utd	0.7089077	Leicester
7	2015-01-31	English Premier	Stoke	0.6933286	QPR
8	2015-01-31	English Premier	Sunderland	0.5842300	Burnley
9	2015-01-31	English Premier	West Brom	0.5017794	Tottenham
107	2015-01-31	Evo-Stik S Premier	Weymouth	0.6813063	Histon
43	2015-01-31	Football Conference	Altrincham	0.6633193	Aldershot
44	2015-01-31	Football Conference	Dartford	0.4038861	Bristol R
45	2015-01-31	Football Conference	Braintree	0.5103006	Macclesfield
46	2015-01-31	Football Conference	Halifax	0.4962238	Barnet
47	2015-01-31	Football Conference	Wrexham	0.5488996	Torquay
48	2015-01-31	Football Conference	Forest Green	0.7835021	Nuneaton
49	2015-01-31	Football Conference	Lincoln	0.4943357	Dover
50	2015-01-31	Football Conference	Grimsby	0.7575580	Telford
51	2015-01-31	Football Conference	Woking	0.7206525	Alfreton
52	2015-01-31	Football Conference	Welling	0.5550018	Chester
53	2015-01-31	Football Conference	Kidderminster	0.5429467	Eastleigh
54	2015-01-31	Football Conference	Southport	0.5113787	Gateshead
97	2015-01-31	Ryman Premier	Kingstonian	0.5111839	Maidstone
113	2015-02-01	English League One	Walsall	0.6254755	Gillingham
114	2015-02-01	English League One	Bristol C	0.7055175	Fleetwood
111	2015-02-01	English Premier	Arsenal	0.7391717	Aston Villa
112	2015-02-01	English Premier	Southampton	0.7055737	Swansea
124	2015-02-03	Conference North	Stockport	0.5920790	Barrow
140	2015-02-03	English FA Cup	Fulham	0.4669152	Sunderland
141	2015-02-03	English FA Cup	Sheff Utd	0.5746091	Preston
142	2015-02-03	English FA Cup	Man Utd	0.8570219	Cambridge U
138	2015-02-03	English League One	Barnsley	0.5743270	Oldham
143	2015-02-03	FA Trophy	Halifax	0.7236927	Dartford
144	2015-02-03	FA Trophy	Gateshead	0.6454331	Wrexham
145	2015-02-03	FA Trophy	Ebbsfleet	0.4182868	Braintree
118	2015-02-03	Football Conference	Dover	0.5826340	Grimsby
119	2015-02-03	Football Conference	Alfreton	0.4951449	Lincoln
139	2015-02-03	Football Conference	Wrexham	0.5128968	Forest Green
149	2015-02-04	English FA Cup	Bolton	0.3915110	Liverpool
147	2015-02-04	Ryman Premier	Lewes	0.5321890	Canvey Isl.

Forecasting Football Matches, January 30–February 4

J James Reade

30/01/2015