library(tidyverse)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(caret)
library(lubridate)
library(partykit)
library(ROCR)
library(lime)
library(engsoccerdata)
library(tidyquant)
library(lubridate)
library(dplyr)
library(forecast)
library(MLmetrics)
library(ggplot2)
library(zoo)One of the expanding areas necessitating good predictive accuracy is sport prediction, due to the large monetary amounts involved in betting. Whatever you call it, referring that slice of action every bookmaker takes out of the odds in order to make their business worthwhile. It varies between each bookmaker, and even from one event to another, sometimes with special offers like ‘reduced juice’ on particular leagues and tournaments, or higher odds and greater returns compared to those offered by the competing bookmakers, solely to attract customers.
What we are trying to accomplish by creating this machine learning model, is to create a model to predict the outcomes of football matches in order to take value in sports positions against the market.
What does value mean? Value means a position for a match that is under-valued by other people in the market. For example, when you get a 2.01 times pay out for a coin toss landing on heads, that is a position in good value, because overtime in the long run over 1000 coin tosses you statistically will win 500 of them.
The goal of this documentation will be the use of machine learning to create a classification model in order to predict outcomes of a match. The success of creating this model can help sports traders to profit from their positions in the sports market.
Data that will be used will be from football-data.co.uk, which contains match information and statistics from all football games in Europe. We will be using England,Germany,Spain,Italy’s top and second league, and also Belgium,Turkey,and France’s Top League.
The csv files are separated by season and country. So the first pre-process that will be done is combining them all together into a data frame.
Here we also pick collumns that we think are relevant to the outcome of the match with the betting odds represented by one bookmaker, which is Bet 365.
a <- read.csv("england_all.csv") %>%
select(Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HC,HS,HST,HF,HY,HR,AC,AS,AST,AF,AY,AR,B365H,B365D,B365A,BbAv.2.5,BbAv.2.5.1) %>% mutate(Tier = "top")
b <- read.csv("spain_all.csv") %>%
select(Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HC,HS,HST,HF,HY,HR,AC,AS,AST,AF,AY,AR,B365H,B365D,B365A,BbAv.2.5,BbAv.2.5.1)%>% mutate(Tier = "top")
c <- read.csv("german_all.csv") %>%
select(Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HC,HS,HST,HF,HY,HR,AC,AS,AST,AF,AY,AR,B365H,B365D,B365A,BbAv.2.5,BbAv.2.5.1)%>% mutate(Tier = "top")
d <- read.csv("italy_all.csv")%>%
select(Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HC,HS,HST,HF,HY,HR,AC,AS,AST,AF,AY,AR,B365H,B365D,B365A,BbAv.2.5,BbAv.2.5.1)%>% mutate(Tier = "top")
e <- read.csv("d1_all.csv")%>%
select(Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HC,HS,HST,HF,HY,HR,AC,AS,AST,AF,AY,AR,B365H,B365D,B365A,BbAv.2.5,BbAv.2.5.1)%>% mutate(Tier = "second")
f <- read.csv("other_all.csv")%>%
select(Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HC,HS,HST,HF,HY,HR,AC,AS,AST,AF,AY,AR,B365H,B365D,B365A,BbAv.2.5,BbAv.2.5.1)%>% mutate(Tier = "second")
g <- read.csv("F1_18.csv") %>%
mutate(Season = 2018) %>%
select(Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HC,HS,HST,HF,HY,HR,AC,AS,AST,AF,AY,AR,B365H,B365D,B365A,BbAv.2.5,BbAv.2.5.1) %>% mutate(Tier = "second")
a <- a[,names(d)]
b <- b[,names(d)]
c <- c[,names(d)]
e <- e[,names(d)]
f <- f[,names(d)]
g <- g[,names(d)]
h <- h[,names(d)]
d <- rbind(a,b,c,d,e,f,g,h)
d <- distinct(d)
d <- d %>%
select(Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HC,HS,HST,HF,HY,HR,AC,AS,AST,AF,AY,AR,B365H,B365D,B365A,BbAv.2.5,BbAv.2.5.1,Tier) %>%
mutate(Date = as_date(dmy(Date)),
HTG = FTHG+FTAG-HTHG-HTAG,
month = month(Date, label = TRUE),
day = wday(Date, label = TRUE),
dayofmonth = day(Date),
tsot = AST+HST,
fav = ifelse(B365H<B365A,"H","A"),
winner = ifelse(FTR == fav,"fav","ud")) %>%
select(Season,Date,month,day,dayofmonth,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HTG,HC,tsot,HS,HST,HF,HY,HR,AC,AS,AST,AF,AY,AR,B365H,B365D,B365A,BbAv.2.5,BbAv.2.5.1,Tier,winner,fav)
head(d)In this preprocess stage we will be processing the moving averages of certain variables in order to correctly portray the form of the current teams. There will be 2 moving averages being calculated here, one for the last 3 games, and 1 for the last 10 games. The goal is for the the 3 game moving average to represent more recent form, and the 10 game moving average to represent a sort of power ratings of the clubs that are playing.
Here we do this to a number of variables:
‘ma.ayellows10’ : Moving Average of the Away Team Yellow Cards in the last 10 Matches
‘ma.ayellows3’ : Moving Average of the Away Team Yellow Cards in the last 3 Matches
#Moving Average Goals 3
Clubs <- levels(d$HomeTeam)
temp1 <- NULL
temp <- NULL
for(y in 2016:2019){
for(x in Clubs){
temp <- d %>%
filter(HomeTeam == x|AwayTeam == x,
Season == y) %>%
arrange(Date) %>%
mutate(goals = ifelse(HomeTeam == x,FTHG,FTAG),
hgoals = ifelse(HomeTeam == x,1,0),
agoals = ifelse(AwayTeam == x,1,0),
lag1 = lag(goals)) %>%
select(Season,HomeTeam,AwayTeam,Date,goals,hgoals,agoals,lag1)
temp1 <- rbind(temp1,temp)
}
}
movingavg.goals <- temp1 %>%
mutate(m.avggoals=rollapply(lag1,3,mean,align='right',fill=NA))
movingavg.goals <- movingavg.goals %>%
mutate(ma.hgoals = ifelse(hgoals == 1,m.avggoals,0),
ma.agoals = ifelse(agoals == 1,m.avggoals,0)) %>%
group_by(Date,HomeTeam,AwayTeam) %>%
summarise(ma.hgoals = sum(ma.hgoals),
ma.agoals = sum(ma.agoals)) %>%
ungroup()
d<- merge(movingavg.goals,d,by=c("Date","HomeTeam","AwayTeam"))#Moving Average goals 10
Clubs <- levels(d$HomeTeam)
temp1 <- NULL
temp <- NULL
for(y in 2016:2019){
for(x in Clubs){
temp <- d %>%
filter(HomeTeam == x|AwayTeam == x,
Season == y) %>%
arrange(Date) %>%
mutate(goals10 = ifelse(HomeTeam == x,FTHG,FTAG),
hgoals10 = ifelse(HomeTeam == x,1,0),
agoals10 = ifelse(AwayTeam == x,1,0),
lag1 = lag(goals10)) %>%
select(Season,HomeTeam,AwayTeam,Date,goals10,hgoals10,agoals10,lag1)
temp1 <- rbind(temp1,temp)
}
}
movingavg.goals10 <- temp1 %>%
mutate(m.avggoals10=rollapply(lag1,10,mean,align='right',fill=NA))
movingavg.goals10 <- movingavg.goals10 %>%
mutate(ma.hgoals10 = ifelse(hgoals10 == 1,m.avggoals10,0),
ma.agoals10 = ifelse(agoals10 == 1,m.avggoals10,0)) %>%
group_by(Date,HomeTeam,AwayTeam) %>%
summarise(ma.hgoals10 = sum(ma.hgoals10),
ma.agoals10 = sum(ma.agoals10)) %>%
ungroup()
d<- merge(movingavg.goals10,d,by=c("Date","HomeTeam","AwayTeam"))We Separate the december matches of 2019 as a data test to the model later. We come up with the final database of ‘d’ and our later data test of ‘decembertest’.
#We Separate the december matches of 2019 as a data test to the model later
decembertest <- d %>%
filter(Season == 2019,
month == "Dec")
d <- d %>%
group_by(Date) %>%
tail(-nrow(decembertest)) %>%
ungroup()
d <- na.omit(d)
head(decembertest)After our pre process stage, we end up with our final dataframe and datatest.
We then define and eliminate extreme outliers from our dataframe, using Principal Component Analysis (PCA), These outliers have a high probability in creating bias results to our machine learning model as these datapoints seem to not represent the common distribution of the majority of the data.
library(FactoMineR)
#Allows us to look at the distribution of data and eliminate extreme Outliers that can possibly effect the machine learning process
pca.d1 <- d %>%
mutate(ofsh = as.factor(ifelse(FTHG+FTAG-HTHG-HTAG>1,"yes","no")),
htgdiff = HTHG - HTAG) %>%
select(-HomeTeam,-AwayTeam,-Date,-FTHG,-FTAG,-HC,-HS,-HST,-HF,-HY,-HR,-AC,-AS,-AST,-AF,-AY,-AR,-Season,-HTR,-
HTG,-BbAv.2.5,-BbAv.2.5.1,-tsot,-month,-day,-dayofmonth,-ofsh,-htgdiff,-Tier,-winner,-HTHG,-HTAG,-B365H,-B365D,-B365A,-FTR,-fav)
pca.d <-prcomp(pca.d1, scale. = T)
pca.dd <- PCA(pca.d1, scale.unit = TRUE, graph = F)
#Highligts our main 5 outliers
plot.PCA(pca.dd, cex=0.6,choix=("ind"),select = "contrib5")#We eliminate the extreme outliers
pca.df <- as.data.frame(pca.d$x)
pca.df <- as.data.frame(pca.d$x) %>%
mutate(indexx = c(1:nrow(pca.df)))
outlier <- pca.df %>%
filter(PC1>5) %>%
select(indexx)
outlier.index <- outlier$indexx
outlier.index#> [1] 3917
#PCA(2)
pca.d1 <- d %>%
mutate(ofsh = as.factor(ifelse(FTHG+FTAG-HTHG-HTAG>1,"yes","no")),
htgdiff = HTHG - HTAG) %>%
select(-HomeTeam,-AwayTeam,-Date,-FTHG,-FTAG,-HC,-HS,-HST,-HF,-HY,-HR,-AC,-AS,-AST,-AF,-AY,-AR,-Season,-HTR,-
HTG,-BbAv.2.5,-BbAv.2.5.1,-tsot,-month,-day,-dayofmonth,-ofsh,-htgdiff,-Tier,-winner,-HTHG,-HTAG,-B365H,-B365D,-B365A,-FTR,-fav)
pca.d <-prcomp(pca.d1, scale. = T)
pca.dd <- PCA(pca.d1, scale.unit = TRUE, graph = F)
#Highligts our main 5 outliers
plot.PCA(pca.dd, cex=0.6,choix=("ind"),select = "contrib5")#We eliminate the extreme outliers (2)
pca.df <- as.data.frame(pca.d$x)
pca.df <- as.data.frame(pca.d$x) %>%
mutate(indexx = c(1:nrow(pca.df)))
outlier <- pca.df %>%
filter(PC1>4) %>%
select(indexx)
outlier.index <- outlier$indexx
outlier.index#> [1] 1004 1323 1512 1517 1557 1586 1647 1920 2933 3312 3345 3544 3668 3728
#> [15] 3732 3895 5893 6278
#PCA(3)
pca.d1 <- d %>%
mutate(ofsh = as.factor(ifelse(FTHG+FTAG-HTHG-HTAG>1,"yes","no")),
htgdiff = HTHG - HTAG) %>%
select(-HomeTeam,-AwayTeam,-Date,-FTHG,-FTAG,-HC,-HS,-HST,-HF,-HY,-HR,-AC,-AS,-AST,-AF,-AY,-AR,-Season,-HTR,-
HTG,-BbAv.2.5,-BbAv.2.5.1,-tsot,-month,-day,-dayofmonth,-ofsh,-htgdiff,-Tier,-winner,-HTHG,-HTAG,-B365H,-B365D,-B365A,-FTR,-fav)
pca.d <-prcomp(pca.d1, scale. = T)
pca.dd <- PCA(pca.d1, scale.unit = TRUE, graph = F)
#Highligts our main 5 outliers
plot.PCA(pca.dd, cex=0.6,choix=("ind"),select = "contrib5")#We eliminate the extreme outliers (3)
pca.df <- as.data.frame(pca.d$x)
pca.df <- as.data.frame(pca.d$x) %>%
mutate(indexx = c(1:nrow(pca.df)))
outlier <- pca.df %>%
filter(PC1>3) %>%
select(indexx)
outlier.index <- outlier$indexx
outlier.index#> [1] 23 52 309 349 402 420 448 657 786 877 1610 3981 4169 4199
#> [15] 4452 4710 4880 4909 5059 5100 5106 5143 5458 5866 5934 6002 6108 6357
#> [29] 6411 6439 6455 6458 6497 6498 6500 6501 6502 6503 6504 6505 6506 6507
#> [43] 6508 6509 6510 6511 6512 6513 6514 6515 6516 6517 6518 6519 6520 6521
#> [57] 6522 6523 6524 6525 6526 6527 6528 6529 6530 6531 6532 6533 6534 6535
#> [71] 6536 6537 6538 6539 6540 6541 6542 6543 6544 6545 6546 6547 6548 6549
#> [85] 6550 6551 6552 6553 6554 6555 6556 6557 6558 6559 6560 6561 6563 6564
#> [99] 6565 6566 6567 6568 6569 6570 6572 6573 6574 6575 6577 6578 6580 6581
#> [113] 6582 6583 6584 6585 6587 6588 6589 6590 6591 6592 6593 6594 6595 6596
#> [127] 6597 6598 6599 6600 6601 6602 6603 6604 6605 6606 6607 6608 6609 6610
#> [141] 6611 6612 6613 6614 6615 6616 6617 6618 6620 6621 6626 6627 6630 6631
#> [155] 6633 6634 6635 6636 6637 6638 6639 6640 6643 6644 6645 6646 6647 6648
#> [169] 6649 6650 6653 6654 6655 6656 6660 6661 6663 6664 6666 6667 6668 6669
#> [183] 6670 6671 6673 6674 6675 6676 6677 6678 6679 6680 6681 6682 6683 6684
#> [197] 6687 6688 6692 6693 6696 6697 6698 6699 6789 6840 7230 7257 7340
#PCA(4)
pca.d1 <- d %>%
mutate(ofsh = as.factor(ifelse(FTHG+FTAG-HTHG-HTAG>1,"yes","no")),
htgdiff = HTHG - HTAG) %>%
select(-HomeTeam,-AwayTeam,-Date,-FTHG,-FTAG,-HC,-HS,-HST,-HF,-HY,-HR,-AC,-AS,-AST,-AF,-AY,-AR,-Season,-HTR,-
HTG,-BbAv.2.5,-BbAv.2.5.1,-tsot,-month,-day,-dayofmonth,-ofsh,-htgdiff,-Tier,-winner,-HTHG,-HTAG,-B365H,-B365D,-B365A,-FTR,-fav)
pca.d <-prcomp(pca.d1, scale. = T)
pca.dd <- PCA(pca.d1, scale.unit = TRUE, graph = F)
#Highligts our main 5 outliers
plot.PCA(pca.dd, cex=0.6,choix=("ind"),select = "contrib5")We try to visualize the data by their variables as to get a picture of the distribution of the data and how it affects the final result.
Here we plot the months against the total goals in a match to get a picture of how the general productivity of the teams are through out the year. As we can see there are certain months that show lower productivity in goals than other months. This i presume can be because of team fixture schedules getting tighter in certain months than others, also the change of the seasons can also affect productivity as changes in general temperature between seasons are common in europe.
# We Plot month to the total number of goals in a match
general_productivity <- d %>%
group_by(month) %>%
summarise(averagetotalgoals = mean (FTAG+FTHG)) %>%
ungroup() %>%
mutate(scaled_productivity = scale(averagetotalgoals))
productivity_month_plot <- ggplot(general_productivity,aes(x=month, y=averagetotalgoals))+
geom_col(aes(fill = averagetotalgoals))+
theme_minimal()+
scale_x_discrete()+
labs(title = "Productivity in Matches across the Year",
y = "Average Total Goals in a Match")+
theme_minimal()
productivity_month_plotIn this graph we plot our own moving average home goals variable (ma.hgoals10) against the full time goals of the homeside (FTHG) in order to see the distribution, and get a picture on how it affects the final home goal result. As we can see, there is a positive trend, in which the higher the ma.hgoals10 is the more likely home goals will be produced.
#to add the total number of goals to the dataframe
d$totalgoals <- d$FTHG+d$FTAG
# We Plot the moving average goals of the last 10 matches of the home team and how it distributes to the average number of goals in a match
FTHG_ma.hgoals_plot <- d %>%
group_by(FTHG) %>%
summarise(average.mahgoals10 = mean(ma.hgoals10)) %>%
ungroup() %>%
mutate(scaled_productivity = scale(average.mahgoals10))
ggplot(FTHG_ma.hgoals_plot,aes(x=FTHG, y=(average.mahgoals10)))+
geom_col(aes(fill = average.mahgoals10))+
theme_minimal()+
scale_x_continuous(breaks=c(1:10))+
labs(title = "How the Moving Average of Goals from the Home Team effect Home Productivity",
y = "Average Goals of the Home Team in a match",
x = "Full Time Home Goals (FTHG)")+
theme_minimal()In this third graph we plot the average days of teams have in between league fixtures and see how it affects the home team goal productivity (FTHG). Here we can see a rather different distribution from our second graph, in which here it seems that there is more of an optimum value where if hometeams have more than 5 days off the resting time becomes counterproductive and yields poorer productivity on the pitch.
# We Plot the total days off of the home team and how it effects to the average number of goals in a match
FTG_daysoff_plot <- d %>%
group_by(hdaysoff) %>%
summarise(average.goals = mean(FTHG+FTAG)) %>%
ungroup() %>%
mutate(scaled_productivity = scale(average.goals)) %>%
head(13)
ggplot(FTG_daysoff_plot,aes(x=hdaysoff, y=(average.goals)))+
geom_col(aes(fill = average.goals))+
theme_minimal()+
scale_x_continuous(breaks=c(1:14))+
labs(title = "How the Number of days off of the Home Team effect Home Productivity",
y = "Full Time Home Goals (FTHG)",
x = "Average Days Off of the Home Team before the match")+
theme_minimal()Further visualization can be done, but what we can take from these visualizations are that various variables from the dataframe effect results of the match result, some have different distributions than others, for example average daysoff, has a more optimum value that is diffent from the ma.hgoals10 variable where the higher the value the higher the outcome.
After the data exploration that we have done, we can now further use this data frame to create the prediction model