My friend won TZS 100,000 last weekend by betting on some EPL football matches. Over the weekend, I decided to use Machine Learning and Analytics to solve the football match prediction problem and become rich in the process!

I needed a lot of match data and, after some browsing, I found football-data.co.uk (http://www.football-data.co.uk/englandm.php), which hosts match data for most of the leagues in the world. The data came as CSV files, one per season. I decided to download the files from the 2000/01 season to date. I must admit that teams have changed a lot since 2000. Nevertheless, it was quite amazing to have a look at all the data.

#Reading the folder path
v='PL'
p=paste('/home/kevin/Downloads/',v,sep='')
library('dplyr',warn.conflicts=FALSE)
#Defining function to merge the files
combine = function(mypath){
  #List every season file in the folder
  filenames=list.files(path=mypath, full.names=TRUE)
  #Read each CSV into its own data frame
  datalist = lapply(filenames, function(x){read.csv(file=x,sep=',')})
  #Stack the data frames; columns absent from a season are filled with NA
  bind_rows(datalist)}
#Reading each file and combining them into a single dataframe
a=combine(p)
attach(a)
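
Here is a small sketch of what the merge step has to cope with. It assumes (my assumption, the files themselves are not shown here) that season files may not all share the same columns. The data frames `season_a` and `season_b` and the helper `stack_fill` are hypothetical, and the snippet is written in base R so it runs without dplyr; `bind_rows` fills missing columns with NA in a similar way.

```r
# Toy stand-ins for two season files whose columns differ (made-up values).
season_a <- data.frame(HomeTeam = "Chelsea", FTR = "H", B365H = 1.5,
                       stringsAsFactors = FALSE)
season_b <- data.frame(HomeTeam = "Arsenal", FTR = "D", LBH = 2.1,
                       stringsAsFactors = FALSE)

# Stack two data frames, filling columns missing from either side with NA.
stack_fill <- function(x, y) {
  all_cols <- union(names(x), names(y))
  for (col in setdiff(all_cols, names(x))) x[[col]] <- NA
  for (col in setdiff(all_cols, names(y))) y[[col]] <- NA
  rbind(x[all_cols], y[all_cols])
}

# Reduce applies the helper pairwise across the whole list of seasons.
combined <- Reduce(stack_fill, list(season_a, season_b))
print(combined)
```

A plain `rbind` would stop with an error on these two data frames because their column names do not match, which is why a filling variant is the safer choice for files collected over many seasons.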

The dataset consisted of 68 features which contained match statistics and various betting odds.

names(a)
 [1] "Div"        "Date"       "HomeTeam"   "AwayTeam"   "FTHG"      
 [6] "FTAG"       "FTR"        "HTHG"       "HTAG"       "HTR"       
[11] "Referee"    "HS"         "AS"         "HST"        "AST"       
[16] "HF"         "AF"         "HC"         "AC"         "HY"        
[21] "AY"         "HR"         "AR"         "B365H"      "B365D"     
[26] "B365A"      "BWH"        "BWD"        "BWA"        "IWH"       
[31] "IWD"        "IWA"        "LBH"        "LBD"        "LBA"       
[36] "PSH"        "PSD"        "PSA"        "WHH"        "WHD"       
[41] "WHA"        "VCH"        "VCD"        "VCA"        "Bb1X2"     
[46] "BbMxH"      "BbAvH"      "BbMxD"      "BbAvD"      "BbMxA"     
[51] "BbAvA"      "BbOU"       "BbMx.2.5"   "BbAv.2.5"   "BbMx.2.5.1"
[56] "BbAv.2.5.1" "BbAH"       "BbAHh"      "BbMxAHH"    "BbAvAHH"   
[61] "BbMxAHA"    "BbAvAHA"    "PSCH"       "PSCD"       "PSCA"      
[66] "SJH"        "SJD"        "SJA"       

The key for the above features is available on the data hosting website. They correspond to statistics such as full-time goals, number of red cards and Bet365 odds for a match. I reduced the columns to a manageable subset that I could work with. SQL made this easy, as I could simply run a few queries and get the desired table, so I turned to the sqldf package, which runs SQL queries on R data frames. Below are the columns I kept: the team names select the fixture, the odds serve as predictors and FTR is the outcome of a match.

# HomeTeam = Home Team 
# AwayTeam = Away Team
# B365H = Bet365 home win odds
# B365D = Bet365 draw odds
# B365A = Bet365 away win odds
# LBH = Ladbrokes home win odds
# LBD = Ladbrokes draw odds
# LBA = Ladbrokes away win odds
# FTR = Full Time Result (H=Home Win, D=Draw, A=Away Win)

Suppose we would like to predict a game between Chelsea and Man City. We would first have to select the subset of the data that contains the respective teams. An SQL query would take the form below:

#Team selection
library('sqldf',warn.conflicts=FALSE)
t=sqldf("SELECT B365H,B365D,B365A,LBH,LBD,LBA,FTR FROM a WHERE
        HomeTeam='Chelsea' AND AwayTeam='Man City'")
t[-7,]

The above table shows the games Chelsea hosted against Man City from 2000 to 2017. We need to predict the next match using odds data. I obtained the pre-match odds from oddschecker (https://www.oddschecker.com/football/english/premier-league), which lists odds for upcoming matches. The part that follows is very exciting, as we apply machine learning to predict the full-time result (FTR) of the next match from the past odds data. I used two Machine Learning techniques, namely naive Bayes and a classification tree.
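
To make the naive Bayes step less of a black box, here is a rough sketch of the calculation for a single numeric predictor. e1071's `naiveBayes` models numeric predictors as Gaussian within each class; the function `gaussian_nb_posterior` and the toy data below are hypothetical and only illustrate that idea, they are not the model fitted in this post.

```r
# Made-up history: decimal home-win odds and full-time results (not real matches).
toy <- data.frame(
  B365H = c(1.5, 1.7, 1.9, 2.4, 2.6, 3.0, 3.4),
  FTR   = c("H", "H", "H", "D", "D", "A", "A"),
  stringsAsFactors = FALSE
)

# Posterior over results for one new odds value x:
# class prior times the Gaussian density of x under that class, then normalised.
gaussian_nb_posterior <- function(train, x) {
  classes <- sort(unique(as.character(train$FTR)))
  scores <- sapply(classes, function(k) {
    odds_k <- train$B365H[train$FTR == k]
    prior <- length(odds_k) / nrow(train)
    prior * dnorm(x, mean = mean(odds_k), sd = sd(odds_k))
  })
  scores / sum(scores)
}

posterior <- gaussian_nb_posterior(toy, 1.6)
print(round(posterior, 3))
predicted <- names(which.max(posterior))
```

With several predictors, the per-class densities of each column are simply multiplied together, which is the "naive" independence assumption.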

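#Factorization
#b is assumed to be the subset printed above (t without row 7)
b=t[-7,]
b$FTR=factor(b$FTR)

#Training and test
#gtest holds the same rows as gtrain minus the label, so the accuracies below are in-sample
library('varhandle')
gtrain=b
gtest<-data.frame(gtrain[,c(1:6)])
gtestlabel<-c(unfactor(gtrain$FTR))
attach(gtrain)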

#Naive bayes
library('e1071',warn.conflicts=FALSE)
naive_bayes_model<-naiveBayes(FTR ~ ., data = gtrain)
naive_bayes_predictions<-predict(naive_bayes_model, newdata=gtest)
naive_bayes_accuracy=round(mean(naive_bayes_predictions==gtestlabel),2)*100

#Classification tree
library('party',warn.conflicts=FALSE)
ctree_model<- ctree(FTR ~ ., data = gtrain,controls=ctree_control(minsplit=30,minbucket=10,maxdepth=5))
ctree_predictions <- predict(ctree_model,newdata=gtest,type='response')
ctree_accuracy=round(mean(ctree_predictions==gtestlabel),2)*100

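#Creating the input odds instance
#Values are numeric (unquoted) so they match the numeric odds columns used in training
#Decimal odds are always above 1, so 0.727 looks like a fractional price; check the format before relying on it
odds=data.frame(B365H=0.727,B365D=2.75,B365A=4,LBH=0.727,LBD=2.6,LBA=3.75)

#Using the predict function to make a match prediction
predict(naive_bayes_model,newdata=odds,type = 'class')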
plot(naive_bayes_predictions,xlab='Result',ylab='Number of Matches')
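
As a sanity check on any model output, bookmaker prices can be turned into implied probabilities. The sketch below assumes the three Bet365 prices in the odds instance above are fractional odds (my reading, since a decimal price cannot be below 1); the helper names `fractional_to_decimal` and `implied_probabilities` are hypothetical.

```r
# Fractional odds (profit per unit stake) to decimal odds (total return per unit stake).
fractional_to_decimal <- function(frac) frac + 1

# Implied probabilities: invert each decimal price, then rescale so they sum to 1,
# which spreads the bookmaker's margin proportionally across the outcomes.
implied_probabilities <- function(decimal_odds) {
  raw <- 1 / decimal_odds
  raw / sum(raw)
}

# Bet365 prices from the odds instance above, read as fractional (assumption).
b365 <- c(H = 0.727, D = 2.75, A = 4)
probs <- implied_probabilities(fractional_to_decimal(b365))
print(round(probs, 3))
```

If a model's predicted result disagrees sharply with the outcome the bookmaker rates most likely, that is usually a hint to look at the input format or the training data first.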

On average, naive Bayes performed better than the classification tree, achieving accuracies as high as 70%. Note that these accuracies were computed on the same matches the models were trained on. I also realized that using team data from 2000 will not add any information in predicting a 2017 match. It was also quite interesting to note that although there were 6643 Premier League matches played since 2000, you cannot find many data points for any two teams, and the problem gets exacerbated if one of the teams has recently been promoted from a lower league. For instance, the Chelsea vs Man City fixture generated only 16 data points out of 6643 matches once the home and away roles are taken into account. I thus resorted to analyzing the matches using SQL to calculate the percentage of each result for the home team in a given fixture. This might give an indication of the next match.
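
Since `gtest` is built from the rows of `gtrain`, the accuracies above are in-sample. With so few matches per fixture, leave-one-out cross-validation is one way to get a more honest figure. The sketch below is only an illustration: the toy data are made up, and the classifier is a simple nearest-class-mean rule (`predict_nearest_mean`, a hypothetical helper), not the naive Bayes or tree models used in this post.

```r
# Made-up toy data (not real matches): one odds column and the result.
toy <- data.frame(
  B365H = c(1.5, 1.7, 1.9, 2.0, 2.9, 3.1, 3.3, 3.6),
  FTR   = c("H", "H", "H", "A", "A", "A", "A", "H"),
  stringsAsFactors = FALSE
)

# Nearest-class-mean rule: predict the result whose average odds are closest to x.
predict_nearest_mean <- function(train, x) {
  class_means <- tapply(train$B365H, train$FTR, mean)
  names(class_means)[which.min(abs(class_means - x))]
}

# Leave-one-out: hold each match out in turn, fit on the rest, predict the held-out one.
loo_predictions <- sapply(seq_len(nrow(toy)), function(i) {
  predict_nearest_mean(toy[-i, ], toy$B365H[i])
})
loo_accuracy <- mean(loo_predictions == toy$FTR)
print(loo_accuracy)
```

The same loop works with any model: replace the body of the helper with a fit on `train` followed by a prediction for the held-out row.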

s=sqldf("SELECT FTR AS 'Result',COUNT(FTR) AS 'Counts' 
        FROM t GROUP BY FTR")
p=sqldf("SELECT Result,Counts,ROUND(Counts*100.0/(SELECT SUM(Counts) FROM s),0) AS 'Percentage' FROM s GROUP BY Result")
p
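
The same percentages can be had without SQL. This is a base-R sketch on a made-up results vector; `ftr` is hypothetical and stands in for the FTR column of the fixture table.

```r
# Made-up full-time results for one fixture (H = home win, D = draw, A = away win).
ftr <- c("H", "H", "D", "A", "H", "D", "H", "A")

# Count each result, then express the counts as rounded percentages of the total.
counts <- table(ftr)
percentages <- round(100 * prop.table(counts))

result_summary <- data.frame(
  Result     = names(counts),
  Counts     = as.vector(counts),
  Percentage = as.vector(percentages),
  stringsAsFactors = FALSE
)
print(result_summary)
```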

Match prediction is very hard, and Machine Learning could not give us the desired level of accuracy. This is mainly because there is simply not enough data and the teams are constantly changing. The best we can hope for is to follow our instincts and cross our fingers!
