The project aims at predicting whether a team was the host of a match or the guest. To achieve this, we gathered data for seven seasons of the English Premier League using our self-built web scraper.
Similar topics were investigated by scholars in the past, e.g. Carmichael and Thomas (2005) assessed the existence of a home-field effect. They conducted their analysis for the English Premier League and found that attacking play is more productive for home teams and that defensive play is more important for away teams.
We utilized a handful of ML methods, with classical logistic regression serving as a benchmark. The exact methods are: classification trees, random forests, extreme gradient boosting (XGBoost), and neural networks.
library(readr)
matches_org <- read_delim("meczeTrans.csv", delim = ";", escape_double = FALSE, locale = locale(decimal_mark = ","), trim_ws = TRUE)
# Every other row contains statistics either for a home or away team.
home = rep(c(1, 0), times = length(matches_org$MatchID)/2)
matches_org$home = home
# dim(matches_org)
matches_s = subset(matches_org, select= -c(...1, MatchID, MatchDate, Week, TeamName))
# Convert RedCards to binary variable
matches_s$RedCards[matches_s$RedCards > 0] = 1
matchesF = matches_s
dim(matchesF)
## [1] 5320 13
head(matchesF)
## # A tibble: 6 x 13
## GoalsHT GoalsFT BallPos ShotsOffTarget ShotsOnTarget BlockedShots Corners
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 0.6 5 5 4 4
## 2 1 2 0.4 0 4 1 0
## 3 0 0 0.51 7 6 6 8
## 4 0 1 0.49 3 4 4 9
## 5 0 0 0.63 4 2 6 2
## 6 0 1 0.37 4 1 2 8
## # ... with 6 more variables: PassSuccPerc <dbl>, AerialsWon <dbl>, Fouls <dbl>,
## # YellowCards <dbl>, RedCards <dbl>, home <dbl>
Each row in the data set contains statistics for one team. Two consecutive rows (starting from the first one) contain statistics of one match – for home and away team of this match.
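For illustration, a single match pair can be inspected in the raw data (using the column names from the code above):
# First match: the home-team row followed by the away-team row
matches_org[1:2, c("MatchID", "TeamName", "home")]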
There are 17 variables in the original data set.
The first four (MatchID, MatchDate, Week and TeamName) were discarded, since they are irrelevant to the conducted analysis.
Due to the very low number of cases with more than one red card given during a match, this variable was recoded into a binary one indicating whether the team received a red card or not. High correlation is not very harmful for the utilized ML methods; however, it should be addressed at the beginning.
library(corrplot)
library(vtable)
testRes = cor.mtest(matchesF, conf.level = 0.95) # Significance of the correlation
corrplot(cor(matchesF), p.mat = testRes$p, type = 'lower', method = 'number',
insig = 'blank', tl.cex=3)
# myTable = summary_table(matchesF)
# myTable
# table(matchesF)
The numbers of goals scored in the first half and over the whole match are highly correlated (0.7) – an intuitive result. High correlation is also observed between ball possession and the percentage of successful passes – this is also intuitive, since a team with high possession must pass the ball frequently and accurately.
sumtable(matchesF)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| GoalsHT | 5320 | 0.608 | 0.802 | 0 | 0 | 1 | 5 |
| GoalsFT | 5320 | 1.356 | 1.248 | 0 | 0 | 2 | 9 |
| BallPos | 5320 | 0.5 | 0.125 | 0.17 | 0.41 | 0.59 | 0.83 |
| ShotsOffTarget | 5320 | 4.877 | 2.609 | 0 | 3 | 6 | 16 |
| ShotsOnTarget | 5320 | 4.279 | 2.487 | 0 | 2 | 6 | 17 |
| BlockedShots | 5320 | 3.43 | 2.415 | 0 | 2 | 5 | 19 |
| Corners | 5320 | 5.24 | 2.962 | 0 | 3 | 7 | 19 |
| PassSuccPerc | 5320 | 0.776 | 0.076 | 0.48 | 0.73 | 0.83 | 0.94 |
| AerialsWon | 5320 | 18.201 | 7.362 | 1 | 13 | 23 | 67 |
| Fouls | 5320 | 10.814 | 3.481 | 0 | 8 | 13 | 26 |
| YellowCards | 5320 | 1.665 | 1.256 | 0 | 1 | 2 | 9 |
| RedCards | 5320 | 0.037 | 0.188 | 0 | 0 | 0 | 1 |
| home | 5320 | 0.5 | 0.5 | 0 | 0 | 1 | 1 |
The number of observations equals 5320. The variable with the lowest mean is the binary red-card indicator (0.037), and the variable with the highest mean is the number of aerial duels won by the team (18.2). The red-card variable is also highly dispersed – its standard deviation is about five times its mean.
table(matchesF$home)
##
## 0 1
## 2660 2660
The data is perfectly balanced, since for every host team there must be exactly one guest team.
We divide the data set into a training and a testing sample – the training sample contains 70% of the observations and the remaining 30% is reserved for testing.
library(caret)
set.seed(123456789)
training_obs <- createDataPartition(matchesF$home,
p = 0.7,
list = FALSE)
matches.train <- matchesF[training_obs,]
matches.test <- matchesF[-training_obs,]
Let us define the formula of the model used for every method.
model1.formula <- home ~ GoalsHT + GoalsFT + BallPos + ShotsOffTarget + ShotsOnTarget +
BlockedShots + Corners + PassSuccPerc + AerialsWon + Fouls + YellowCards + RedCards
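Since matchesF contains exactly these twelve predictors and the target, an equivalent shorthand would be the dot formula (model1.formula.alt is a hypothetical name, shown only for reference):
# Equivalent: all remaining columns of the data serve as predictors
model1.formula.alt <- home ~ .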
First, we use the classical econometric method, which will serve as a benchmark for the other methods.
matches.logit = glm(formula = model1.formula, data = matches.train, family = binomial(link = 'logit'))
summary(matches.logit)
##
## Call:
## glm(formula = model1.formula, family = binomial(link = "logit"),
## data = matches.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1666 -1.0983 -0.1234 1.1303 1.7621
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.857702 0.543155 -1.579 0.11431
## GoalsHT 0.118168 0.060397 1.957 0.05041 .
## GoalsFT 0.075215 0.044843 1.677 0.09349 .
## BallPos -0.859233 0.498681 -1.723 0.08489 .
## ShotsOffTarget 0.102572 0.015165 6.764 1.35e-11 ***
## ShotsOnTarget 0.053454 0.018992 2.815 0.00489 **
## BlockedShots 0.072680 0.017614 4.126 3.69e-05 ***
## Corners 0.073989 0.014765 5.011 5.41e-07 ***
## PassSuccPerc -0.071209 0.808122 -0.088 0.92978
## AerialsWon 0.001282 0.005495 0.233 0.81548
## Fouls -0.005891 0.010765 -0.547 0.58418
## YellowCards -0.094950 0.029946 -3.171 0.00152 **
## RedCards 0.061361 0.182208 0.337 0.73630
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5162.6 on 3723 degrees of freedom
## Residual deviance: 4877.9 on 3711 degrees of freedom
## AIC: 4903.9
##
## Number of Fisher Scoring iterations: 4
Many of the variables in the model are statistically significant.
The percentage of successful passes, the number of aerial duels won, the number of fouls committed, and receiving a red card are not statistically significant for predicting whether the team was a host or a guest.
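To interpret the logit coefficients on the odds scale, they can be exponentiated (a standard transformation, shown here as an aside):
# Odds ratios: e.g. one additional shot off target multiplies the odds
# of being the home team by exp(0.1026), i.e. about 1.11
exp(coef(matches.logit))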
Let us look at the confusion matrices:
pred.logit.train <- predict(matches.logit,
matches.train,
type = "response")
logit.train.confMat = confusionMatrix(data = as.factor(ifelse(pred.logit.train > 0.5, 1, 0)), # predictions
# actual values
reference = as.factor(matches.train$home),
# definitions of the "success" label
positive = "1")
logit.train.confMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1204 804
## 1 658 1058
##
## Accuracy : 0.6074
## 95% CI : (0.5915, 0.6231)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2148
##
## Mcnemar's Test P-Value : 0.0001493
##
## Sensitivity : 0.5682
## Specificity : 0.6466
## Pos Pred Value : 0.6166
## Neg Pred Value : 0.5996
## Prevalence : 0.5000
## Detection Rate : 0.2841
## Detection Prevalence : 0.4608
## Balanced Accuracy : 0.6074
##
## 'Positive' Class : 1
##
pred.logit.test <- predict(matches.logit,
matches.test,
type = "response")
confusionMatrix.logit.test = confusionMatrix(data = as.factor(ifelse(pred.logit.test > 0.5, 1, 0)), # predictions
# actual values
reference = as.factor(matches.test$home),
# definitions of the "success" label
positive = "1")
confusionMatrix.logit.test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 519 353
## 1 279 445
##
## Accuracy : 0.604
## 95% CI : (0.5795, 0.6281)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.208
##
## Mcnemar's Test P-Value : 0.003687
##
## Sensitivity : 0.5576
## Specificity : 0.6504
## Pos Pred Value : 0.6146
## Neg Pred Value : 0.5952
## Prevalence : 0.5000
## Detection Rate : 0.2788
## Detection Prevalence : 0.4536
## Balanced Accuracy : 0.6040
##
## 'Positive' Class : 1
##
Accuracy for both the training and the testing sample is around 0.6.
ROC curves with the corresponding Gini coefficients:
library(pROC)
pred.train.logit = predict(matches.logit, matches.train, type = "response")
ROC.logit.train = roc((matches.train$home == 1),
pred.train.logit)
pred.test.logit = predict(matches.logit, matches.test, type = "response")
ROC.logit.test = roc((matches.test$home == 1),
pred.test.logit)
list(
ROC.logit.train = ROC.logit.train,
ROC.logit.test = ROC.logit.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
round(100*(2 * auc(ROC.logit.train) - 1), 1), "%, ",
"Gini TEST: ",
round(100*(2 * auc(ROC.logit.test) - 1), 1), "%, ")) +
theme_bw() + coord_fixed() +
# scale_color_brewer(palette = "Paired") +
scale_color_manual(values = RColorBrewer::brewer.pal(n = 4,
name = "Paired")[c(1, 3, 5)])
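For reference, the Gini coefficient reported in the plot subtitle is derived from the AUC as Gini = 2 * AUC - 1 and can also be computed directly:
# Gini coefficient of the logit model on the test sample
2 * pROC::auc(ROC.logit.test) - 1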
The model is definitely not overfitted – the Gini coefficient values are almost the same for both samples.
The second method used is classification trees.
library(rpart)
library(rpart.plot)
matches.tree <-
rpart(model1.formula, # model formula
data = matches.train, # data
method = "class") # type of the tree: classification
library(rattle)
matches.tree
## n= 3724
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3724 1862 0 (0.5000000 0.5000000)
## 2) Corners< 6.5 2572 1149 0 (0.5532659 0.4467341)
## 4) ShotsOffTarget< 4.5 1456 575 0 (0.6050824 0.3949176) *
## 5) ShotsOffTarget>=4.5 1116 542 1 (0.4856631 0.5143369)
## 10) ShotsOnTarget< 5.5 834 395 0 (0.5263789 0.4736211)
## 20) BlockedShots< 3.5 525 227 0 (0.5676190 0.4323810) *
## 21) BlockedShots>=3.5 309 141 1 (0.4563107 0.5436893) *
## 11) ShotsOnTarget>=5.5 282 103 1 (0.3652482 0.6347518) *
## 3) Corners>=6.5 1152 439 1 (0.3810764 0.6189236) *
# rpart.plot(matches.tree)
fancyRpartPlot(matches.tree)
We have 5 terminal nodes.
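The number of terminal nodes can also be checked programmatically (a quick sketch; terminal nodes are marked as "<leaf>" in the rpart frame):
# Count the terminal nodes of the first tree
sum(matches.tree$frame$var == "<leaf>")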
summary(matches.tree)
## Call:
## rpart(formula = model1.formula, data = matches.train, method = "class")
## n= 3724
##
## CP nsplit rel error xerror xstd
## 1 0.14715360 0 1.0000000 1.0477981 0.01636811
## 2 0.02040816 1 0.8528464 0.8689581 0.01624553
## 3 0.01450054 3 0.8120301 0.8453276 0.01618964
## 4 0.01000000 4 0.7975295 0.8367347 0.01616696
##
## Variable importance
## Corners ShotsOffTarget BlockedShots ShotsOnTarget BallPos
## 45 20 12 12 7
## PassSuccPerc GoalsFT GoalsHT
## 2 2 1
##
## Node number 1: 3724 observations, complexity param=0.1471536
## predicted class=0 expected loss=0.5 P(node) =1
## class counts: 1862 1862
## probabilities: 0.500 0.500
## left son=2 (2572 obs) right son=3 (1152 obs)
## Primary splits:
## Corners < 6.5 to the left, improve=47.17994, (0 missing)
## ShotsOffTarget < 4.5 to the left, improve=40.65040, (0 missing)
## ShotsOnTarget < 4.5 to the left, improve=38.49781, (0 missing)
## BlockedShots < 4.5 to the left, improve=33.99681, (0 missing)
## BallPos < 0.395 to the left, improve=23.75711, (0 missing)
## Surrogate splits:
## BlockedShots < 5.5 to the left, agree=0.733, adj=0.138, (0 split)
## BallPos < 0.645 to the left, agree=0.726, adj=0.115, (0 split)
## ShotsOffTarget < 9.5 to the left, agree=0.710, adj=0.062, (0 split)
## ShotsOnTarget < 8.5 to the left, agree=0.697, adj=0.021, (0 split)
## PassSuccPerc < 0.875 to the left, agree=0.693, adj=0.008, (0 split)
##
## Node number 2: 2572 observations, complexity param=0.02040816
## predicted class=0 expected loss=0.4467341 P(node) =0.6906552
## class counts: 1423 1149
## probabilities: 0.553 0.447
## left son=4 (1456 obs) right son=5 (1116 obs)
## Primary splits:
## ShotsOffTarget < 4.5 to the left, improve=18.01913, (0 missing)
## ShotsOnTarget < 4.5 to the left, improve=17.03385, (0 missing)
## GoalsFT < 1.5 to the left, improve=14.84674, (0 missing)
## BlockedShots < 3.5 to the left, improve=14.27694, (0 missing)
## GoalsHT < 0.5 to the left, improve=10.63054, (0 missing)
## Surrogate splits:
## BallPos < 0.515 to the left, agree=0.617, adj=0.117, (0 split)
## PassSuccPerc < 0.825 to the left, agree=0.600, adj=0.078, (0 split)
## BlockedShots < 4.5 to the left, agree=0.593, adj=0.063, (0 split)
## Corners < 5.5 to the left, agree=0.584, adj=0.040, (0 split)
## ShotsOnTarget < 8.5 to the left, agree=0.576, adj=0.023, (0 split)
##
## Node number 3: 1152 observations
## predicted class=1 expected loss=0.3810764 P(node) =0.3093448
## class counts: 439 713
## probabilities: 0.381 0.619
##
## Node number 4: 1456 observations
## predicted class=0 expected loss=0.3949176 P(node) =0.3909774
## class counts: 881 575
## probabilities: 0.605 0.395
##
## Node number 5: 1116 observations, complexity param=0.02040816
## predicted class=1 expected loss=0.4856631 P(node) =0.2996778
## class counts: 542 574
## probabilities: 0.486 0.514
## left son=10 (834 obs) right son=11 (282 obs)
## Primary splits:
## ShotsOnTarget < 5.5 to the left, improve=10.943020, (0 missing)
## BlockedShots < 4.5 to the left, improve= 8.037236, (0 missing)
## ShotsOffTarget < 7.5 to the left, improve= 5.699053, (0 missing)
## GoalsFT < 1.5 to the left, improve= 5.278801, (0 missing)
## Corners < 4.5 to the left, improve= 4.547056, (0 missing)
## Surrogate splits:
## GoalsFT < 2.5 to the left, agree=0.793, adj=0.181, (0 split)
## GoalsHT < 2.5 to the left, agree=0.762, adj=0.057, (0 split)
## PassSuccPerc < 0.895 to the left, agree=0.760, adj=0.050, (0 split)
## BallPos < 0.735 to the left, agree=0.750, adj=0.011, (0 split)
## ShotsOffTarget < 12.5 to the left, agree=0.749, adj=0.007, (0 split)
##
## Node number 10: 834 observations, complexity param=0.01450054
## predicted class=0 expected loss=0.4736211 P(node) =0.2239527
## class counts: 439 395
## probabilities: 0.526 0.474
## left son=20 (525 obs) right son=21 (309 obs)
## Primary splits:
## BlockedShots < 3.5 to the left, improve=4.819893, (0 missing)
## GoalsFT < 3.5 to the left, improve=4.188806, (0 missing)
## ShotsOffTarget < 10.5 to the left, improve=2.907622, (0 missing)
## BallPos < 0.285 to the left, improve=1.915287, (0 missing)
## Corners < 0.5 to the left, improve=1.557683, (0 missing)
## Surrogate splits:
## Corners < 4.5 to the left, agree=0.658, adj=0.078, (0 split)
## BallPos < 0.635 to the left, agree=0.652, adj=0.061, (0 split)
## ShotsOffTarget < 7.5 to the left, agree=0.645, adj=0.042, (0 split)
## PassSuccPerc < 0.895 to the left, agree=0.638, adj=0.023, (0 split)
## AerialsWon < 5.5 to the right, agree=0.632, adj=0.006, (0 split)
##
## Node number 11: 282 observations
## predicted class=1 expected loss=0.3652482 P(node) =0.07572503
## class counts: 103 179
## probabilities: 0.365 0.635
##
## Node number 20: 525 observations
## predicted class=0 expected loss=0.432381 P(node) =0.1409774
## class counts: 298 227
## probabilities: 0.568 0.432
##
## Node number 21: 309 observations
## predicted class=1 expected loss=0.4563107 P(node) =0.0829753
## class counts: 141 168
## probabilities: 0.456 0.544
pred.tree <- predict(matches.tree,
matches.train,
type = "class")
confusionMatrix(data = pred.tree, # predictions
# actual values
reference = as.factor(matches.train$home),
# definitions of the "success" label
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1179 802
## 1 683 1060
##
## Accuracy : 0.6012
## 95% CI : (0.5853, 0.617)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2025
##
## Mcnemar's Test P-Value : 0.002198
##
## Sensitivity : 0.5693
## Specificity : 0.6332
## Pos Pred Value : 0.6081
## Neg Pred Value : 0.5952
## Prevalence : 0.5000
## Detection Rate : 0.2846
## Detection Prevalence : 0.4680
## Balanced Accuracy : 0.6012
##
## 'Positive' Class : 1
##
The accuracy of this tree equals 0.6 – the same as for the logistic regression.
Let us try a bigger tree:
matches.tree.big <-
rpart(model1.formula,
data = matches.train,
method = "class",
minsplit = 500, # ~ 2% of the training set
minbucket = 250, # ~ 1% of the training set
maxdepth = 30, # default
cp = -1)
matches.tree.big
## n= 3724
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3724 1862 0 (0.5000000 0.5000000)
## 2) Corners< 6.5 2572 1149 0 (0.5532659 0.4467341)
## 4) ShotsOffTarget< 4.5 1456 575 0 (0.6050824 0.3949176)
## 8) GoalsHT< 0.5 774 264 0 (0.6589147 0.3410853)
## 16) Fouls>=9.5 485 151 0 (0.6886598 0.3113402) *
## 17) Fouls< 9.5 289 113 0 (0.6089965 0.3910035) *
## 9) GoalsHT>=0.5 682 311 0 (0.5439883 0.4560117)
## 18) AerialsWon>=15.5 386 160 0 (0.5854922 0.4145078) *
## 19) AerialsWon< 15.5 296 145 1 (0.4898649 0.5101351) *
## 5) ShotsOffTarget>=4.5 1116 542 1 (0.4856631 0.5143369)
## 10) ShotsOnTarget< 5.5 834 395 0 (0.5263789 0.4736211)
## 20) BlockedShots< 3.5 525 227 0 (0.5676190 0.4323810)
## 40) AerialsWon< 18.5 260 107 0 (0.5884615 0.4115385) *
## 41) AerialsWon>=18.5 265 120 0 (0.5471698 0.4528302) *
## 21) BlockedShots>=3.5 309 141 1 (0.4563107 0.5436893) *
## 11) ShotsOnTarget>=5.5 282 103 1 (0.3652482 0.6347518) *
## 3) Corners>=6.5 1152 439 1 (0.3810764 0.6189236)
## 6) ShotsOffTarget< 7.5 842 354 1 (0.4204276 0.5795724)
## 12) BallPos< 0.605 553 255 1 (0.4611212 0.5388788)
## 24) ShotsOffTarget< 4.5 255 126 1 (0.4941176 0.5058824) *
## 25) ShotsOffTarget>=4.5 298 129 1 (0.4328859 0.5671141) *
## 13) BallPos>=0.605 289 99 1 (0.3425606 0.6574394) *
## 7) ShotsOffTarget>=7.5 310 85 1 (0.2741935 0.7258065) *
fancyRpartPlot(matches.tree.big)
pred.tree.big <- predict(matches.tree.big,
matches.train,
type = "class")
confusionMatrix(data = pred.tree.big, # predictions
# actual values
reference = as.factor(matches.train$home),
# definitions of the "success" label
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1034 651
## 1 828 1211
##
## Accuracy : 0.6028
## 95% CI : (0.5869, 0.6186)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2057
##
## Mcnemar's Test P-Value : 4.729e-06
##
## Sensitivity : 0.6504
## Specificity : 0.5553
## Pos Pred Value : 0.5939
## Neg Pred Value : 0.6136
## Prevalence : 0.5000
## Detection Rate : 0.3252
## Detection Prevalence : 0.5475
## Balanced Accuracy : 0.6028
##
## 'Positive' Class : 1
##
Letting the tree grow bigger did not yield much better results.
Now the bigger tree will be pruned.
opt <- which.min(matches.tree.big$cptable[, "xerror"])
cp <- matches.tree.big$cptable[opt, "CP"]
matches.tree.pruned <-
prune(matches.tree.big, cp = cp)
matches.tree.pruned
## n= 3724
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3724 1862 0 (0.5000000 0.5000000)
## 2) Corners< 6.5 2572 1149 0 (0.5532659 0.4467341)
## 4) ShotsOffTarget< 4.5 1456 575 0 (0.6050824 0.3949176)
## 8) GoalsHT< 0.5 774 264 0 (0.6589147 0.3410853) *
## 9) GoalsHT>=0.5 682 311 0 (0.5439883 0.4560117)
## 18) AerialsWon>=15.5 386 160 0 (0.5854922 0.4145078) *
## 19) AerialsWon< 15.5 296 145 1 (0.4898649 0.5101351) *
## 5) ShotsOffTarget>=4.5 1116 542 1 (0.4856631 0.5143369)
## 10) ShotsOnTarget< 5.5 834 395 0 (0.5263789 0.4736211)
## 20) BlockedShots< 3.5 525 227 0 (0.5676190 0.4323810) *
## 21) BlockedShots>=3.5 309 141 1 (0.4563107 0.5436893) *
## 11) ShotsOnTarget>=5.5 282 103 1 (0.3652482 0.6347518) *
## 3) Corners>=6.5 1152 439 1 (0.3810764 0.6189236) *
fancyRpartPlot(matches.tree.pruned)
The pruned tree is close to the first tree, but it retains two additional splits – on GoalsHT and AerialsWon – and therefore has 7 terminal nodes instead of 5.
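A quick structural comparison supports this (a sketch; each row of an rpart frame corresponds to one node):
# Total number of nodes in the first and the pruned tree
c(first = nrow(matches.tree$frame),
  pruned = nrow(matches.tree.pruned$frame))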
pred.tree.train = predict(matches.tree, matches.train)
pred.big.tree.train <- predict(matches.tree.big, matches.train)
pred.tree.pruned.train <- predict(matches.tree.pruned, matches.train)
ROC.tree.train = roc((matches.train$home == 1),
pred.tree.train[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
ROC.tree.big.train <- roc((matches.train$home == 1),
pred.big.tree.train[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
ROC.tree.pruned.train <- roc((matches.train$home == 1),
pred.tree.pruned.train[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
list(
ROC.tree.train = ROC.tree.train,
ROC.tree.big.train = ROC.tree.big.train,
ROC.tree.pruned.train = ROC.tree.pruned.train
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"tree = ",
round(100*(2 * auc(ROC.tree.train) - 1), 1), "%, ",
"big = ",
round(100*(2 * auc(ROC.tree.big.train) - 1), 1), "%, ",
"pruned = ",
round(100*(2 * auc(ROC.tree.pruned.train) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
# scale_color_brewer(palette = "Paired") +
scale_color_manual(values = RColorBrewer::brewer.pal(n = 4,
name = "Paired")[c(1, 3, 5)])
The pruned big tree gives better results than the first tree.
Now for the testing set.
pred.tree.test = predict(matches.tree, matches.test)
pred.tree.big.test <- predict(matches.tree.big, matches.test)
pred.tree.pruned.test <- predict(matches.tree.pruned, matches.test)
ROC.tree.test = roc((matches.test$home == 1),
pred.tree.test[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
ROC.big.tree.test <- roc((matches.test$home == 1),
pred.tree.big.test[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
ROC.tree.pruned.test <- roc((matches.test$home == 1),
pred.tree.pruned.test[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
list(
ROC.tree.train = ROC.tree.train,
ROC.tree.big.train = ROC.tree.big.train,
ROC.tree.pruned.train = ROC.tree.pruned.train,
ROC.tree.test = ROC.tree.test,
ROC.big.tree.test = ROC.big.tree.test,
ROC.tree.pruned.test = ROC.tree.pruned.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"tree = ",
round(100*(2 * auc(ROC.tree.train) - 1), 1), "%, ",
"big = ",
round(100*(2 * auc(ROC.tree.big.train) - 1), 1), "%, ",
"pruned = ",
round(100*(2 * auc(ROC.tree.pruned.train) - 1), 1), "% ",
"Gini TEST: ",
"tree = ",
round(100*(2 * auc(ROC.tree.test) - 1), 1), "%, ",
"big = ",
round(100*(2 * auc(ROC.big.tree.test) - 1), 1), "%, ",
"pruned = ",
round(100*(2 * auc(ROC.tree.pruned.test) - 1), 1), "% "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
# scale_color_manual(values = RColorBrewer::brewer.pal(n = 4, name = "Paired")[c(1, 3, 5)])
The pruned big tree also works best for the test data set.
confusionMatrix.tree.pruned.test = confusionMatrix(data = predict(matches.tree.pruned,
matches.test,
type = "class"), # predictions
# actual values
reference = as.factor(matches.test$home),
# definitions of the "success" label
positive = "1")
Let us check which variables are the most important in the best tree.
tree.pruned.importance <- matches.tree.pruned$variable.importance
par(mar = c(5.1, 6.1, 4.1, 2.1))
barplot(rev(tree.pruned.importance), # vector of variable importances
col = "blue", # colors
main = "Importance of variables in model matches.tree.pruned",
horiz = T, # horizontal type of plot
las = 1, # labels always horizontal
cex.names = 0.6)
We see that the most important variables are the number of corners, the numbers of shots off and on target, and the number of blocked shots. Comparing these results with the logistic regression output, we see that the ordering of the variables is the same.
The next step was to develop a random forest.
## Comment from classes
# For the classification analysis (nominal target variable) we have to transform the dependent variable to factor. Otherwise, if we leave the target variable as numeric - the regression analysis will be performed.
str(matches.train) # Now the 'home' variable is numeric
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : num [1:3724] 1 1 0 1 0 1 0 1 0 1 ...
matches.train$home = as.factor(matches.train$home)
matches.test$home = as.factor(matches.test$home)
str(matches.train)
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 1 2 ...
levels(matches.train$home) = c('no', 'yes')
levels(matches.test$home) = c('no', 'yes')
str(matches.train) # Change from '0/1' to 'no/yes'
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "no","yes": 2 2 1 2 1 2 1 2 1 2 ...
We build a random forest with default values.
library(randomForest)
set.seed(123456789)
matches.rf <- randomForest(model1.formula,
data = matches.train)
print(matches.rf)
##
## Call:
## randomForest(formula = model1.formula, data = matches.train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 41.49%
## Confusion matrix:
## no yes class.error
## no 1135 727 0.3904404
## yes 818 1044 0.4393126
plot(matches.rf)
The OOB error rate is 41.49%. Let us try a bigger forest.
matches.rf2 <-
randomForest(model1.formula,
data = matches.train,
ntree = 500,
sampsize = nrow(matches.train),
mtry = 8, # number of predictors sampled for splitting
nodesize = 200, # minimum number of obs in the terminal nodes
importance = TRUE) # we also generate predictors importance measures
print(matches.rf2)
##
## Call:
## randomForest(formula = model1.formula, data = matches.train, ntree = 500, sampsize = nrow(matches.train), mtry = 8, nodesize = 200, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 39.77%
## Confusion matrix:
## no yes class.error
## no 1185 677 0.3635875
## yes 804 1058 0.4317938
plot(matches.rf2)
The OOB error rate is slightly lower – 39.77%.
Now we optimize the hyperparameters, starting with mtry:
parameters_rf <- expand.grid(mtry = 2:12)
ctrl_oob <- trainControl(method = "oob", classProbs = TRUE)
library(here)
## here() starts at C:/Users/bpop/OneDrive/R/ML2/proj
set.seed(123456789)
matches.rf3 <-
train(model1.formula,
data = matches.train,
method = "rf",
ntree = 100,
nodesize = 100,
tuneGrid = parameters_rf,
trControl = ctrl_oob,
importance = TRUE)
# saving the object to the external file
# saveRDS(object = matches.rf3,
# file = here("matches.rf3.rds"))
# matches.rf3 <- readRDS(here("matches.rf3.rds"))
matches.rf3
## Random Forest
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.5923738 0.1847476
## 3 0.6033835 0.2067669
## 4 0.5950591 0.1901182
## 5 0.5945220 0.1890440
## 6 0.6068743 0.2137487
## 7 0.5996241 0.1992481
## 8 0.5982814 0.1965628
## 9 0.5931794 0.1863588
## 10 0.5950591 0.1901182
## 11 0.5972073 0.1944146
## 12 0.5996241 0.1992481
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
The highest accuracy is obtained for mtry = 6.
plot(matches.rf3$results$mtry, matches.rf3$results$Accuracy, type = "b")
Now we use the ‘ranger’ package for random forest estimation and hyperparameter optimization.
parameters_ranger <-
expand.grid(mtry = 2:12,
# split rule
splitrule = "gini",
# minimum size of the terminal node
min.node.size = c(100, 250, 500))
ctrl_cv5 <- trainControl(method = "cv",
number = 5,
classProbs = T)
set.seed(123456789)
matches.rf3a <-
train(model1.formula,
data = matches.train,
method = "ranger",
num.trees = 100, # default = 500
# numbers of processor cores to use in computations
num.threads = 3,
# impurity measure
importance = "impurity",
# parameters
tuneGrid = parameters_ranger,
trControl = ctrl_cv5)
# saving the object to the external file
# saveRDS(matches.rf3a, here("matches.rf3a.rds"))
matches.rf3a
## Random Forest
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2980, 2978, 2979, 2979, 2980
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 2 100 0.5964123 0.1928502
## 2 250 0.5972155 0.1944611
## 2 500 0.5896959 0.1794188
## 3 100 0.5953414 0.1907253
## 3 250 0.6001729 0.2003755
## 3 500 0.5961446 0.1923209
## 4 100 0.5945407 0.1891249
## 4 250 0.6017883 0.2036058
## 4 500 0.5977543 0.1955444
## 5 100 0.5964113 0.1928565
## 5 250 0.5958754 0.1917920
## 5 500 0.5974862 0.1950079
## 6 100 0.5982980 0.1966212
## 6 250 0.5969511 0.1939298
## 6 500 0.5990933 0.1982128
## 7 100 0.5953356 0.1907026
## 7 250 0.5942661 0.1885730
## 7 500 0.5977553 0.1955500
## 8 100 0.5974887 0.1950032
## 8 250 0.5993650 0.1987721
## 8 500 0.5953374 0.1907064
## 9 100 0.5931927 0.1864143
## 9 250 0.5956113 0.1912601
## 9 500 0.5972188 0.1944705
## 10 100 0.5910461 0.1821271
## 10 250 0.5940020 0.1880441
## 10 500 0.5974858 0.1950077
## 11 100 0.5956106 0.1912455
## 11 250 0.5958776 0.1917976
## 11 500 0.5939998 0.1880398
## 12 100 0.5934615 0.1869561
## 12 250 0.6001751 0.2003917
## 12 500 0.5980249 0.1960847
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
## and min.node.size = 250.
plot(matches.rf3a)
With the ‘ranger’ package the best results are obtained for mtry = 4 and min.node.size = 250.
pred.rf.train <- predict(matches.rf,
matches.train,
type = "prob")[, "yes"]
ROC.rf.train <- roc(as.numeric(matches.train$home == "yes"),
pred.rf.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf.test <- predict(matches.rf,
matches.test,
type = "prob")[, "yes"]
ROC.rf.test <- roc(as.numeric(matches.test$home == "yes"),
pred.rf.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf2.train <- predict(matches.rf2,
matches.train,
type = "prob")[, "yes"]
ROC.rf2.train <- roc(as.numeric(matches.train$home == "yes"),
pred.rf2.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf2.test <- predict(matches.rf2,
matches.test,
type = "prob")[, "yes"]
ROC.rf2.test <- roc(as.numeric(matches.test$home == "yes"),
pred.rf2.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf3.train <- predict(matches.rf3,
matches.train,
type = "prob")[, "yes"]
ROC.rf3.train <- roc(as.numeric(matches.train$home == "yes"),
pred.rf3.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf3.test <- predict(matches.rf3,
matches.test,
type = "prob")[, "yes"]
ROC.rf3.test <- roc(as.numeric(matches.test$home == "yes"),
pred.rf3.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf3a.train <- predict(matches.rf3a,
matches.train,
type = "prob")[, "yes"]
ROC.rf3a.train <- roc(as.numeric(matches.train$home == "yes"),
pred.rf3a.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf3a.test <- predict(matches.rf3a,
matches.test,
type = "prob")[, "yes"]
ROC.rf3a.test <- roc(as.numeric(matches.test$home == "yes"),
pred.rf3a.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
list(
ROC.rf.train = ROC.rf.train,
ROC.rf.test = ROC.rf.test,
ROC.rf2.train = ROC.rf2.train,
ROC.rf2.test = ROC.rf2.test,
ROC.rf3.train = ROC.rf3.train,
ROC.rf3.test = ROC.rf3.test,
ROC.rf3a.train = ROC.rf3a.train,
ROC.rf3a.test = ROC.rf3a.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"rf = ",
round(100 * (2 * auc(ROC.rf.train) - 1), 1), "%, ",
"rf2 = ",
round(100 * (2 * auc(ROC.rf2.train) - 1), 1), "%, ",
"rf3 = ",
round(100 * (2 * auc(ROC.rf3.train) - 1), 1), "%, ",
"rf3a = ",
round(100 * (2 * auc(ROC.rf3a.train) - 1), 1), "%, ",
"\nGini TEST: ",
"rf = ",
round(100 * (2 * auc(ROC.rf.test) - 1), 1), "%, ",
"rf2 = ",
round(100 * (2 * auc(ROC.rf2.test) - 1), 1), "%, ",
"rf3 = ",
round(100 * (2 * auc(ROC.rf3.test) - 1), 1), "%, ",
"rf3a = ",
round(100 * (2 * auc(ROC.rf3a.test) - 1), 1), "% "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
The model built with the default values (rf) is overfitted. The best results for the testing sample are given by model rf2, with parameters ntree = 500, mtry = 8 and nodesize = 200:
pred.rf2.test <- predict(matches.rf2,
matches.test,
type = "response")
confusionMatrix.rf2.test = confusionMatrix(data = pred.rf2.test, # predictions
# actual values
reference = as.factor(matches.test$home),
# definitions of the "success" label
positive = "yes")
varImpPlot(matches.rf2,
sort = TRUE,
main = "Importance of predictors",
n.var = 12,
type = 1) # mean decrease in accuracy
For this model the most important variables are almost the same as for the previously developed models.
The next utilized ML method is Extreme Gradient Boosting (XGBoost).
parameters_xgb <- expand.grid(nrounds = seq(20, 80, 10),
max_depth = c(8),
eta = c(0.25),
gamma = 1,
colsample_bytree = c(0.2),
min_child_weight = c(50),
subsample = 0.8)
ctrl_cv3 <- trainControl(method = "cv",
number = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(123456789)
matches.xgb <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb)
## [20:12:23] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## (the same warning is repeated for every resampling fit)
matches.xgb
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results across tuning parameters:
##
## nrounds ROC Sens Spec
## 20 0.6270362 0.6074039 0.5762662
## 30 0.6324540 0.6122409 0.5730430
## 40 0.6332113 0.6116981 0.5746515
## 50 0.6338745 0.6095579 0.5708933
## 60 0.6329422 0.6031046 0.5816330
## 70 0.6312691 0.6009549 0.5730334
## 80 0.6311208 0.5998805 0.5810971
##
## Tuning parameter 'max_depth' was held constant at a value of 8
## Tuning
## parameter 'min_child_weight' was held constant at a value of 50
##
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 8, eta
## = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 50 and
## subsample = 0.8.
pred.xgb.train = predict(matches.xgb, matches.train, type = 'raw')
confusionMatrix(data = pred.xgb.train, # predictions
# actual values
reference = as.factor(matches.train$home),
# definitions of the "success" label
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1209 724
## yes 653 1138
##
## Accuracy : 0.6302
## 95% CI : (0.6145, 0.6458)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.2605
##
## Mcnemar's Test P-Value : 0.05924
##
## Sensitivity : 0.6112
## Specificity : 0.6493
## Pos Pred Value : 0.6354
## Neg Pred Value : 0.6255
## Prevalence : 0.5000
## Detection Rate : 0.3056
## Detection Prevalence : 0.4809
## Balanced Accuracy : 0.6302
##
## 'Positive' Class : yes
##
The accuracy is 0.63 – slightly better than for the previous models.
Next, we tune max_depth and min_child_weight:
parameters_xgb2 <- expand.grid(nrounds = 50,
max_depth = seq(5, 15, 2),
eta = c(0.25),
gamma = 1,
colsample_bytree = c(0.2),
min_child_weight = seq(50, 500, 50),
subsample = 0.8)
set.seed(123456789)
matches.xgb2 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb2)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb2
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results across tuning parameters:
##
## max_depth min_child_weight ROC Sens Spec
## 5 50 0.6378535 0.6084723 0.5800184
## 5 100 0.6389321 0.6052586 0.5821732
## 5 150 0.6366388 0.6127725 0.5902317
## 5 200 0.6363816 0.6047106 0.5923788
## 5 250 0.5341034 0.5402802 0.5279267
## 5 300 0.5000000 0.3333333 0.6666667
## 5 350 0.5000000 1.0000000 0.0000000
## 5 400 0.5000000 0.3333333 0.6666667
## 5 450 0.5000000 0.6666667 0.3333333
## 5 500 0.5000000 0.3333333 0.6666667
## 7 50 0.6362183 0.6074065 0.5837809
## 7 100 0.6412749 0.6025722 0.6068732
## 7 150 0.6373810 0.6127682 0.5870067
## 7 200 0.6341959 0.6073970 0.5843056
## 7 250 0.5343455 0.5928783 0.4586982
## 7 300 0.5000000 0.6666667 0.3333333
## 7 350 0.5000000 0.0000000 1.0000000
## 7 400 0.5000000 0.3333333 0.6666667
## 7 450 0.5000000 0.3333333 0.6666667
## 7 500 0.5000000 0.3333333 0.6666667
## 9 50 0.6317361 0.6020241 0.5735797
## 9 100 0.6415965 0.5982823 0.6084861
## 9 150 0.6365760 0.6090065 0.5810962
## 9 200 0.6349720 0.5993515 0.5902317
## 9 250 0.5247139 0.6873548 0.3620730
## 9 300 0.5000000 1.0000000 0.0000000
## 9 350 0.5000000 0.6666667 0.3333333
## 9 400 0.5000000 0.6666667 0.3333333
## 9 450 0.5000000 0.3333333 0.6666667
## 9 500 0.5000000 0.3333333 0.6666667
## 11 50 0.6326365 0.5993429 0.5848675
## 11 100 0.6386517 0.5988148 0.5966694
## 11 150 0.6345797 0.6111518 0.5896975
## 11 200 0.6329684 0.5982745 0.5886283
## 11 250 0.5367916 0.5434947 0.5300885
## 11 300 0.5000000 0.6666667 0.3333333
## 11 350 0.5000000 0.3333333 0.6666667
## 11 400 0.5000000 0.3333333 0.6666667
## 11 450 0.5000000 0.3333333 0.6666667
## 11 500 0.5000000 0.3333333 0.6666667
## 13 50 0.6366088 0.6009532 0.5821706
## 13 100 0.6416538 0.5982780 0.5923762
## 13 150 0.6360044 0.6090021 0.5870093
## 13 200 0.6340803 0.5988113 0.5923788
## 13 250 0.5423311 0.5413537 0.5327654
## 13 300 0.5000000 0.3333333 0.6666667
## 13 350 0.5000000 0.6666667 0.3333333
## 13 400 0.5000000 0.0000000 1.0000000
## 13 450 0.5000000 0.6666667 0.3333333
## 13 500 0.5000000 0.6666667 0.3333333
## 15 50 0.6359817 0.6090134 0.5816373
## 15 100 0.6391035 0.6025670 0.5907710
## 15 150 0.6358928 0.6165290 0.5768047
## 15 200 0.6354595 0.5971967 0.5848501
## 15 250 0.5367916 0.5434947 0.5300885
## 15 300 0.5000000 1.0000000 0.0000000
## 15 350 0.5000000 0.6666667 0.3333333
## 15 400 0.5000000 0.0000000 1.0000000
## 15 450 0.5000000 0.3333333 0.6666667
## 15 500 0.5000000 0.3333333 0.6666667
##
## Tuning parameter 'nrounds' was held constant at a value of 50
## Tuning
## 'colsample_bytree' was held constant at a value of 0.2
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 13, eta
## = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 100
## and subsample = 0.8.
The final values used for the model were nrounds = 50, max_depth = 13, eta = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 100 and subsample = 0.8.
Now colsample_bytree:
parameters_xgb3 <- expand.grid(nrounds = 50,
max_depth = 13,
eta = c(0.25),
gamma = 1,
colsample_bytree = seq(0.1, 0.8, 0.1),
min_child_weight = 100,
subsample = 0.8)
set.seed(123456789)
matches.xgb3 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb3)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb3
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results across tuning parameters:
##
## colsample_bytree ROC Sens Spec
## 0.1 0.6390969 0.6170762 0.5757329
## 0.2 0.6397274 0.6090264 0.5929086
## 0.3 0.6400269 0.6004190 0.5896958
## 0.4 0.6392504 0.6014934 0.5998944
## 0.5 0.6399506 0.6057945 0.5929164
## 0.6 0.6391113 0.5993438 0.6020363
## 0.7 0.6368069 0.5934419 0.5891529
## 0.8 0.6368582 0.5961266 0.5945310
##
## Tuning parameter 'nrounds' was held constant at a value of 50
## Tuning
## 'min_child_weight' was held constant at a value of 100
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 13, eta
## = 0.25, gamma = 1, colsample_bytree = 0.3, min_child_weight = 100
## and subsample = 0.8.
The final values used for the model were nrounds = 50, max_depth = 13, eta = 0.25, gamma = 1, colsample_bytree = 0.3, min_child_weight = 100 and subsample = 0.8.
Now subsample:
parameters_xgb4 <- expand.grid(nrounds = 50,
max_depth = 13,
eta = c(0.25),
gamma = 1,
colsample_bytree = 0.3,
min_child_weight = 100,
subsample = c(0.6, 0.7, 0.75, 0.8, 0.85, 0.9))
set.seed(123456789)
matches.xgb4 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb4)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb4
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results across tuning parameters:
##
## subsample ROC Sens Spec
## 0.60 0.6377160 0.6111518 0.5810910
## 0.70 0.6356003 0.5902170 0.5988105
## 0.75 0.6409676 0.6057936 0.5886205
## 0.80 0.6409486 0.5880647 0.6084835
## 0.85 0.6399447 0.6117085 0.5939899
## 0.90 0.6408472 0.6084870 0.5929077
##
## Tuning parameter 'nrounds' was held constant at a value of 50
## Tuning
## 'colsample_bytree' was held constant at a value of 0.3
## Tuning
## parameter 'min_child_weight' was held constant at a value of 100
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 13, eta
## = 0.25, gamma = 1, colsample_bytree = 0.3, min_child_weight = 100
## and subsample = 0.75.
The final values used for the model were nrounds = 50, max_depth = 13, eta = 0.25, gamma = 1, colsample_bytree = 0.3, min_child_weight = 100 and subsample = 0.75.
Now we reduce the learning rate eta and use a higher number of rounds:
parameters_xgb5 <- expand.grid(nrounds = 100,
max_depth = 13,
eta = 0.12,
gamma = 1,
colsample_bytree = 0.3,
min_child_weight = 100,
subsample = 0.75)
set.seed(123456789)
matches.xgb5 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb5)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb5
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results:
##
## ROC Sens Spec
## 0.6416779 0.609551 0.5934558
##
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning
## parameter 'min_child_weight' was held constant at a value of 100
##
## Tuning parameter 'subsample' was held constant at a value of 0.75
The ROC value is slightly higher.
Once again – higher nrounds and lower eta:
parameters_xgb6 <- expand.grid(nrounds = 200,
max_depth = 13,
eta = 0.06,
gamma = 1,
colsample_bytree = 0.3,
min_child_weight = 100,
subsample = 0.75)
set.seed(123456789)
matches.xgb6 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb6)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb6
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results:
##
## ROC Sens Spec
## 0.64145 0.605256 0.5993619
##
## Tuning parameter 'nrounds' was held constant at a value of 200
## Tuning
## parameter 'min_child_weight' was held constant at a value of 100
##
## Tuning parameter 'subsample' was held constant at a value of 0.75
This time the ROC value is slightly lower.
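The models are compared with a helper function getAccuracyAndGini(), sourced from an external file that is not reproduced here. A minimal sketch consistent with its usage below (the implementation is assumed, inferred only from the arguments and the returned statistics) could be:
# Assumed sketch of the external helper getAccuracyAndGini()
getAccuracyAndGini <- function(model, data, target_variable, predicted_class) {
  pred_class <- predict(model, data, type = "raw") # predicted class labels
  pred_prob <- predict(model, data, type = "prob")[, predicted_class] # class probabilities
  actual <- as.factor(data[[target_variable]])
  cm <- caret::confusionMatrix(pred_class, actual, positive = predicted_class)
  roc <- pROC::roc(actual == predicted_class, pred_prob)
  c(Accuracy = unname(cm$overall["Accuracy"]),
    Sensitivity = unname(cm$byClass["Sensitivity"]),
    Specificity = unname(cm$byClass["Specificity"]),
    Gini = as.numeric(2 * pROC::auc(roc) - 1))
}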
source(here("getAccuracyAndGini.R"))
(models <- c("", "2":"6"))
## [1] "" "2" "3" "4" "5" "6"
sapply(paste0("matches.xgb", models),
function(x) getAccuracyAndGini(model = get(x),
data = matches.test,
target_variable = "home",
predicted_class = "yes")
)
## matches.xgb matches.xgb2 matches.xgb3 matches.xgb4 matches.xgb5
## Accuracy 0.5971178 0.5989975 0.6121554 0.5996241 0.6090226
## Sensitivity 0.5639098 0.5601504 0.5689223 0.5614035 0.5839599
## Specificity 0.6303258 0.6378446 0.6553885 0.6378446 0.6340852
## Gini 0.2632772 0.2724292 0.2663394 0.2714006 0.2723805
## matches.xgb6
## Accuracy 0.6058897
## Sensitivity 0.5814536
## Specificity 0.6303258
## Gini 0.2732144
For the test data the best results in terms of Gini coefficient are obtained with the last trained model – matches.xgb6.
ROC.xgb6.train <- pROC::roc(matches.train$home,
predict(matches.xgb6,
matches.train, type = "prob")[, "yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
ROC.xgb6.test <- pROC::roc(matches.test$home,
predict(matches.xgb6,
matches.test, type = "prob")[, "yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
list(
ROC.xgb6.train = ROC.xgb6.train,
ROC.xgb6.test = ROC.xgb6.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"xgb6 = ",
round(100 * (2 * auc(ROC.xgb6.train) - 1), 1), "%, ",
"Gini TEST: ",
"xgb6 = ",
round(100 * (2 * auc(ROC.xgb6.test) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
So far XGBoost gives the best Gini coefficient values of all the tested ML methods; however, it is still not better than the classical logistic regression benchmark.
A comparison of predicted and actual values can also be presented in terms of probability density functions. We see below that the predictions for ‘yes’ and ‘no’ – the answer to the question of being the match host – are not sharply separated. The ‘no’ distribution has most of its mass in the lower-probability region and the ‘yes’ distribution in the higher-probability region; however, large parts of the two distributions overlap.
pred.xgb6.test = predict(matches.xgb6,
matches.test, type = "raw")
confusionMatrix.xgb6.test = confusionMatrix(data = pred.xgb6.test, # predictions
# actual values
reference = as.factor(matches.test$home),
# definitions of the "success" label
positive = "yes")
tibble(
label = matches.train$home,
pred = predict(matches.xgb6, matches.train, reshape = TRUE, type = 'prob')[, 'yes']
) %>%
ggplot(aes(pred, fill = label)) +
geom_density(alpha = 0.4) +
labs(
title = "Predictions distribution vs. target variable",
subtitle = "train sample",
caption = "source: own calculations with the xgboost model"
)
The same case can be observed for the test sample:
tibble(
label = matches.test$home,
pred = predict(matches.xgb6, matches.test, reshape = TRUE, type = 'prob')[, 'yes']
) %>%
ggplot(aes(pred, fill = label)) +
geom_density(alpha = 0.4) +
labs(
title = "Predictions distribution vs. target variable",
subtitle = "test sample",
caption = "source: own calculations with the xgboost model"
)
The last utilized method is the neural network approach. Two implementations are considered: the neuralnet and keras packages.
Neural network algorithms require the data to be properly scaled. The data must also be transformed into numeric values rather than factors, as in the case of the dependent variable.
str(matches.train)
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "no","yes": 2 2 1 2 1 2 1 2 1 2 ...
matches.train.nn = matches.train
matches.test.nn = matches.test
str(matches.train.nn)
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "no","yes": 2 2 1 2 1 2 1 2 1 2 ...
matches.train.nn$home = as.numeric(matches.train.nn$home)-1
str(matches.test.nn)
## tibble [1,596 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:1596] 1 0 0 1 1 0 1 0 0 1 ...
## $ GoalsFT : num [1:1596] 2 1 0 1 1 1 3 0 0 3 ...
## $ BallPos : num [1:1596] 0.4 0.53 0.44 0.39 0.55 0.46 0.53 0.64 0.72 0.64 ...
## $ ShotsOffTarget: num [1:1596] 0 2 9 6 3 3 4 3 5 4 ...
## $ ShotsOnTarget : num [1:1596] 4 4 0 2 5 4 4 2 2 8 ...
## $ BlockedShots : num [1:1596] 1 4 3 1 2 2 1 2 9 6 ...
## $ Corners : num [1:1596] 0 5 3 4 2 6 6 6 11 9 ...
## $ PassSuccPerc : num [1:1596] 0.8 0.8 0.83 0.7 0.86 0.73 0.86 0.82 0.88 0.86 ...
## $ AerialsWon : num [1:1596] 10 12 14 27 19 27 7 15 13 15 ...
## $ Fouls : num [1:1596] 20 10 8 6 14 14 13 14 14 13 ...
## $ YellowCards : num [1:1596] 4 0 1 1 2 1 1 2 2 1 ...
## $ RedCards : num [1:1596] 0 1 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 1 2 1 ...
matches.test.nn$home = as.numeric(matches.test.nn$home) - 1 # "no" -> 0, "yes" -> 1
# rescale every column using the training-sample minima and maxima;
# the same training-based parameters are applied to the test sample
min <- apply(matches.train.nn, 2, min)
max <- apply(matches.train.nn, 2, max)
matches.train.nn <- scale(matches.train.nn, center = min, scale = max)
matches.test.nn <- scale(matches.test.nn, center = min, scale = max)
matches.train.nn = as.data.frame(matches.train.nn)
matches.test.nn = as.data.frame(matches.test.nn)
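A small technical remark: `scale(x, center = min, scale = max)` divides the centered columns by their maxima, i.e. it computes (x − min)/max rather than the textbook min–max transformation (x − min)/(max − min). Both put the inputs on comparable scales; for reference, a minimal sketch of the canonical version, with `df` standing for a hypothetical unscaled, all-numeric data frame:
# textbook min-max normalization maps each training column onto [0, 1]:
#   x' = (x - min) / (max - min)
# a reference sketch only; the results below use the (x - min) / max variant
mins   <- apply(df, 2, min)
ranges <- apply(df, 2, max) - mins
df.minmax <- as.data.frame(scale(df, center = mins, scale = ranges))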
The first network contains 1 hidden layer with 10 neurons.
library(neuralnet)
set.seed(12345678)
matches.nn = neuralnet(model1.formula,
data = as.data.frame(matches.train.nn),
hidden = c(10), # number of neurons in hidden layers
linear.output = FALSE, # T for regression, F for classification
learningrate.limit = NULL,
learningrate.factor = list(minus = 0.5, plus = 1.2),
algorithm = "rprop+",
threshold = 0.01,
stepmax = 200000,
lifesign = 'full',
rep = 1)
## hidden: 10 thresh: 0.01 rep: 1/1 steps: 1000 min thresh: 0.285513111171468
## ... (intermediate progress lines omitted; min thresh falls steadily) ...
## 47000 min thresh: 0.0103352768924121
## 47023 error: 384.08076 time: 1.56 mins
matches.nn$result.matrix
## [,1]
## error 3.840808e+02
## reached.threshold 9.729959e-03
## steps 4.702300e+04
## Intercept.to.1layhid1 6.458586e+01
## GoalsHT.to.1layhid1 4.991633e+01
## GoalsFT.to.1layhid1 -6.020379e+01
## BallPos.to.1layhid1 9.043190e+01
## ShotsOffTarget.to.1layhid1 1.184209e+01
## ShotsOnTarget.to.1layhid1 1.146165e+01
## BlockedShots.to.1layhid1 4.472901e+01
## Corners.to.1layhid1 -6.528187e+01
## PassSuccPerc.to.1layhid1 -1.293327e+02
## AerialsWon.to.1layhid1 -5.262872e+01
## Fouls.to.1layhid1 -9.146615e+01
## YellowCards.to.1layhid1 1.189128e+02
## RedCards.to.1layhid1 -1.655122e+01
## Intercept.to.1layhid2 -6.678545e+00
## GoalsHT.to.1layhid2 2.263172e+02
## GoalsFT.to.1layhid2 -7.845360e+02
## BallPos.to.1layhid2 -1.423712e+01
## ShotsOffTarget.to.1layhid2 6.132896e+01
## ShotsOnTarget.to.1layhid2 2.968429e+02
## BlockedShots.to.1layhid2 -5.476262e+02
## Corners.to.1layhid2 1.328246e+01
## PassSuccPerc.to.1layhid2 -3.304635e+02
## AerialsWon.to.1layhid2 -7.592418e+01
## Fouls.to.1layhid2 2.120929e+02
## YellowCards.to.1layhid2 -3.123988e+02
## RedCards.to.1layhid2 1.225317e+00
## Intercept.to.1layhid3 8.480925e+01
## GoalsHT.to.1layhid3 -9.710812e+02
## GoalsFT.to.1layhid3 6.696841e+00
## BallPos.to.1layhid3 -1.447113e+02
## ShotsOffTarget.to.1layhid3 -1.258127e+02
## ShotsOnTarget.to.1layhid3 8.177245e+01
## BlockedShots.to.1layhid3 -5.703170e+02
## Corners.to.1layhid3 2.974772e+02
## PassSuccPerc.to.1layhid3 2.113029e+01
## AerialsWon.to.1layhid3 -2.978337e+02
## Fouls.to.1layhid3 -1.240649e+02
## YellowCards.to.1layhid3 4.268320e+01
## RedCards.to.1layhid3 2.869099e+02
## Intercept.to.1layhid4 -2.526956e+01
## GoalsHT.to.1layhid4 -7.189838e+02
## GoalsFT.to.1layhid4 1.140839e+01
## BallPos.to.1layhid4 2.046329e+02
## ShotsOffTarget.to.1layhid4 3.353699e+02
## ShotsOnTarget.to.1layhid4 4.884052e+01
## BlockedShots.to.1layhid4 -3.264203e+02
## Corners.to.1layhid4 1.875049e+02
## PassSuccPerc.to.1layhid4 3.764271e+01
## AerialsWon.to.1layhid4 1.427801e+02
## Fouls.to.1layhid4 -9.190623e+01
## YellowCards.to.1layhid4 -2.984296e+02
## RedCards.to.1layhid4 -5.654870e+01
## Intercept.to.1layhid5 6.546110e+01
## GoalsHT.to.1layhid5 4.445189e+02
## GoalsFT.to.1layhid5 1.395439e+02
## BallPos.to.1layhid5 -1.294861e+01
## ShotsOffTarget.to.1layhid5 -1.029155e+02
## ShotsOnTarget.to.1layhid5 1.356225e+00
## BlockedShots.to.1layhid5 -1.708471e+02
## Corners.to.1layhid5 1.982213e+02
## PassSuccPerc.to.1layhid5 2.528641e+02
## AerialsWon.to.1layhid5 -2.754776e+02
## Fouls.to.1layhid5 -4.098509e+01
## YellowCards.to.1layhid5 9.375047e+01
## RedCards.to.1layhid5 -1.303938e+02
## Intercept.to.1layhid6 5.978276e+00
## GoalsHT.to.1layhid6 -9.942881e+00
## GoalsFT.to.1layhid6 1.916425e+00
## BallPos.to.1layhid6 3.548889e-01
## ShotsOffTarget.to.1layhid6 -1.287723e+00
## ShotsOnTarget.to.1layhid6 5.520516e+00
## BlockedShots.to.1layhid6 -9.598037e+00
## Corners.to.1layhid6 -7.998171e+00
## PassSuccPerc.to.1layhid6 1.710620e+00
## AerialsWon.to.1layhid6 1.627903e-01
## Fouls.to.1layhid6 -8.670247e+00
## YellowCards.to.1layhid6 1.737480e+00
## RedCards.to.1layhid6 -8.048826e-01
## Intercept.to.1layhid7 5.043153e+01
## GoalsHT.to.1layhid7 1.728859e+01
## GoalsFT.to.1layhid7 2.779677e+02
## BallPos.to.1layhid7 -5.165693e+02
## ShotsOffTarget.to.1layhid7 2.205352e+02
## ShotsOnTarget.to.1layhid7 7.241200e+01
## BlockedShots.to.1layhid7 -5.592941e+01
## Corners.to.1layhid7 3.535524e+01
## PassSuccPerc.to.1layhid7 9.388156e+01
## AerialsWon.to.1layhid7 -1.484819e+02
## Fouls.to.1layhid7 1.477469e+02
## YellowCards.to.1layhid7 -2.616297e+02
## RedCards.to.1layhid7 4.133430e+01
## Intercept.to.1layhid8 8.879912e+01
## GoalsHT.to.1layhid8 -3.321490e+01
## GoalsFT.to.1layhid8 -2.418978e+01
## BallPos.to.1layhid8 -8.322425e+01
## ShotsOffTarget.to.1layhid8 -8.661314e+01
## ShotsOnTarget.to.1layhid8 -3.653862e+02
## BlockedShots.to.1layhid8 -3.842196e+01
## Corners.to.1layhid8 -9.108920e+01
## PassSuccPerc.to.1layhid8 1.364117e+02
## AerialsWon.to.1layhid8 4.068274e+02
## Fouls.to.1layhid8 3.341251e+02
## YellowCards.to.1layhid8 -1.928375e+02
## RedCards.to.1layhid8 -1.869374e+00
## Intercept.to.1layhid9 8.342168e+00
## GoalsHT.to.1layhid9 1.748306e+01
## GoalsFT.to.1layhid9 -1.855226e+01
## BallPos.to.1layhid9 2.165517e+01
## ShotsOffTarget.to.1layhid9 -3.549926e+00
## ShotsOnTarget.to.1layhid9 1.040237e+01
## BlockedShots.to.1layhid9 2.676200e+00
## Corners.to.1layhid9 -8.214676e+00
## PassSuccPerc.to.1layhid9 -1.235792e+01
## AerialsWon.to.1layhid9 -3.037610e+00
## Fouls.to.1layhid9 -2.474463e+01
## YellowCards.to.1layhid9 2.042643e+01
## RedCards.to.1layhid9 1.426404e+00
## Intercept.to.1layhid10 9.603258e+01
## GoalsHT.to.1layhid10 1.559129e+02
## GoalsFT.to.1layhid10 1.231473e+03
## BallPos.to.1layhid10 -2.197492e+02
## ShotsOffTarget.to.1layhid10 5.524626e+01
## ShotsOnTarget.to.1layhid10 2.376378e+02
## BlockedShots.to.1layhid10 -2.601905e+01
## Corners.to.1layhid10 2.060538e+02
## PassSuccPerc.to.1layhid10 1.287520e+02
## AerialsWon.to.1layhid10 -2.857632e+02
## Fouls.to.1layhid10 -9.967303e+01
## YellowCards.to.1layhid10 9.754119e+01
## RedCards.to.1layhid10 1.245566e+01
## Intercept.to.home -5.396550e-01
## 1layhid1.to.home -1.948927e+00
## 1layhid2.to.home -2.554309e+00
## 1layhid3.to.home -2.065735e+00
## 1layhid4.to.home 9.011349e-01
## 1layhid5.to.home -1.879732e+00
## 1layhid6.to.home -1.205966e+00
## 1layhid7.to.home 9.453114e-01
## 1layhid8.to.home -1.146854e+00
## 1layhid9.to.home 2.536890e+00
## 1layhid10.to.home 2.778417e+00
plot(matches.nn, rep = "best", arrow.length = 0.15, information = T)
# note: the network was trained on the scaled data (matches.train.nn);
# passing the unscaled matches.train here feeds the covariates on a
# different scale, which likely degrades the reported performance
pred.nn.train <- predict(matches.nn,
                         matches.train,
                         type = "response")
nn.train.confMat = confusionMatrix(data = as.factor(ifelse(pred.nn.train > 0.5, 1, 0)), # predictions
                                   # actual values
                                   reference = as.factor(matches.train.nn$home),
                                   # definition of the "success" label
                                   positive = "1")
nn.train.confMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1261 1044
## 1 601 818
##
## Accuracy : 0.5583
## 95% CI : (0.5421, 0.5743)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 6.087e-13
##
## Kappa : 0.1165
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4393
## Specificity : 0.6772
## Pos Pred Value : 0.5765
## Neg Pred Value : 0.5471
## Prevalence : 0.5000
## Detection Rate : 0.2197
## Detection Prevalence : 0.3810
## Balanced Accuracy : 0.5583
##
## 'Positive' Class : 1
##
# same caveat: matches.test holds the unscaled covariates
pred.nn.test <- predict(matches.nn,
                        matches.test,
                        type = "response")
confusionMatrix.nn.test = confusionMatrix(data = as.factor(ifelse(pred.nn.test > 0.5, 1, 0)), # predictions
                                          # actual values
                                          reference = as.factor(matches.test.nn$home),
                                          # definition of the "success" label
                                          positive = "1")
confusionMatrix.nn.test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 544 445
## 1 254 353
##
## Accuracy : 0.562
## 95% CI : (0.5373, 0.5866)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 3.963e-07
##
## Kappa : 0.1241
##
## Mcnemar's Test P-Value : 6.649e-13
##
## Sensitivity : 0.4424
## Specificity : 0.6817
## Pos Pred Value : 0.5815
## Neg Pred Value : 0.5501
## Prevalence : 0.5000
## Detection Rate : 0.2212
## Detection Prevalence : 0.3803
## Balanced Accuracy : 0.5620
##
## 'Positive' Class : 1
##
The obtained accuracy is 0.558 for the training set and 0.562 for the test set – slightly higher on test data; for the test sample this is (544 + 353) / 1596 ≈ 0.562.
Let's check a bigger neural network.
Note: the threshold – neuralnet's stopping criterion on the partial derivatives of the error function – had to be raised from 0.01 to 0.1, because convergence could not be achieved otherwise.
set.seed(12345678)
matches.nn.bigger = neuralnet(model1.formula,
data = as.data.frame(matches.train.nn),
hidden = c(10, 5, 3), # number of neurons in hidden layers
linear.output = FALSE, # T for regression, F for classification
learningrate.limit = NULL,
learningrate.factor = list(minus = 0.5, plus = 1.2),
algorithm = "rprop+",
threshold = 0.1,
stepmax = 200000,
lifesign = 'full',
rep = 1)
## hidden: 10, 5, 3 thresh: 0.1 rep: 1/1 steps: 1000 min thresh: 0.258841055656181
## ... (intermediate progress lines omitted; min thresh plateaus at 0.1734 for roughly 40,000 steps before falling) ...
## 83000 min thresh: 0.101844255864522
## 83447 error: 368.78614 time: 4.04 mins
matches.nn.bigger$result.matrix
## [,1]
## error 3.687861e+02
## reached.threshold 9.953540e-02
## steps 8.344700e+04
## Intercept.to.1layhid1 -7.040357e+00
## GoalsHT.to.1layhid1 -6.963351e+01
## GoalsFT.to.1layhid1 2.325536e+00
## BallPos.to.1layhid1 -6.860489e+01
## ShotsOffTarget.to.1layhid1 1.576957e+01
## ShotsOnTarget.to.1layhid1 -5.792615e+01
## BlockedShots.to.1layhid1 -1.476103e+01
## Corners.to.1layhid1 2.706482e+01
## PassSuccPerc.to.1layhid1 8.196845e+00
## AerialsWon.to.1layhid1 3.732609e+02
## Fouls.to.1layhid1 -2.471538e+01
## YellowCards.to.1layhid1 1.350629e+02
## RedCards.to.1layhid1 1.503123e+01
## Intercept.to.1layhid2 -1.718273e+00
## GoalsHT.to.1layhid2 2.380330e-01
## GoalsFT.to.1layhid2 1.305701e+01
## BallPos.to.1layhid2 -3.207011e+00
## ShotsOffTarget.to.1layhid2 9.811692e+00
## ShotsOnTarget.to.1layhid2 4.050073e+00
## BlockedShots.to.1layhid2 -1.245726e+01
## Corners.to.1layhid2 4.833888e-01
## PassSuccPerc.to.1layhid2 -1.064574e+00
## AerialsWon.to.1layhid2 4.370854e+00
## Fouls.to.1layhid2 -9.626796e+00
## YellowCards.to.1layhid2 -7.587365e+00
## RedCards.to.1layhid2 8.543156e+00
## Intercept.to.1layhid3 -7.383724e+00
## GoalsHT.to.1layhid3 -2.134369e+01
## GoalsFT.to.1layhid3 3.124885e+01
## BallPos.to.1layhid3 -6.214812e+00
## ShotsOffTarget.to.1layhid3 2.762798e+01
## ShotsOnTarget.to.1layhid3 -9.615544e-01
## BlockedShots.to.1layhid3 -1.168879e+01
## Corners.to.1layhid3 -9.688361e+00
## PassSuccPerc.to.1layhid3 -9.267291e+00
## AerialsWon.to.1layhid3 1.957771e+01
## Fouls.to.1layhid3 1.227819e+00
## YellowCards.to.1layhid3 7.131152e+01
## RedCards.to.1layhid3 -8.885766e-01
## Intercept.to.1layhid4 1.492267e+00
## GoalsHT.to.1layhid4 -1.125044e-01
## GoalsFT.to.1layhid4 -3.991955e+00
## BallPos.to.1layhid4 2.878284e-01
## ShotsOffTarget.to.1layhid4 -2.737531e+00
## ShotsOnTarget.to.1layhid4 -9.853406e+00
## BlockedShots.to.1layhid4 -1.018777e+01
## Corners.to.1layhid4 5.830624e-01
## PassSuccPerc.to.1layhid4 1.770311e+00
## AerialsWon.to.1layhid4 5.256519e-01
## Fouls.to.1layhid4 2.259209e+00
## YellowCards.to.1layhid4 -1.608331e+00
## RedCards.to.1layhid4 1.187943e+00
## Intercept.to.1layhid5 3.495855e-01
## GoalsHT.to.1layhid5 -1.370682e+01
## GoalsFT.to.1layhid5 -1.538569e+00
## BallPos.to.1layhid5 -4.472841e+00
## ShotsOffTarget.to.1layhid5 -2.152538e+00
## ShotsOnTarget.to.1layhid5 9.965417e-01
## BlockedShots.to.1layhid5 1.204617e+00
## Corners.to.1layhid5 -3.935971e+00
## PassSuccPerc.to.1layhid5 5.866326e-01
## AerialsWon.to.1layhid5 6.040828e+00
## Fouls.to.1layhid5 1.078237e+01
## YellowCards.to.1layhid5 -1.243631e+01
## RedCards.to.1layhid5 -4.771623e+00
## Intercept.to.1layhid6 1.412137e+00
## GoalsHT.to.1layhid6 -5.605170e+00
## GoalsFT.to.1layhid6 1.435732e+00
## BallPos.to.1layhid6 5.326573e+00
## ShotsOffTarget.to.1layhid6 1.035111e+00
## ShotsOnTarget.to.1layhid6 -5.951718e-01
## BlockedShots.to.1layhid6 -1.034811e+01
## Corners.to.1layhid6 -9.892569e+00
## PassSuccPerc.to.1layhid6 -1.922237e+00
## AerialsWon.to.1layhid6 3.907944e+00
## Fouls.to.1layhid6 -5.408726e-01
## YellowCards.to.1layhid6 6.273727e+00
## RedCards.to.1layhid6 3.955135e+01
## Intercept.to.1layhid7 7.376309e-01
## GoalsHT.to.1layhid7 -1.170902e+01
## GoalsFT.to.1layhid7 -3.268167e+00
## BallPos.to.1layhid7 -7.383349e-01
## ShotsOffTarget.to.1layhid7 1.110020e+00
## ShotsOnTarget.to.1layhid7 5.281076e+00
## BlockedShots.to.1layhid7 -6.003796e+00
## Corners.to.1layhid7 -3.137632e+00
## PassSuccPerc.to.1layhid7 2.333979e+00
## AerialsWon.to.1layhid7 -2.003091e+00
## Fouls.to.1layhid7 1.035093e+00
## YellowCards.to.1layhid7 -8.841806e+00
## RedCards.to.1layhid7 -1.255774e+00
## Intercept.to.1layhid8 -4.900037e-01
## GoalsHT.to.1layhid8 3.783375e-01
## GoalsFT.to.1layhid8 1.193727e+00
## BallPos.to.1layhid8 4.331776e+00
## ShotsOffTarget.to.1layhid8 3.349571e-01
## ShotsOnTarget.to.1layhid8 -2.043312e+00
## BlockedShots.to.1layhid8 -1.311085e+00
## Corners.to.1layhid8 -1.197097e+00
## PassSuccPerc.to.1layhid8 -4.886832e-01
## AerialsWon.to.1layhid8 -8.516111e+00
## Fouls.to.1layhid8 -1.947599e+00
## YellowCards.to.1layhid8 -1.890971e+00
## RedCards.to.1layhid8 1.227204e+00
## Intercept.to.1layhid9 -6.237794e+00
## GoalsHT.to.1layhid9 2.796424e+01
## GoalsFT.to.1layhid9 3.057003e+01
## BallPos.to.1layhid9 -3.989111e+00
## ShotsOffTarget.to.1layhid9 3.780770e+01
## ShotsOnTarget.to.1layhid9 -2.902337e+01
## BlockedShots.to.1layhid9 2.402594e+01
## Corners.to.1layhid9 3.644041e+00
## PassSuccPerc.to.1layhid9 1.383473e+00
## AerialsWon.to.1layhid9 1.467286e+01
## Fouls.to.1layhid9 6.093975e+00
## YellowCards.to.1layhid9 -1.492150e+01
## RedCards.to.1layhid9 -1.495188e+01
## Intercept.to.1layhid10 7.781067e-01
## GoalsHT.to.1layhid10 2.224023e+00
## GoalsFT.to.1layhid10 5.646698e+00
## BallPos.to.1layhid10 -5.813911e-01
## ShotsOffTarget.to.1layhid10 -3.594223e-01
## ShotsOnTarget.to.1layhid10 1.799787e-01
## BlockedShots.to.1layhid10 -5.041433e+00
## Corners.to.1layhid10 -1.808461e+00
## PassSuccPerc.to.1layhid10 -1.353021e+00
## AerialsWon.to.1layhid10 1.455983e+00
## Fouls.to.1layhid10 -4.087517e+00
## YellowCards.to.1layhid10 5.003128e+00
## RedCards.to.1layhid10 8.952499e-01
## Intercept.to.2layhid1 -1.343696e+00
## 1layhid1.to.2layhid1 -4.668598e+00
## 1layhid2.to.2layhid1 -1.000518e+02
## 1layhid3.to.2layhid1 -3.469063e+01
## 1layhid4.to.2layhid1 2.591603e+02
## 1layhid5.to.2layhid1 3.629753e+01
## 1layhid6.to.2layhid1 -3.549011e+01
## 1layhid7.to.2layhid1 -6.470533e+01
## 1layhid8.to.2layhid1 2.127251e+02
## 1layhid9.to.2layhid1 1.433246e+00
## 1layhid10.to.2layhid1 3.464958e+02
## Intercept.to.2layhid2 2.325881e+01
## 1layhid1.to.2layhid2 7.278032e+01
## 1layhid2.to.2layhid2 8.895268e+01
## 1layhid3.to.2layhid2 4.490044e+01
## 1layhid4.to.2layhid2 -4.755002e+02
## 1layhid5.to.2layhid2 1.943281e-01
## 1layhid6.to.2layhid2 3.215772e+01
## 1layhid7.to.2layhid2 -7.916807e+02
## 1layhid8.to.2layhid2 3.216913e+02
## 1layhid9.to.2layhid2 2.371741e+02
## 1layhid10.to.2layhid2 -1.667384e+02
## Intercept.to.2layhid3 -1.231244e+00
## 1layhid1.to.2layhid3 1.695450e-01
## 1layhid2.to.2layhid3 2.599956e+00
## 1layhid3.to.2layhid3 -6.177247e+00
## 1layhid4.to.2layhid3 -2.047378e+00
## 1layhid5.to.2layhid3 3.948157e+00
## 1layhid6.to.2layhid3 3.691152e+00
## 1layhid7.to.2layhid3 -3.349184e+01
## 1layhid8.to.2layhid3 6.717702e+00
## 1layhid9.to.2layhid3 -2.767796e+00
## 1layhid10.to.2layhid3 5.228598e+00
## Intercept.to.2layhid4 -4.819153e-01
## 1layhid1.to.2layhid4 1.411761e+00
## 1layhid2.to.2layhid4 -1.990714e+00
## 1layhid3.to.2layhid4 6.502826e-01
## 1layhid4.to.2layhid4 -2.550957e+00
## 1layhid5.to.2layhid4 4.441440e-02
## 1layhid6.to.2layhid4 -8.066907e-01
## 1layhid7.to.2layhid4 1.106835e+01
## 1layhid8.to.2layhid4 1.769043e+00
## 1layhid9.to.2layhid4 1.519993e-01
## 1layhid10.to.2layhid4 8.674581e-01
## Intercept.to.2layhid5 1.901063e+00
## 1layhid1.to.2layhid5 -8.410928e-01
## 1layhid2.to.2layhid5 6.400484e-01
## 1layhid3.to.2layhid5 -2.136497e+00
## 1layhid4.to.2layhid5 -2.571489e+00
## 1layhid5.to.2layhid5 -4.516643e-02
## 1layhid6.to.2layhid5 -9.829948e-01
## 1layhid7.to.2layhid5 1.067412e+01
## 1layhid8.to.2layhid5 -2.612776e+00
## 1layhid9.to.2layhid5 7.231247e-01
## 1layhid10.to.2layhid5 2.853550e+00
## Intercept.to.3layhid1 2.158957e+00
## 2layhid1.to.3layhid1 -4.442133e+00
## 2layhid2.to.3layhid1 -2.839498e-01
## 2layhid3.to.3layhid1 3.229194e-01
## 2layhid4.to.3layhid1 4.696510e-01
## 2layhid5.to.3layhid1 -4.038692e-01
## Intercept.to.3layhid2 1.489810e+01
## 2layhid1.to.3layhid2 1.163838e+01
## 2layhid2.to.3layhid2 -7.459596e+00
## 2layhid3.to.3layhid2 -6.775711e+03
## 2layhid4.to.3layhid2 -9.338674e+01
## 2layhid5.to.3layhid2 4.379107e+02
## Intercept.to.3layhid3 3.198636e+00
## 2layhid1.to.3layhid3 -7.765468e+00
## 2layhid2.to.3layhid3 1.406738e+00
## 2layhid3.to.3layhid3 -2.478482e+00
## 2layhid4.to.3layhid3 -7.652450e+00
## 2layhid5.to.3layhid3 7.378313e-01
## Intercept.to.home 3.347748e+00
## 3layhid1.to.home -5.611594e+01
## 3layhid2.to.home 1.431989e+00
## 3layhid3.to.home 2.027931e+03
plot(matches.nn.bigger, rep = "best", arrow.length = 0.15, information = T)
pred.nn.bigger.train <- predict(matches.nn.bigger,
matches.train,
type = "response")
nn.bigger.train.confMat = confusionMatrix(data = as.factor(ifelse(pred.nn.bigger.train > 0.5, 1, 0)), # predictions
# actual values
reference = as.factor(matches.train.nn$home),
# definitions of the "success" label
positive = "1")
nn.bigger.train.confMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 210 197
## 1 1652 1665
##
## Accuracy : 0.5035
## 95% CI : (0.4873, 0.5197)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.341
##
## Kappa : 0.007
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8942
## Specificity : 0.1128
## Pos Pred Value : 0.5020
## Neg Pred Value : 0.5160
## Prevalence : 0.5000
## Detection Rate : 0.4471
## Detection Prevalence : 0.8907
## Balanced Accuracy : 0.5035
##
## 'Positive' Class : 1
##
pred.nn.bigger.test <- predict(matches.nn.bigger,
matches.test,
type = "response")
nn.bigger.test.confMat = confusionMatrix(data = as.factor(ifelse(pred.nn.bigger.test > 0.5, 1, 0)), # predictions
# actual values
reference = as.factor(matches.test.nn$home),
# definitions of the "success" label
positive = "1")
nn.bigger.test.confMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 106 87
## 1 692 711
##
## Accuracy : 0.5119
## 95% CI : (0.4871, 0.5367)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.1772
##
## Kappa : 0.0238
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8910
## Specificity : 0.1328
## Pos Pred Value : 0.5068
## Neg Pred Value : 0.5492
## Prevalence : 0.5000
## Detection Rate : 0.4455
## Detection Prevalence : 0.8791
## Balanced Accuracy : 0.5119
##
## 'Positive' Class : 1
##
Raising the threshold and enlarging the network resulted in worse accuracy (about 0.50 on the training set and 0.51 on the test set) than for the simpler neural network; the bigger model also degenerates towards predicting the 'home' class for almost every observation (detection prevalence of about 0.88–0.89).
ROC.nn.train <-
pROC::roc(as.numeric(matches.train$home == "yes"),
predict(matches.nn, matches.train.nn[, -13]))
## Setting levels: control = 0, case = 1
## Warning in roc.default(as.numeric(matches.train$home == "yes"),
## predict(matches.nn, : Deprecated use a matrix as predictor. Unexpected results
## may be produced, please pass a numeric vector.
## Setting direction: controls < cases
ROC.nn.test <-
pROC::roc(as.numeric(matches.test$home == "yes"),
predict(matches.nn, matches.test.nn[, -13]))
## Setting levels: control = 0, case = 1
## Warning in roc.default(as.numeric(matches.test$home == "yes"),
## predict(matches.nn, : Deprecated use a matrix as predictor. Unexpected results
## may be produced, please pass a numeric vector.
## Setting direction: controls < cases
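The deprecation warnings appear because `predict()` on a neuralnet object returns a one-column matrix, while pROC expects a plain numeric vector. A sketch of the warning-free call (the same ROC curve, just a flattened predictor):
# flattening the one-column prediction matrix avoids the pROC deprecation
# warning; the resulting ROC object is identical
ROC.nn.test <- pROC::roc(as.numeric(matches.test$home == "yes"),
                         as.numeric(predict(matches.nn, matches.test.nn[, -13])))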
list(
ROC.nn.train = ROC.nn.train,
ROC.nn.test = ROC.nn.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"nn = ",
round(100*(2 * auc(ROC.nn.train) - 1), 1), "%, ",
"Gini TEST: ",
"nn = ",
round(100*(2 * auc(ROC.nn.test) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
The Gini index values obtained for the test data set are much worse than for the previously presented methods.
The last model is a neural network built with the keras package.
The network consists of three dense layers with 256, 128, and 1 neurons, respectively.
library(keras)
library(tensorflow)
##
## Attaching package: 'tensorflow'
## The following object is masked from 'package:caret':
##
## train
model <- keras_model_sequential()
## Loaded Tensorflow version 2.8.0
model %>%
layer_dense(units = 256, activation = "sigmoid",
input_shape = 12) %>%
layer_dense(units = 128, activation = "sigmoid") %>%
layer_dense(units = 1, activation = "sigmoid")
model %>%
compile(
optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = tf$keras$metrics$AUC()
# metric = "AUC"
)
model %>%
fit(as.matrix(matches.train.nn[, -13]),
matches.train.nn[, 13],
epochs = 500,
batch_size = 1024,
verbose = 0)
result <- evaluate(model, as.matrix(matches.test.nn[, -13]), matches.test.nn[, 13])
result
## loss auc
## 0.6616802 0.6400108
ROC.keras.train <-
pROC::roc(matches.train.nn[, 13] == 1,
predict(model, as.matrix(matches.train.nn[, -13])))
## Setting levels: control = FALSE, case = TRUE
## Warning in roc.default(matches.train.nn[, 13] == 1, predict(model,
## as.matrix(matches.train.nn[, : Deprecated use a matrix as predictor. Unexpected
## results may be produced, please pass a numeric vector.
## Setting direction: controls < cases
ROC.keras.test <-
pROC::roc(as.numeric(matches.test.nn[, 13] == 1),
model %>% predict(as.matrix(matches.test.nn[, -13])) %>% as.numeric())
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
matches.pred.train.keras <- predict(model, as.matrix(matches.train.nn[, -13]))
matches.pred.test.keras <- predict(model, as.matrix(matches.test.nn[, -13]))
# note: no `positive =` argument is given here, so caret takes the first
# factor level ("0") as the positive class - see the output below
(confusionMatrix.train.keras <-
   confusionMatrix(as.numeric((matches.pred.train.keras > 0.5)) %>% as.factor(),
                   as.factor(ifelse(matches.train$home == "yes", 1, 0))))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1248 839
## 1 614 1023
##
## Accuracy : 0.6098
## 95% CI : (0.5939, 0.6255)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2197
##
## Mcnemar's Test P-Value : 4.192e-09
##
## Sensitivity : 0.6702
## Specificity : 0.5494
## Pos Pred Value : 0.5980
## Neg Pred Value : 0.6249
## Prevalence : 0.5000
## Detection Rate : 0.3351
## Detection Prevalence : 0.5604
## Balanced Accuracy : 0.6098
##
## 'Positive' Class : 0
##
(confusionMatrix.keras.test <-
confusionMatrix(as.numeric((matches.pred.test.keras > 0.5)) %>% as.factor(),
as.factor(ifelse(matches.test$home == "yes", 1, 0))) )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 530 369
## 1 268 429
##
## Accuracy : 0.6009
## 95% CI : (0.5764, 0.625)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 3.749e-16
##
## Kappa : 0.2018
##
## Mcnemar's Test P-Value : 7.428e-05
##
## Sensitivity : 0.6642
## Specificity : 0.5376
## Pos Pred Value : 0.5895
## Neg Pred Value : 0.6155
## Prevalence : 0.5000
## Detection Rate : 0.3321
## Detection Prevalence : 0.5633
## Balanced Accuracy : 0.6009
##
## 'Positive' Class : 0
##
# confusionMatrix.keras$overall[['Accuracy']]
Accuracy is 0.610 for the train sample and 0.601 for the test sample. Note that in these two confusion matrices caret took 0 as the positive class, so the reported sensitivity and specificity refer to the 'away' label; accuracy itself is unaffected.
list(
ROC.keras.train = ROC.keras.train,
ROC.keras.test = ROC.keras.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"keras = ",
round(100*(2 * auc(ROC.keras.train) - 1), 1), "%, ",
"Gini TEST: ",
"keras = ",
round(100*(2 * auc(ROC.keras.test) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
The obtained Gini coefficient – 28.1% for the test data – is clearly better than for the neuralnet models above.
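As a quick sanity check, the Gini value follows directly from the test AUC reported by `evaluate()` above via Gini = 2 · AUC − 1:
# Gini = 2 * AUC - 1, using the keras test AUC from evaluate() above
2 * 0.6400108 - 1   # = 0.2800216, i.e. roughly the 28% quoted for the test set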
list(
ROC.logit.train = ROC.logit.train,
ROC.logit.test = ROC.logit.test,
ROC.tree.pruned.train = ROC.tree.pruned.train,
ROC.tree.pruned.test = ROC.tree.pruned.test,
ROC.rf2.train = ROC.rf2.train,
ROC.rf2.test = ROC.rf2.test,
ROC.xgb6.train = ROC.xgb6.train,
ROC.xgb6.test = ROC.xgb6.test,
ROC.nn.train = ROC.nn.train,
ROC.nn.test = ROC.nn.test,
ROC.keras.train = ROC.keras.train,
ROC.keras.test = ROC.keras.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"logit = ",
round(100*(2 * auc(ROC.logit.train) - 1), 1), "%, ",
"tree = ",
round(100*(2 * auc(ROC.tree.pruned.train) - 1), 1), "%, ",
"RF = ",
round(100*(2 * auc(ROC.rf2.train) - 1), 1), "%, ",
"XGB = ",
round(100*(2 * auc(ROC.xgb6.train) - 1), 1), "%, ",
"nn = ",
round(100*(2 * auc(ROC.nn.train) - 1), 1), "%, ",
"keras = ",
round(100*(2 * auc(ROC.keras.train) - 1), 1), "%, ",
"\nGini TEST: ",
"logit = ",
round(100*(2 * auc(ROC.logit.test) - 1), 1), "%, ",
"tree = ",
round(100*(2 * auc(ROC.tree.pruned.test) - 1), 1), "%, ",
"RF = ",
round(100*(2 * auc(ROC.rf2.test) - 1), 1), "%, ",
"XGB = ",
round(100*(2 * auc(ROC.xgb6.test) - 1), 1), "%, ",
"nn = ",
round(100*(2 * auc(ROC.nn.test) - 1), 1), "% ",
"keras = ",
round(100*(2 * auc(ROC.keras.test) - 1), 1), "% "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
# round(100*(2 * auc(ROC.logit.train) - 1), 1)
# round(100*(2 * auc(ROC.tree.pruned.train) - 1), 1)
# round(100*(2 * auc(ROC.rf2.train) - 1), 1)
# round(100*(2 * auc(ROC.xgb6.train) - 1), 1)
# round(100*(2 * auc(ROC.nn.train) - 1), 1)
# round(100*(2 * auc(ROC.keras.train) - 1), 1)
# round(100*(2 * auc(ROC.logit.test) - 1), 1)
# round(100*(2 * auc(ROC.tree.pruned.test) - 1), 1)
# round(100*(2 * auc(ROC.rf2.test) - 1), 1)
# round(100*(2 * auc(ROC.xgb6.test) - 1), 1)
# round(100*(2 * auc(ROC.nn.test) - 1), 1)
# round(100*(2 * auc(ROC.keras.test) - 1), 1)
Several ML methods and a classical econometric method (logistic regression) were used to predict whether a team played its match as the host or the guest. As the graph above shows, the best classification results in terms of the Gini coefficient were obtained with the logistic regression; second in line is the neural network built with the keras package.
rNames = c('logit', 'pruned tree', 'RF', 'XGB', 'nn', 'keras')
accuracies = c(confusionMatrix.logit.test$overall[['Accuracy']],
confusionMatrix.tree.pruned.test$overall[['Accuracy']],
confusionMatrix.rf2.test$overall[['Accuracy']],
confusionMatrix.xgb6.test$overall[['Accuracy']],
confusionMatrix.nn.test$overall[['Accuracy']],
confusionMatrix.keras.test$overall[['Accuracy']])
sensitivities = c(confusionMatrix.logit.test$byClass[['Sensitivity']],
confusionMatrix.tree.pruned.test$byClass[['Sensitivity']],
confusionMatrix.rf2.test$byClass[['Sensitivity']],
confusionMatrix.xgb6.test$byClass[['Sensitivity']],
confusionMatrix.nn.test$byClass[['Sensitivity']],
confusionMatrix.keras.test$byClass[['Sensitivity']])
specificities = c(confusionMatrix.logit.test$byClass[['Specificity']],
confusionMatrix.tree.pruned.test$byClass[['Specificity']],
confusionMatrix.rf2.test$byClass[['Specificity']],
confusionMatrix.xgb6.test$byClass[['Specificity']],
confusionMatrix.nn.test$byClass[['Specificity']],
confusionMatrix.keras.test$byClass[['Specificity']])
balancedAccuracies = c(confusionMatrix.logit.test$byClass[['Balanced Accuracy']],
confusionMatrix.tree.pruned.test$byClass[['Balanced Accuracy']],
confusionMatrix.rf2.test$byClass[['Balanced Accuracy']],
confusionMatrix.xgb6.test$byClass[['Balanced Accuracy']],
confusionMatrix.nn.test$byClass[['Balanced Accuracy']],
confusionMatrix.keras.test$byClass[['Balanced Accuracy']])
F1s = c(confusionMatrix.logit.test$byClass[['F1']],
confusionMatrix.tree.pruned.test$byClass[['F1']],
confusionMatrix.rf2.test$byClass[['F1']],
confusionMatrix.xgb6.test$byClass[['F1']],
confusionMatrix.nn.test$byClass[['F1']],
confusionMatrix.keras.test$byClass[['F1']])
# collect the test-set metrics into one comparison table
# (named metrics.summary to avoid shadowing base::sum)
metrics.summary = data.frame(Accuracy = accuracies, Sensitivity = sensitivities,
                             Specificity = specificities, 'Balanced Accuracy' = balancedAccuracies,
                             F1 = F1s,
                             row.names = rNames)
metrics.summary
## Accuracy Sensitivity Specificity Balanced.Accuracy F1
## logit 0.6040100 0.5576441 0.6503759 0.6040100 0.5847569
## pruned tree 0.5795739 0.6115288 0.5476190 0.5795739 0.5925926
## RF 0.5946115 0.5551378 0.6340852 0.5946115 0.5779517
## XGB 0.6058897 0.5814536 0.6303258 0.6058897 0.5960180
## nn 0.5620301 0.4423559 0.6817043 0.5620301 0.5024911
## keras 0.6008772 0.6641604 0.5375940 0.6008772 0.6246317
Classification accuracy is quite similar across all methods. Sensitivity is best for the neural network built with the keras package and worst for the network built with the neuralnet package; specificity shows exactly the opposite pattern. A caveat applies here: as noted above, caret took 0 as the positive class in the keras confusion matrices, so the keras sensitivity and specificity in the table refer to the 'away' label and are effectively swapped relative to the other rows. Balanced accuracy is best for XGB, with logistic regression a close second. The F1 statistic is noticeably best for the keras network and worst for the neuralnet one. The sensitivity/specificity trade-off also depends on the fixed 0.5 cutoff used throughout; a sketch of an alternative cutoff choice follows.
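All of the confusion matrices above use a fixed 0.5 cutoff. A minimal sketch of how the cutoff could instead be chosen from the ROC curve, using pROC's coords on the existing ROC objects (an illustration only, not part of the original analysis):
# cutoff maximizing Youden's J (sensitivity + specificity - 1) on the
# keras test ROC; an illustration, not used in the results above
pROC::coords(ROC.keras.test, "best", best.method = "youden",
             ret = c("threshold", "sensitivity", "specificity"))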
The presented statistics show that the predictive power of our models is far from perfect. There is still room for improvement, for example by testing other neural network layouts, as sketched below.
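For instance, a hypothetical alternative keras layout with ReLU activations and dropout regularization could be tried; the sketch below was not fitted in this report:
# a smaller network with ReLU hidden layers and dropout regularization;
# a hypothetical layout shown only as an example of a possible next step
model.alt <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = 12) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid")
model.alt %>% compile(optimizer = "adam",
                      loss = "binary_crossentropy",
                      metrics = tf$keras$metrics$AUC())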