The project aims at predicting whether a team was the host of a match or the guest. To achieve this, we gathered data for seven seasons of the English Premier League using our self-built web scraper.
Similar topics were investigated by scholars in the past, e.g. Carmichael and Thomas (2005) assessed the existence of a home-field effect. They conducted their analysis for the English Premier League and found that attacking play is more productive for home teams and that defensive play is more important for away teams.
We utilized a handful of ML methods, with classical logistic regression serving as a benchmark. The exact methods are: classification trees, random forests, extreme gradient boosting (XGBoost), and neural networks.
library(readr)
matches_org <- read_delim("meczeTrans.csv", delim = ";", escape_double = FALSE, locale = locale(decimal_mark = ","), trim_ws = TRUE)
# Every other row contains statistics either for a home or away team.
home = rep(c(1, 0), times = length(matches_org$MatchID)/2)
matches_org$home = home
# dim(matches_org)
matches_s = subset(matches_org, select= -c(...1, MatchID, MatchDate, Week, TeamName))
# Convert RedCards to binary variable
matches_s$RedCards[matches_s$RedCards > 0] = 1
matchesF = matches_s
dim(matchesF)
## [1] 5320 13
head(matchesF)
## # A tibble: 6 x 13
## GoalsHT GoalsFT BallPos ShotsOffTarget ShotsOnTarget BlockedShots Corners
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 0.6 5 5 4 4
## 2 1 2 0.4 0 4 1 0
## 3 0 0 0.51 7 6 6 8
## 4 0 1 0.49 3 4 4 9
## 5 0 0 0.63 4 2 6 2
## 6 0 1 0.37 4 1 2 8
## # ... with 6 more variables: PassSuccPerc <dbl>, AerialsWon <dbl>, Fouls <dbl>,
## # YellowCards <dbl>, RedCards <dbl>, home <dbl>
Each row in the data set contains statistics for one team. Two consecutive rows (starting from the first one) contain statistics of one match – for home and away team of this match.
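For illustration, a single match pair can be inspected in the raw data (using the column names from the code above):
# First match: the home-team row followed by the away-team row
matches_org[1:2, c("MatchID", "TeamName", "home")]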
There are 17 variables in the original data set.
The first four (MatchID, MatchDate, Week and TeamName) were discarded, since they are irrelevant to the conducted analysis.
Due to the very low number of cases with more than one red card given during a match, this variable was recoded into a binary one indicating whether the team received a red card or not. High correlation is not very harmful for the utilized ML methods; however, it should be addressed at the beginning.
library(corrplot)
library(vtable)
testRes = cor.mtest(matchesF, conf.level = 0.95) # Significance of the correlation
corrplot(cor(matchesF), p.mat = testRes$p, type = 'lower', method = 'number',
insig = 'blank', tl.cex=3)
# myTable = summary_table(matchesF)
# myTable
# table(matchesF)
The numbers of goals scored in the first half and over the whole match are highly correlated (0.7) – an intuitive result. High correlation is also observed between ball possession and the percentage of successful passes – this is also intuitive, since a team with high possession must pass the ball frequently and accurately.
sumtable(matchesF)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| GoalsHT | 5320 | 0.608 | 0.802 | 0 | 0 | 1 | 5 |
| GoalsFT | 5320 | 1.356 | 1.248 | 0 | 0 | 2 | 9 |
| BallPos | 5320 | 0.5 | 0.125 | 0.17 | 0.41 | 0.59 | 0.83 |
| ShotsOffTarget | 5320 | 4.877 | 2.609 | 0 | 3 | 6 | 16 |
| ShotsOnTarget | 5320 | 4.279 | 2.487 | 0 | 2 | 6 | 17 |
| BlockedShots | 5320 | 3.43 | 2.415 | 0 | 2 | 5 | 19 |
| Corners | 5320 | 5.24 | 2.962 | 0 | 3 | 7 | 19 |
| PassSuccPerc | 5320 | 0.776 | 0.076 | 0.48 | 0.73 | 0.83 | 0.94 |
| AerialsWon | 5320 | 18.201 | 7.362 | 1 | 13 | 23 | 67 |
| Fouls | 5320 | 10.814 | 3.481 | 0 | 8 | 13 | 26 |
| YellowCards | 5320 | 1.665 | 1.256 | 0 | 1 | 2 | 9 |
| RedCards | 5320 | 0.037 | 0.188 | 0 | 0 | 0 | 1 |
| home | 5320 | 0.5 | 0.5 | 0 | 0 | 1 | 1 |
The number of observations equals 5320. The variable with the lowest mean is the binary red-card indicator (0.037), and the variable with the highest mean is the number of aerial duels won by the team (18.2). The red-card variable is also highly dispersed – its standard deviation is about five times its mean.
table(matchesF$home)
##
## 0 1
## 2660 2660
The data is perfectly balanced, since for every host team there must be exactly one guest team.
We divide the data set into a training and a testing sample – the training sample contains 70% of the observations and the remaining 30% is reserved for testing.
library(caret)
set.seed(123456789)
training_obs <- createDataPartition(matchesF$home,
p = 0.7,
list = FALSE)
matches.train <- matchesF[training_obs,]
matches.test <- matchesF[-training_obs,]
Let us define the formula of the model used for every method.
model1.formula <- home ~ GoalsHT + GoalsFT + BallPos + ShotsOffTarget + ShotsOnTarget +
BlockedShots + Corners + PassSuccPerc + AerialsWon + Fouls + YellowCards + RedCards
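Since matchesF contains exactly these twelve predictors and the target, an equivalent shorthand would be the dot formula (model1.formula.alt is a hypothetical name, shown only for reference):
# Equivalent: all remaining columns of the data serve as predictors
model1.formula.alt <- home ~ .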
First, we use the classical econometric method, which will serve as a benchmark for the other methods.
matches.logit = glm(formula = model1.formula, data = matches.train, family = binomial(link = 'logit'))
summary(matches.logit)
##
## Call:
## glm(formula = model1.formula, family = binomial(link = "logit"),
## data = matches.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1666 -1.0983 -0.1234 1.1303 1.7621
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.857702 0.543155 -1.579 0.11431
## GoalsHT 0.118168 0.060397 1.957 0.05041 .
## GoalsFT 0.075215 0.044843 1.677 0.09349 .
## BallPos -0.859233 0.498681 -1.723 0.08489 .
## ShotsOffTarget 0.102572 0.015165 6.764 1.35e-11 ***
## ShotsOnTarget 0.053454 0.018992 2.815 0.00489 **
## BlockedShots 0.072680 0.017614 4.126 3.69e-05 ***
## Corners 0.073989 0.014765 5.011 5.41e-07 ***
## PassSuccPerc -0.071209 0.808122 -0.088 0.92978
## AerialsWon 0.001282 0.005495 0.233 0.81548
## Fouls -0.005891 0.010765 -0.547 0.58418
## YellowCards -0.094950 0.029946 -3.171 0.00152 **
## RedCards 0.061361 0.182208 0.337 0.73630
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5162.6 on 3723 degrees of freedom
## Residual deviance: 4877.9 on 3711 degrees of freedom
## AIC: 4903.9
##
## Number of Fisher Scoring iterations: 4
Many of the variables in the model are statistically significant.
The percentage of successful passes, the number of aerial duels won, the number of fouls committed, and receiving a red card are not statistically significant for predicting whether the team was a host or a guest.
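To interpret the logit coefficients on the odds scale, they can be exponentiated (a standard transformation, shown here as an aside):
# Odds ratios: e.g. one additional shot off target multiplies the odds
# of being the home team by exp(0.1026), i.e. about 1.11
exp(coef(matches.logit))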
Let us look at the confusion matrices:
pred.logit.train <- predict(matches.logit,
matches.train,
type = "response")
logit.train.confMat = confusionMatrix(data = as.factor(ifelse(pred.logit.train > 0.5, 1, 0)), # predictions
# actual values
reference = as.factor(matches.train$home),
# definitions of the "success" label
positive = "1")
logit.train.confMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1204 804
## 1 658 1058
##
## Accuracy : 0.6074
## 95% CI : (0.5915, 0.6231)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2148
##
## Mcnemar's Test P-Value : 0.0001493
##
## Sensitivity : 0.5682
## Specificity : 0.6466
## Pos Pred Value : 0.6166
## Neg Pred Value : 0.5996
## Prevalence : 0.5000
## Detection Rate : 0.2841
## Detection Prevalence : 0.4608
## Balanced Accuracy : 0.6074
##
## 'Positive' Class : 1
##
pred.logit.test <- predict(matches.logit,
matches.test,
type = "response")
confusionMatrix.logit.test = confusionMatrix(data = as.factor(ifelse(pred.logit.test > 0.5, 1, 0)), # predictions
# actual values
reference = as.factor(matches.test$home),
# definitions of the "success" label
positive = "1")
confusionMatrix.logit.test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 519 353
## 1 279 445
##
## Accuracy : 0.604
## 95% CI : (0.5795, 0.6281)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.208
##
## Mcnemar's Test P-Value : 0.003687
##
## Sensitivity : 0.5576
## Specificity : 0.6504
## Pos Pred Value : 0.6146
## Neg Pred Value : 0.5952
## Prevalence : 0.5000
## Detection Rate : 0.2788
## Detection Prevalence : 0.4536
## Balanced Accuracy : 0.6040
##
## 'Positive' Class : 1
##
Accuracy for both the training and the testing sample is around 0.6.
ROC curves with the corresponding Gini coefficients:
library(pROC)
pred.train.logit = predict(matches.logit, matches.train, type = "response")
ROC.logit.train = roc((matches.train$home == 1),
pred.train.logit)
pred.test.logit = predict(matches.logit, matches.test, type = "response")
ROC.logit.test = roc((matches.test$home == 1),
pred.test.logit)
list(
ROC.logit.train = ROC.logit.train,
ROC.logit.test = ROC.logit.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
round(100*(2 * auc(ROC.logit.train) - 1), 1), "%, ",
"Gini TEST: ",
round(100*(2 * auc(ROC.logit.test) - 1), 1), "%, ")) +
theme_bw() + coord_fixed() +
# scale_color_brewer(palette = "Paired") +
scale_color_manual(values = RColorBrewer::brewer.pal(n = 4,
name = "Paired")[c(1, 3, 5)])
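For reference, the Gini coefficient reported in the plot subtitle is derived from the AUC as Gini = 2 * AUC - 1 and can also be computed directly:
# Gini coefficient of the logit model on the test sample
2 * pROC::auc(ROC.logit.test) - 1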
The model is definitely not overfitted – the Gini coefficient values are almost the same for both samples.
The second method used is classification trees.
library(rpart)
library(rpart.plot)
matches.tree <-
rpart(model1.formula, # model formula
data = matches.train, # data
method = "class") # type of the tree: classification
library(rattle)
matches.tree
## n= 3724
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3724 1862 0 (0.5000000 0.5000000)
## 2) Corners< 6.5 2572 1149 0 (0.5532659 0.4467341)
## 4) ShotsOffTarget< 4.5 1456 575 0 (0.6050824 0.3949176) *
## 5) ShotsOffTarget>=4.5 1116 542 1 (0.4856631 0.5143369)
## 10) ShotsOnTarget< 5.5 834 395 0 (0.5263789 0.4736211)
## 20) BlockedShots< 3.5 525 227 0 (0.5676190 0.4323810) *
## 21) BlockedShots>=3.5 309 141 1 (0.4563107 0.5436893) *
## 11) ShotsOnTarget>=5.5 282 103 1 (0.3652482 0.6347518) *
## 3) Corners>=6.5 1152 439 1 (0.3810764 0.6189236) *
# rpart.plot(matches.tree)
fancyRpartPlot(matches.tree)
We have 5 terminal nodes.
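The number of terminal nodes can also be checked programmatically (a quick sketch; terminal nodes are marked as "<leaf>" in the rpart frame):
# Count the terminal nodes of the first tree
sum(matches.tree$frame$var == "<leaf>")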
summary(matches.tree)
## Call:
## rpart(formula = model1.formula, data = matches.train, method = "class")
## n= 3724
##
## CP nsplit rel error xerror xstd
## 1 0.14715360 0 1.0000000 1.0477981 0.01636811
## 2 0.02040816 1 0.8528464 0.8689581 0.01624553
## 3 0.01450054 3 0.8120301 0.8453276 0.01618964
## 4 0.01000000 4 0.7975295 0.8367347 0.01616696
##
## Variable importance
## Corners ShotsOffTarget BlockedShots ShotsOnTarget BallPos
## 45 20 12 12 7
## PassSuccPerc GoalsFT GoalsHT
## 2 2 1
##
## Node number 1: 3724 observations, complexity param=0.1471536
## predicted class=0 expected loss=0.5 P(node) =1
## class counts: 1862 1862
## probabilities: 0.500 0.500
## left son=2 (2572 obs) right son=3 (1152 obs)
## Primary splits:
## Corners < 6.5 to the left, improve=47.17994, (0 missing)
## ShotsOffTarget < 4.5 to the left, improve=40.65040, (0 missing)
## ShotsOnTarget < 4.5 to the left, improve=38.49781, (0 missing)
## BlockedShots < 4.5 to the left, improve=33.99681, (0 missing)
## BallPos < 0.395 to the left, improve=23.75711, (0 missing)
## Surrogate splits:
## BlockedShots < 5.5 to the left, agree=0.733, adj=0.138, (0 split)
## BallPos < 0.645 to the left, agree=0.726, adj=0.115, (0 split)
## ShotsOffTarget < 9.5 to the left, agree=0.710, adj=0.062, (0 split)
## ShotsOnTarget < 8.5 to the left, agree=0.697, adj=0.021, (0 split)
## PassSuccPerc < 0.875 to the left, agree=0.693, adj=0.008, (0 split)
##
## Node number 2: 2572 observations, complexity param=0.02040816
## predicted class=0 expected loss=0.4467341 P(node) =0.6906552
## class counts: 1423 1149
## probabilities: 0.553 0.447
## left son=4 (1456 obs) right son=5 (1116 obs)
## Primary splits:
## ShotsOffTarget < 4.5 to the left, improve=18.01913, (0 missing)
## ShotsOnTarget < 4.5 to the left, improve=17.03385, (0 missing)
## GoalsFT < 1.5 to the left, improve=14.84674, (0 missing)
## BlockedShots < 3.5 to the left, improve=14.27694, (0 missing)
## GoalsHT < 0.5 to the left, improve=10.63054, (0 missing)
## Surrogate splits:
## BallPos < 0.515 to the left, agree=0.617, adj=0.117, (0 split)
## PassSuccPerc < 0.825 to the left, agree=0.600, adj=0.078, (0 split)
## BlockedShots < 4.5 to the left, agree=0.593, adj=0.063, (0 split)
## Corners < 5.5 to the left, agree=0.584, adj=0.040, (0 split)
## ShotsOnTarget < 8.5 to the left, agree=0.576, adj=0.023, (0 split)
##
## Node number 3: 1152 observations
## predicted class=1 expected loss=0.3810764 P(node) =0.3093448
## class counts: 439 713
## probabilities: 0.381 0.619
##
## Node number 4: 1456 observations
## predicted class=0 expected loss=0.3949176 P(node) =0.3909774
## class counts: 881 575
## probabilities: 0.605 0.395
##
## Node number 5: 1116 observations, complexity param=0.02040816
## predicted class=1 expected loss=0.4856631 P(node) =0.2996778
## class counts: 542 574
## probabilities: 0.486 0.514
## left son=10 (834 obs) right son=11 (282 obs)
## Primary splits:
## ShotsOnTarget < 5.5 to the left, improve=10.943020, (0 missing)
## BlockedShots < 4.5 to the left, improve= 8.037236, (0 missing)
## ShotsOffTarget < 7.5 to the left, improve= 5.699053, (0 missing)
## GoalsFT < 1.5 to the left, improve= 5.278801, (0 missing)
## Corners < 4.5 to the left, improve= 4.547056, (0 missing)
## Surrogate splits:
## GoalsFT < 2.5 to the left, agree=0.793, adj=0.181, (0 split)
## GoalsHT < 2.5 to the left, agree=0.762, adj=0.057, (0 split)
## PassSuccPerc < 0.895 to the left, agree=0.760, adj=0.050, (0 split)
## BallPos < 0.735 to the left, agree=0.750, adj=0.011, (0 split)
## ShotsOffTarget < 12.5 to the left, agree=0.749, adj=0.007, (0 split)
##
## Node number 10: 834 observations, complexity param=0.01450054
## predicted class=0 expected loss=0.4736211 P(node) =0.2239527
## class counts: 439 395
## probabilities: 0.526 0.474
## left son=20 (525 obs) right son=21 (309 obs)
## Primary splits:
## BlockedShots < 3.5 to the left, improve=4.819893, (0 missing)
## GoalsFT < 3.5 to the left, improve=4.188806, (0 missing)
## ShotsOffTarget < 10.5 to the left, improve=2.907622, (0 missing)
## BallPos < 0.285 to the left, improve=1.915287, (0 missing)
## Corners < 0.5 to the left, improve=1.557683, (0 missing)
## Surrogate splits:
## Corners < 4.5 to the left, agree=0.658, adj=0.078, (0 split)
## BallPos < 0.635 to the left, agree=0.652, adj=0.061, (0 split)
## ShotsOffTarget < 7.5 to the left, agree=0.645, adj=0.042, (0 split)
## PassSuccPerc < 0.895 to the left, agree=0.638, adj=0.023, (0 split)
## AerialsWon < 5.5 to the right, agree=0.632, adj=0.006, (0 split)
##
## Node number 11: 282 observations
## predicted class=1 expected loss=0.3652482 P(node) =0.07572503
## class counts: 103 179
## probabilities: 0.365 0.635
##
## Node number 20: 525 observations
## predicted class=0 expected loss=0.432381 P(node) =0.1409774
## class counts: 298 227
## probabilities: 0.568 0.432
##
## Node number 21: 309 observations
## predicted class=1 expected loss=0.4563107 P(node) =0.0829753
## class counts: 141 168
## probabilities: 0.456 0.544
pred.tree <- predict(matches.tree,
matches.train,
type = "class")
confusionMatrix(data = pred.tree, # predictions
# actual values
reference = as.factor(matches.train$home),
# definitions of the "success" label
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1179 802
## 1 683 1060
##
## Accuracy : 0.6012
## 95% CI : (0.5853, 0.617)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2025
##
## Mcnemar's Test P-Value : 0.002198
##
## Sensitivity : 0.5693
## Specificity : 0.6332
## Pos Pred Value : 0.6081
## Neg Pred Value : 0.5952
## Prevalence : 0.5000
## Detection Rate : 0.2846
## Detection Prevalence : 0.4680
## Balanced Accuracy : 0.6012
##
## 'Positive' Class : 1
##
The accuracy of this tree equals 0.6 – the same as for the logistic regression.
Let us try a bigger tree:
matches.tree.big <-
rpart(model1.formula,
data = matches.train,
method = "class",
minsplit = 500, # ~ 2% of the training set
minbucket = 250, # ~ 1% of the training set
maxdepth = 30, # default
cp = -1)
matches.tree.big
## n= 3724
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3724 1862 0 (0.5000000 0.5000000)
## 2) Corners< 6.5 2572 1149 0 (0.5532659 0.4467341)
## 4) ShotsOffTarget< 4.5 1456 575 0 (0.6050824 0.3949176)
## 8) GoalsHT< 0.5 774 264 0 (0.6589147 0.3410853)
## 16) Fouls>=9.5 485 151 0 (0.6886598 0.3113402) *
## 17) Fouls< 9.5 289 113 0 (0.6089965 0.3910035) *
## 9) GoalsHT>=0.5 682 311 0 (0.5439883 0.4560117)
## 18) AerialsWon>=15.5 386 160 0 (0.5854922 0.4145078) *
## 19) AerialsWon< 15.5 296 145 1 (0.4898649 0.5101351) *
## 5) ShotsOffTarget>=4.5 1116 542 1 (0.4856631 0.5143369)
## 10) ShotsOnTarget< 5.5 834 395 0 (0.5263789 0.4736211)
## 20) BlockedShots< 3.5 525 227 0 (0.5676190 0.4323810)
## 40) AerialsWon< 18.5 260 107 0 (0.5884615 0.4115385) *
## 41) AerialsWon>=18.5 265 120 0 (0.5471698 0.4528302) *
## 21) BlockedShots>=3.5 309 141 1 (0.4563107 0.5436893) *
## 11) ShotsOnTarget>=5.5 282 103 1 (0.3652482 0.6347518) *
## 3) Corners>=6.5 1152 439 1 (0.3810764 0.6189236)
## 6) ShotsOffTarget< 7.5 842 354 1 (0.4204276 0.5795724)
## 12) BallPos< 0.605 553 255 1 (0.4611212 0.5388788)
## 24) ShotsOffTarget< 4.5 255 126 1 (0.4941176 0.5058824) *
## 25) ShotsOffTarget>=4.5 298 129 1 (0.4328859 0.5671141) *
## 13) BallPos>=0.605 289 99 1 (0.3425606 0.6574394) *
## 7) ShotsOffTarget>=7.5 310 85 1 (0.2741935 0.7258065) *
fancyRpartPlot(matches.tree.big)
pred.tree.big <- predict(matches.tree.big,
matches.train,
type = "class")
confusionMatrix(data = pred.tree.big, # predictions
# actual values
reference = as.factor(matches.train$home),
# definitions of the "success" label
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1034 651
## 1 828 1211
##
## Accuracy : 0.6028
## 95% CI : (0.5869, 0.6186)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2057
##
## Mcnemar's Test P-Value : 4.729e-06
##
## Sensitivity : 0.6504
## Specificity : 0.5553
## Pos Pred Value : 0.5939
## Neg Pred Value : 0.6136
## Prevalence : 0.5000
## Detection Rate : 0.3252
## Detection Prevalence : 0.5475
## Balanced Accuracy : 0.6028
##
## 'Positive' Class : 1
##
Letting the tree grow bigger did not yield much better results.
Now the bigger tree will be pruned.
opt <- which.min(matches.tree.big$cptable[, "xerror"])
cp <- matches.tree.big$cptable[opt, "CP"]
matches.tree.pruned <-
prune(matches.tree.big, cp = cp)
matches.tree.pruned
## n= 3724
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3724 1862 0 (0.5000000 0.5000000)
## 2) Corners< 6.5 2572 1149 0 (0.5532659 0.4467341)
## 4) ShotsOffTarget< 4.5 1456 575 0 (0.6050824 0.3949176)
## 8) GoalsHT< 0.5 774 264 0 (0.6589147 0.3410853) *
## 9) GoalsHT>=0.5 682 311 0 (0.5439883 0.4560117)
## 18) AerialsWon>=15.5 386 160 0 (0.5854922 0.4145078) *
## 19) AerialsWon< 15.5 296 145 1 (0.4898649 0.5101351) *
## 5) ShotsOffTarget>=4.5 1116 542 1 (0.4856631 0.5143369)
## 10) ShotsOnTarget< 5.5 834 395 0 (0.5263789 0.4736211)
## 20) BlockedShots< 3.5 525 227 0 (0.5676190 0.4323810) *
## 21) BlockedShots>=3.5 309 141 1 (0.4563107 0.5436893) *
## 11) ShotsOnTarget>=5.5 282 103 1 (0.3652482 0.6347518) *
## 3) Corners>=6.5 1152 439 1 (0.3810764 0.6189236) *
fancyRpartPlot(matches.tree.pruned)
The pruned tree is close to the first tree, but it retains two additional splits – on GoalsHT and AerialsWon – and therefore has 7 terminal nodes instead of 5.
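A quick structural comparison supports this (a sketch; each row of an rpart frame corresponds to one node):
# Total number of nodes in the first and the pruned tree
c(first = nrow(matches.tree$frame),
  pruned = nrow(matches.tree.pruned$frame))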
pred.tree.train = predict(matches.tree, matches.train)
pred.big.tree.train <- predict(matches.tree.big, matches.train)
pred.tree.pruned.train <- predict(matches.tree.pruned, matches.train)
ROC.tree.train = roc((matches.train$home == 1),
pred.tree.train[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
ROC.tree.big.train <- roc((matches.train$home == 1),
pred.big.tree.train[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
ROC.tree.pruned.train <- roc((matches.train$home == 1),
pred.tree.pruned.train[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
list(
ROC.tree.train = ROC.tree.train,
ROC.tree.big.train = ROC.tree.big.train,
ROC.tree.pruned.train = ROC.tree.pruned.train
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"tree = ",
round(100*(2 * auc(ROC.tree.train) - 1), 1), "%, ",
"big = ",
round(100*(2 * auc(ROC.tree.big.train) - 1), 1), "%, ",
"pruned = ",
round(100*(2 * auc(ROC.tree.pruned.train) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
# scale_color_brewer(palette = "Paired") +
scale_color_manual(values = RColorBrewer::brewer.pal(n = 4,
name = "Paired")[c(1, 3, 5)])
The pruned big tree gives better results than the first tree.
Now for the testing set.
pred.tree.test = predict(matches.tree, matches.test)
pred.tree.big.test <- predict(matches.tree.big, matches.test)
pred.tree.pruned.test <- predict(matches.tree.pruned, matches.test)
ROC.tree.test = roc((matches.test$home == 1),
pred.tree.test[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
ROC.big.tree.test <- roc((matches.test$home == 1),
pred.tree.big.test[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
ROC.tree.pruned.test <- roc((matches.test$home == 1),
pred.tree.pruned.test[, 1])
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls > cases
list(
ROC.tree.train = ROC.tree.train,
ROC.tree.big.train = ROC.tree.big.train,
ROC.tree.pruned.train = ROC.tree.pruned.train,
ROC.tree.test = ROC.tree.test,
ROC.big.tree.test = ROC.big.tree.test,
ROC.tree.pruned.test = ROC.tree.pruned.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"tree = ",
round(100*(2 * auc(ROC.tree.train) - 1), 1), "%, ",
"big = ",
round(100*(2 * auc(ROC.tree.big.train) - 1), 1), "%, ",
"pruned = ",
round(100*(2 * auc(ROC.tree.pruned.train) - 1), 1), "% ",
"Gini TEST: ",
"tree = ",
round(100*(2 * auc(ROC.tree.test) - 1), 1), "%, ",
"big = ",
round(100*(2 * auc(ROC.big.tree.test) - 1), 1), "%, ",
"pruned = ",
round(100*(2 * auc(ROC.tree.pruned.test) - 1), 1), "% "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
# scale_color_manual(values = RColorBrewer::brewer.pal(n = 4, name = "Paired")[c(1, 3, 5)])
The pruned big tree also works best for the test data set.
confusionMatrix.tree.pruned.test = confusionMatrix(data = predict(matches.tree.pruned,
matches.test,
type = "class"), # predictions
# actual values
reference = as.factor(matches.test$home),
# definitions of the "success" label
positive = "1")
Let us check which variables are the most important in the best tree.
tree.pruned.importance <- matches.tree.pruned$variable.importance
par(mar = c(5.1, 6.1, 4.1, 2.1))
barplot(rev(tree.pruned.importance), # vector of variable importances
col = "blue", # colors
main = "Importance of variables in model matches.tree.pruned",
horiz = T, # horizontal type of plot
las = 1, # labels always horizontal
cex.names = 0.6)
We see that the most important variables are the number of corners, the numbers of shots off and on target, and the number of blocked shots. Comparing these results with the logistic regression output, we see that the ordering of the variables is the same.
The next step was to develop a random forest.
## Comment from classes
# For the classification analysis (nominal target variable) we have to transform the dependent variable to factor. Otherwise, if we leave the target variable as numeric - the regression analysis will be performed.
str(matches.train) # Now the 'home' variable is numeric
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : num [1:3724] 1 1 0 1 0 1 0 1 0 1 ...
matches.train$home = as.factor(matches.train$home)
matches.test$home = as.factor(matches.test$home)
str(matches.train)
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 1 2 ...
levels(matches.train$home) = c('no', 'yes')
levels(matches.test$home) = c('no', 'yes')
str(matches.train) # Change from '0/1' to 'no/yes'
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "no","yes": 2 2 1 2 1 2 1 2 1 2 ...
We build a random forest with default values.
library(randomForest)
set.seed(123456789)
matches.rf <- randomForest(model1.formula,
data = matches.train)
print(matches.rf)
##
## Call:
## randomForest(formula = model1.formula, data = matches.train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 41.49%
## Confusion matrix:
## no yes class.error
## no 1135 727 0.3904404
## yes 818 1044 0.4393126
plot(matches.rf)
The OOB error rate is 41.49%. Let us try a bigger forest.
matches.rf2 <-
randomForest(model1.formula,
data = matches.train,
ntree = 500,
sampsize = nrow(matches.train),
mtry = 8, # number of predictors sampled for splitting
nodesize = 200, # minimum number of obs in the terminal nodes
importance = TRUE) # we also generate predictors importance measures
print(matches.rf2)
##
## Call:
## randomForest(formula = model1.formula, data = matches.train, ntree = 500, sampsize = nrow(matches.train), mtry = 8, nodesize = 200, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 39.77%
## Confusion matrix:
## no yes class.error
## no 1185 677 0.3635875
## yes 804 1058 0.4317938
plot(matches.rf2)
The OOB error rate is slightly lower – 39.77%.
Now we optimize the hyperparameters, starting with mtry:
parameters_rf <- expand.grid(mtry = 2:12)
ctrl_oob <- trainControl(method = "oob", classProbs = TRUE)
library(here)
## here() starts at C:/Users/bpop/OneDrive/R/ML2/proj
set.seed(123456789)
matches.rf3 <-
train(model1.formula,
data = matches.train,
method = "rf",
ntree = 100,
nodesize = 100,
tuneGrid = parameters_rf,
trControl = ctrl_oob,
importance = TRUE)
# saving the object to the external file
# saveRDS(object = matches.rf3,
# file = here("matches.rf3.rds"))
# matches.rf3 <- readRDS(here("matches.rf3.rds"))
matches.rf3
## Random Forest
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.5923738 0.1847476
## 3 0.6033835 0.2067669
## 4 0.5950591 0.1901182
## 5 0.5945220 0.1890440
## 6 0.6068743 0.2137487
## 7 0.5996241 0.1992481
## 8 0.5982814 0.1965628
## 9 0.5931794 0.1863588
## 10 0.5950591 0.1901182
## 11 0.5972073 0.1944146
## 12 0.5996241 0.1992481
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
The highest accuracy is obtained for mtry = 6.
plot(matches.rf3$results$mtry, matches.rf3$results$Accuracy, type = "b")
Now we use the ‘ranger’ package for random forest estimation and hyperparameter optimization.
parameters_ranger <-
expand.grid(mtry = 2:12,
# split rule
splitrule = "gini",
# minimum size of the terminal node
min.node.size = c(100, 250, 500))
ctrl_cv5 <- trainControl(method = "cv",
number = 5,
classProbs = T)
set.seed(123456789)
matches.rf3a <-
train(model1.formula,
data = matches.train,
method = "ranger",
num.trees = 100, # default = 500
# numbers of processor cores to use in computations
num.threads = 3,
# impurity measure
importance = "impurity",
# parameters
tuneGrid = parameters_ranger,
trControl = ctrl_cv5)
# saving the object to the external file
# saveRDS(matches.rf3a, here("matches.rf3a.rds"))
matches.rf3a
## Random Forest
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2980, 2978, 2979, 2979, 2980
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 2 100 0.5964123 0.1928502
## 2 250 0.5972155 0.1944611
## 2 500 0.5896959 0.1794188
## 3 100 0.5953414 0.1907253
## 3 250 0.6001729 0.2003755
## 3 500 0.5961446 0.1923209
## 4 100 0.5945407 0.1891249
## 4 250 0.6017883 0.2036058
## 4 500 0.5977543 0.1955444
## 5 100 0.5964113 0.1928565
## 5 250 0.5958754 0.1917920
## 5 500 0.5974862 0.1950079
## 6 100 0.5982980 0.1966212
## 6 250 0.5969511 0.1939298
## 6 500 0.5990933 0.1982128
## 7 100 0.5953356 0.1907026
## 7 250 0.5942661 0.1885730
## 7 500 0.5977553 0.1955500
## 8 100 0.5974887 0.1950032
## 8 250 0.5993650 0.1987721
## 8 500 0.5953374 0.1907064
## 9 100 0.5931927 0.1864143
## 9 250 0.5956113 0.1912601
## 9 500 0.5972188 0.1944705
## 10 100 0.5910461 0.1821271
## 10 250 0.5940020 0.1880441
## 10 500 0.5974858 0.1950077
## 11 100 0.5956106 0.1912455
## 11 250 0.5958776 0.1917976
## 11 500 0.5939998 0.1880398
## 12 100 0.5934615 0.1869561
## 12 250 0.6001751 0.2003917
## 12 500 0.5980249 0.1960847
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
## and min.node.size = 250.
plot(matches.rf3a)
With the ‘ranger’ package the best results are obtained for mtry = 4 and min.node.size = 250.
pred.rf.train <- predict(matches.rf,
matches.train,
type = "prob")[, "yes"]
ROC.rf.train <- roc(as.numeric(matches.train$home == "yes"),
pred.rf.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf.test <- predict(matches.rf,
matches.test,
type = "prob")[, "yes"]
ROC.rf.test <- roc(as.numeric(matches.test$home == "yes"),
pred.rf.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf2.train <- predict(matches.rf2,
matches.train,
type = "prob")[, "yes"]
ROC.rf2.train <- roc(as.numeric(matches.train$home == "yes"),
pred.rf2.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf2.test <- predict(matches.rf2,
matches.test,
type = "prob")[, "yes"]
ROC.rf2.test <- roc(as.numeric(matches.test$home == "yes"),
pred.rf2.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf3.train <- predict(matches.rf3,
matches.train,
type = "prob")[, "yes"]
ROC.rf3.train <- roc(as.numeric(matches.train$home == "yes"),
pred.rf3.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf3.test <- predict(matches.rf3,
matches.test,
type = "prob")[, "yes"]
ROC.rf3.test <- roc(as.numeric(matches.test$home == "yes"),
pred.rf3.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf3a.train <- predict(matches.rf3a,
matches.train,
type = "prob")[, "yes"]
ROC.rf3a.train <- roc(as.numeric(matches.train$home == "yes"),
pred.rf3a.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred.rf3a.test <- predict(matches.rf3a,
matches.test,
type = "prob")[, "yes"]
ROC.rf3a.test <- roc(as.numeric(matches.test$home == "yes"),
pred.rf3a.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
list(
ROC.rf.train = ROC.rf.train,
ROC.rf.test = ROC.rf.test,
ROC.rf2.train = ROC.rf2.train,
ROC.rf2.test = ROC.rf2.test,
ROC.rf3.train = ROC.rf3.train,
ROC.rf3.test = ROC.rf3.test,
ROC.rf3a.train = ROC.rf3a.train,
ROC.rf3a.test = ROC.rf3a.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"rf = ",
round(100 * (2 * auc(ROC.rf.train) - 1), 1), "%, ",
"rf2 = ",
round(100 * (2 * auc(ROC.rf2.train) - 1), 1), "%, ",
"rf3 = ",
round(100 * (2 * auc(ROC.rf3.train) - 1), 1), "%, ",
"rf3a = ",
round(100 * (2 * auc(ROC.rf3a.train) - 1), 1), "%, ",
"\nGini TEST: ",
"rf = ",
round(100 * (2 * auc(ROC.rf.test) - 1), 1), "%, ",
"rf2 = ",
round(100 * (2 * auc(ROC.rf2.test) - 1), 1), "%, ",
"rf3 = ",
round(100 * (2 * auc(ROC.rf3.test) - 1), 1), "%, ",
"rf3a = ",
round(100 * (2 * auc(ROC.rf3a.test) - 1), 1), "% "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
The model built with the default values (rf) is overfitted. The best results for the testing sample are given by model rf2, with parameters ntree = 500, mtry = 8 and nodesize = 200:
pred.rf2.test <- predict(matches.rf2,
matches.test,
type = "response")
confusionMatrix.rf2.test = confusionMatrix(data = pred.rf2.test, # predictions
# actual values
reference = as.factor(matches.test$home),
# definitions of the "success" label
positive = "yes")
varImpPlot(matches.rf2,
sort = TRUE,
main = "Importance of predictors",
n.var = 12,
type = 1) # mean decrease in accuracy
For this model the most important variables are almost the same as for the previously developed models.
The next utilized ML method is Extreme Gradient Boosting (XGBoost).
parameters_xgb <- expand.grid(nrounds = seq(20, 80, 10),
max_depth = c(8),
eta = c(0.25),
gamma = 1,
colsample_bytree = c(0.2),
min_child_weight = c(50),
subsample = 0.8)
ctrl_cv3 <- trainControl(method = "cv",
number = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(123456789)
matches.xgb <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb)
## [20:12:23] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## (the same warning is repeated for every resampling fit)
matches.xgb
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results across tuning parameters:
##
## nrounds ROC Sens Spec
## 20 0.6270362 0.6074039 0.5762662
## 30 0.6324540 0.6122409 0.5730430
## 40 0.6332113 0.6116981 0.5746515
## 50 0.6338745 0.6095579 0.5708933
## 60 0.6329422 0.6031046 0.5816330
## 70 0.6312691 0.6009549 0.5730334
## 80 0.6311208 0.5998805 0.5810971
##
## Tuning parameter 'max_depth' was held constant at a value of 8
## Tuning
## parameter 'min_child_weight' was held constant at a value of 50
##
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 8, eta
## = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 50 and
## subsample = 0.8.
pred.xgb.train = predict(matches.xgb, matches.train, type = 'raw')
confusionMatrix(data = pred.xgb.train, # predictions
# actual values
reference = as.factor(matches.train$home),
# definitions of the "success" label
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1209 724
## yes 653 1138
##
## Accuracy : 0.6302
## 95% CI : (0.6145, 0.6458)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.2605
##
## Mcnemar's Test P-Value : 0.05924
##
## Sensitivity : 0.6112
## Specificity : 0.6493
## Pos Pred Value : 0.6354
## Neg Pred Value : 0.6255
## Prevalence : 0.5000
## Detection Rate : 0.3056
## Detection Prevalence : 0.4809
## Balanced Accuracy : 0.6302
##
## 'Positive' Class : yes
##
The accuracy is 0.63 – slightly better than for the previous models.
Next, we tune max_depth and min_child_weight:
parameters_xgb2 <- expand.grid(nrounds = 50,
max_depth = seq(5, 15, 2),
eta = c(0.25),
gamma = 1,
colsample_bytree = c(0.2),
min_child_weight = seq(50, 500, 50),
subsample = 0.8)
set.seed(123456789)
matches.xgb2 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb2)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb2
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results across tuning parameters:
##
## max_depth min_child_weight ROC Sens Spec
## 5 50 0.6378535 0.6084723 0.5800184
## 5 100 0.6389321 0.6052586 0.5821732
## 5 150 0.6366388 0.6127725 0.5902317
## 5 200 0.6363816 0.6047106 0.5923788
## 5 250 0.5341034 0.5402802 0.5279267
## 5 300 0.5000000 0.3333333 0.6666667
## 5 350 0.5000000 1.0000000 0.0000000
## 5 400 0.5000000 0.3333333 0.6666667
## 5 450 0.5000000 0.6666667 0.3333333
## 5 500 0.5000000 0.3333333 0.6666667
## 7 50 0.6362183 0.6074065 0.5837809
## 7 100 0.6412749 0.6025722 0.6068732
## 7 150 0.6373810 0.6127682 0.5870067
## 7 200 0.6341959 0.6073970 0.5843056
## 7 250 0.5343455 0.5928783 0.4586982
## 7 300 0.5000000 0.6666667 0.3333333
## 7 350 0.5000000 0.0000000 1.0000000
## 7 400 0.5000000 0.3333333 0.6666667
## 7 450 0.5000000 0.3333333 0.6666667
## 7 500 0.5000000 0.3333333 0.6666667
## 9 50 0.6317361 0.6020241 0.5735797
## 9 100 0.6415965 0.5982823 0.6084861
## 9 150 0.6365760 0.6090065 0.5810962
## 9 200 0.6349720 0.5993515 0.5902317
## 9 250 0.5247139 0.6873548 0.3620730
## 9 300 0.5000000 1.0000000 0.0000000
## 9 350 0.5000000 0.6666667 0.3333333
## 9 400 0.5000000 0.6666667 0.3333333
## 9 450 0.5000000 0.3333333 0.6666667
## 9 500 0.5000000 0.3333333 0.6666667
## 11 50 0.6326365 0.5993429 0.5848675
## 11 100 0.6386517 0.5988148 0.5966694
## 11 150 0.6345797 0.6111518 0.5896975
## 11 200 0.6329684 0.5982745 0.5886283
## 11 250 0.5367916 0.5434947 0.5300885
## 11 300 0.5000000 0.6666667 0.3333333
## 11 350 0.5000000 0.3333333 0.6666667
## 11 400 0.5000000 0.3333333 0.6666667
## 11 450 0.5000000 0.3333333 0.6666667
## 11 500 0.5000000 0.3333333 0.6666667
## 13 50 0.6366088 0.6009532 0.5821706
## 13 100 0.6416538 0.5982780 0.5923762
## 13 150 0.6360044 0.6090021 0.5870093
## 13 200 0.6340803 0.5988113 0.5923788
## 13 250 0.5423311 0.5413537 0.5327654
## 13 300 0.5000000 0.3333333 0.6666667
## 13 350 0.5000000 0.6666667 0.3333333
## 13 400 0.5000000 0.0000000 1.0000000
## 13 450 0.5000000 0.6666667 0.3333333
## 13 500 0.5000000 0.6666667 0.3333333
## 15 50 0.6359817 0.6090134 0.5816373
## 15 100 0.6391035 0.6025670 0.5907710
## 15 150 0.6358928 0.6165290 0.5768047
## 15 200 0.6354595 0.5971967 0.5848501
## 15 250 0.5367916 0.5434947 0.5300885
## 15 300 0.5000000 1.0000000 0.0000000
## 15 350 0.5000000 0.6666667 0.3333333
## 15 400 0.5000000 0.0000000 1.0000000
## 15 450 0.5000000 0.3333333 0.6666667
## 15 500 0.5000000 0.3333333 0.6666667
##
## Tuning parameter 'nrounds' was held constant at a value of 50
## Tuning
## 'colsample_bytree' was held constant at a value of 0.2
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 13, eta
## = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 100
## and subsample = 0.8.
The final values used for the model were nrounds = 50, max_depth = 13, eta = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 100 and subsample = 0.8.
Now colsample_bytree:
parameters_xgb3 <- expand.grid(nrounds = 50,
max_depth = 13,
eta = c(0.25),
gamma = 1,
colsample_bytree = seq(0.1, 0.8, 0.1),
min_child_weight = 100,
subsample = 0.8)
set.seed(123456789)
matches.xgb3 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb3)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb3
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results across tuning parameters:
##
## colsample_bytree ROC Sens Spec
## 0.1 0.6390969 0.6170762 0.5757329
## 0.2 0.6397274 0.6090264 0.5929086
## 0.3 0.6400269 0.6004190 0.5896958
## 0.4 0.6392504 0.6014934 0.5998944
## 0.5 0.6399506 0.6057945 0.5929164
## 0.6 0.6391113 0.5993438 0.6020363
## 0.7 0.6368069 0.5934419 0.5891529
## 0.8 0.6368582 0.5961266 0.5945310
##
## Tuning parameter 'nrounds' was held constant at a value of 50
## Tuning
## 'min_child_weight' was held constant at a value of 100
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 13, eta
## = 0.25, gamma = 1, colsample_bytree = 0.3, min_child_weight = 100
## and subsample = 0.8.
The final values used for the model were nrounds = 50, max_depth = 13, eta = 0.25, gamma = 1, colsample_bytree = 0.3, min_child_weight = 100 and subsample = 0.8.
Now subsample:
parameters_xgb4 <- expand.grid(nrounds = 50,
max_depth = 13,
eta = c(0.25),
gamma = 1,
colsample_bytree = 0.3,
min_child_weight = 100,
subsample = c(0.6, 0.7, 0.75, 0.8, 0.85, 0.9))
set.seed(123456789)
matches.xgb4 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb4)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb4
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results across tuning parameters:
##
## subsample ROC Sens Spec
## 0.60 0.6377160 0.6111518 0.5810910
## 0.70 0.6356003 0.5902170 0.5988105
## 0.75 0.6409676 0.6057936 0.5886205
## 0.80 0.6409486 0.5880647 0.6084835
## 0.85 0.6399447 0.6117085 0.5939899
## 0.90 0.6408472 0.6084870 0.5929077
##
## Tuning parameter 'nrounds' was held constant at a value of 50
## Tuning
## 'colsample_bytree' was held constant at a value of 0.3
## Tuning
## parameter 'min_child_weight' was held constant at a value of 100
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 13, eta
## = 0.25, gamma = 1, colsample_bytree = 0.3, min_child_weight = 100
## and subsample = 0.75.
The final values used for the model were nrounds = 50, max_depth = 13, eta = 0.25, gamma = 1, colsample_bytree = 0.3, min_child_weight = 100 and subsample = 0.75.
Now we reduce the learning rate eta and use a higher number of rounds:
parameters_xgb5 <- expand.grid(nrounds = 100,
max_depth = 13,
eta = 0.12,
gamma = 1,
colsample_bytree = 0.3,
min_child_weight = 100,
subsample = 0.75)
set.seed(123456789)
matches.xgb5 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb5)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb5
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results:
##
## ROC Sens Spec
## 0.6416779 0.609551 0.5934558
##
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning
## parameter 'min_child_weight' was held constant at a value of 100
##
## Tuning parameter 'subsample' was held constant at a value of 0.75
The ROC value is slightly higher.
Once again – higher nrounds and lower eta:
parameters_xgb6 <- expand.grid(nrounds = 200,
max_depth = 13,
eta = 0.06,
gamma = 1,
colsample_bytree = 0.3,
min_child_weight = 100,
subsample = 0.75)
set.seed(123456789)
matches.xgb6 <- train(model1.formula,
data = matches.train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb6)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
matches.xgb6
## eXtreme Gradient Boosting
##
## 3724 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 2483, 2482, 2483
## Resampling results:
##
## ROC Sens Spec
## 0.64145 0.605256 0.5993619
##
## Tuning parameter 'nrounds' was held constant at a value of 200
## Tuning
## parameter 'min_child_weight' was held constant at a value of 100
##
## Tuning parameter 'subsample' was held constant at a value of 0.75
This time the ROC value is slightly lower.
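The models are compared with a helper function getAccuracyAndGini(), sourced from an external file that is not reproduced here. A minimal sketch consistent with its usage below (the implementation is assumed, inferred only from the arguments and the returned statistics) could be:
# Assumed sketch of the external helper getAccuracyAndGini()
getAccuracyAndGini <- function(model, data, target_variable, predicted_class) {
  pred_class <- predict(model, data, type = "raw") # predicted class labels
  pred_prob <- predict(model, data, type = "prob")[, predicted_class] # class probabilities
  actual <- as.factor(data[[target_variable]])
  cm <- caret::confusionMatrix(pred_class, actual, positive = predicted_class)
  roc <- pROC::roc(actual == predicted_class, pred_prob)
  c(Accuracy = unname(cm$overall["Accuracy"]),
    Sensitivity = unname(cm$byClass["Sensitivity"]),
    Specificity = unname(cm$byClass["Specificity"]),
    Gini = as.numeric(2 * pROC::auc(roc) - 1))
}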
source(here("getAccuracyAndGini.R"))
(models <- c("", "2":"6"))
## [1] "" "2" "3" "4" "5" "6"
sapply(paste0("matches.xgb", models),
function(x) getAccuracyAndGini(model = get(x),
data = matches.test,
target_variable = "home",
predicted_class = "yes")
)
## matches.xgb matches.xgb2 matches.xgb3 matches.xgb4 matches.xgb5
## Accuracy 0.5971178 0.5989975 0.6121554 0.5996241 0.6090226
## Sensitivity 0.5639098 0.5601504 0.5689223 0.5614035 0.5839599
## Specificity 0.6303258 0.6378446 0.6553885 0.6378446 0.6340852
## Gini 0.2632772 0.2724292 0.2663394 0.2714006 0.2723805
## matches.xgb6
## Accuracy 0.6058897
## Sensitivity 0.5814536
## Specificity 0.6303258
## Gini 0.2732144
For the test data the best results in terms of Gini coefficient are obtained with the last trained model – matches.xgb6.
ROC.xgb6.train <- pROC::roc(matches.train$home,
predict(matches.xgb6,
matches.train, type = "prob")[, "yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
ROC.xgb6.test <- pROC::roc(matches.test$home,
predict(matches.xgb6,
matches.test, type = "prob")[, "yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
list(
ROC.xgb6.train = ROC.xgb6.train,
ROC.xgb6.test = ROC.xgb6.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"xgb6 = ",
round(100 * (2 * auc(ROC.xgb6.train) - 1), 1), "%, ",
"Gini TEST: ",
"xgb6 = ",
round(100 * (2 * auc(ROC.xgb6.test) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
So far XGBoost gives the best Gini coefficient values of all the tested ML methods; however, it is still not better than the classical logistic regression benchmark.
A comparison of predicted and actual values can also be presented in terms of probability density functions. We see below that the predictions for ‘yes’ and ‘no’ – the answer to the question of being the match host – are not sharply separated. The ‘no’ distribution has most of its mass in the lower-probability region and the ‘yes’ distribution in the higher-probability region; however, large parts of the two distributions overlap.
pred.xgb6.test = predict(matches.xgb6,
matches.test, type = "raw")
confusionMatrix.xgb6.test = confusionMatrix(data = pred.xgb6.test, # predictions
# actual values
reference = as.factor(matches.test$home),
# definitions of the "success" label
positive = "yes")
tibble(
label = matches.train$home,
pred = predict(matches.xgb6, matches.train, reshape = TRUE, type = 'prob')[, 'yes']
) %>%
ggplot(aes(pred, fill = label)) +
geom_density(alpha = 0.4) +
labs(
title = "Predictions distribution vs. target variable",
subtitle = "train sample",
caption = "source: own calculations with the xgboost model"
)
The same case can be observed for the test sample:
tibble(
label = matches.test$home,
pred = predict(matches.xgb6, matches.test, reshape = TRUE, type = 'prob')[, 'yes']
) %>%
ggplot(aes(pred, fill = label)) +
geom_density(alpha = 0.4) +
labs(
title = "Predictions distribution vs. target variable",
subtitle = "test sample",
caption = "source: own calculations with the xgboost model"
)
The last utilized method is the neural network approach. Two implementations are considered: the neuralnet and keras packages.
Neural network algorithms require the data to be properly scaled. The data must also be transformed into numeric values rather than factors, as in the case of the dependent variable.
str(matches.train)
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "no","yes": 2 2 1 2 1 2 1 2 1 2 ...
matches.train.nn = matches.train
matches.test.nn = matches.test
str(matches.train.nn)
## tibble [3,724 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:3724] 0 0 0 0 0 1 1 1 2 0 ...
## $ GoalsFT : num [1:3724] 1 0 1 0 1 2 2 2 2 0 ...
## $ BallPos : num [1:3724] 0.6 0.51 0.49 0.63 0.37 0.58 0.42 0.37 0.63 0.47 ...
## $ ShotsOffTarget: num [1:3724] 5 7 3 4 4 5 2 5 5 10 ...
## $ ShotsOnTarget : num [1:3724] 5 6 4 2 1 5 2 3 3 4 ...
## $ BlockedShots : num [1:3724] 4 6 4 6 2 0 3 3 5 4 ...
## $ Corners : num [1:3724] 4 8 9 2 8 6 3 3 6 8 ...
## $ PassSuccPerc : num [1:3724] 0.86 0.77 0.76 0.84 0.68 0.8 0.75 0.77 0.84 0.83 ...
## $ AerialsWon : num [1:3724] 20 30 15 30 9 16 15 27 14 15 ...
## $ Fouls : num [1:3724] 14 10 10 14 9 18 9 16 10 12 ...
## $ YellowCards : num [1:3724] 2 1 2 0 3 3 1 1 1 2 ...
## $ RedCards : num [1:3724] 0 0 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "no","yes": 2 2 1 2 1 2 1 2 1 2 ...
matches.train.nn$home = as.numeric(matches.train.nn$home)-1
str(matches.test.nn)
## tibble [1,596 x 13] (S3: tbl_df/tbl/data.frame)
## $ GoalsHT : num [1:1596] 1 0 0 1 1 0 1 0 0 1 ...
## $ GoalsFT : num [1:1596] 2 1 0 1 1 1 3 0 0 3 ...
## $ BallPos : num [1:1596] 0.4 0.53 0.44 0.39 0.55 0.46 0.53 0.64 0.72 0.64 ...
## $ ShotsOffTarget: num [1:1596] 0 2 9 6 3 3 4 3 5 4 ...
## $ ShotsOnTarget : num [1:1596] 4 4 0 2 5 4 4 2 2 8 ...
## $ BlockedShots : num [1:1596] 1 4 3 1 2 2 1 2 9 6 ...
## $ Corners : num [1:1596] 0 5 3 4 2 6 6 6 11 9 ...
## $ PassSuccPerc : num [1:1596] 0.8 0.8 0.83 0.7 0.86 0.73 0.86 0.82 0.88 0.86 ...
## $ AerialsWon : num [1:1596] 10 12 14 27 19 27 7 15 13 15 ...
## $ Fouls : num [1:1596] 20 10 8 6 14 14 13 14 14 13 ...
## $ YellowCards : num [1:1596] 4 0 1 1 2 1 1 2 2 1 ...
## $ RedCards : num [1:1596] 0 1 0 0 0 0 0 0 0 0 ...
## $ home : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 1 2 1 ...
matches.test.nn$home = as.numeric(matches.test.nn$home) - 1 # "no" -> 0, "yes" -> 1
# rescale every column using the training-sample minima and maxima;
# the same training-based parameters are applied to the test sample
min <- apply(matches.train.nn, 2, min)
max <- apply(matches.train.nn, 2, max)
matches.train.nn <- scale(matches.train.nn, center = min, scale = max)
matches.test.nn <- scale(matches.test.nn, center = min, scale = max)
matches.train.nn = as.data.frame(matches.train.nn)
matches.test.nn = as.data.frame(matches.test.nn)
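A small technical remark: `scale(x, center = min, scale = max)` divides the centered columns by their maxima, i.e. it computes (x − min)/max rather than the textbook min–max transformation (x − min)/(max − min). Both put the inputs on comparable scales; for reference, a minimal sketch of the canonical version, with `df` standing for a hypothetical unscaled, all-numeric data frame:
# textbook min-max normalization maps each training column onto [0, 1]:
#   x' = (x - min) / (max - min)
# a reference sketch only; the results below use the (x - min) / max variant
mins   <- apply(df, 2, min)
ranges <- apply(df, 2, max) - mins
df.minmax <- as.data.frame(scale(df, center = mins, scale = ranges))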
The first network contains 1 hidden layer with 10 neurons.
library(neuralnet)
set.seed(12345678)
matches.nn = neuralnet(model1.formula,
data = as.data.frame(matches.train.nn),
hidden = c(10), # number of neurons in hidden layers
linear.output = FALSE, # T for regression, F for classification
learningrate.limit = NULL,
learningrate.factor = list(minus = 0.5, plus = 1.2),
algorithm = "rprop+",
threshold = 0.01,
stepmax = 200000,
lifesign = 'full',
rep = 1)
## hidden: 10 thresh: 0.01 rep: 1/1 steps: 1000 min thresh: 0.285513111171468
## ... (intermediate progress lines omitted; min thresh falls steadily) ...
## 47000 min thresh: 0.0103352768924121
## 47023 error: 384.08076 time: 1.56 mins
matches.nn$result.matrix
## [,1]
## error 3.840808e+02
## reached.threshold 9.729959e-03
## steps 4.702300e+04
## Intercept.to.1layhid1 6.458586e+01
## GoalsHT.to.1layhid1 4.991633e+01
## GoalsFT.to.1layhid1 -6.020379e+01
## BallPos.to.1layhid1 9.043190e+01
## ShotsOffTarget.to.1layhid1 1.184209e+01
## ShotsOnTarget.to.1layhid1 1.146165e+01
## BlockedShots.to.1layhid1 4.472901e+01
## Corners.to.1layhid1 -6.528187e+01
## PassSuccPerc.to.1layhid1 -1.293327e+02
## AerialsWon.to.1layhid1 -5.262872e+01
## Fouls.to.1layhid1 -9.146615e+01
## YellowCards.to.1layhid1 1.189128e+02
## RedCards.to.1layhid1 -1.655122e+01
## Intercept.to.1layhid2 -6.678545e+00
## GoalsHT.to.1layhid2 2.263172e+02
## GoalsFT.to.1layhid2 -7.845360e+02
## BallPos.to.1layhid2 -1.423712e+01
## ShotsOffTarget.to.1layhid2 6.132896e+01
## ShotsOnTarget.to.1layhid2 2.968429e+02
## BlockedShots.to.1layhid2 -5.476262e+02
## Corners.to.1layhid2 1.328246e+01
## PassSuccPerc.to.1layhid2 -3.304635e+02
## AerialsWon.to.1layhid2 -7.592418e+01
## Fouls.to.1layhid2 2.120929e+02
## YellowCards.to.1layhid2 -3.123988e+02
## RedCards.to.1layhid2 1.225317e+00
## Intercept.to.1layhid3 8.480925e+01
## GoalsHT.to.1layhid3 -9.710812e+02
## GoalsFT.to.1layhid3 6.696841e+00
## BallPos.to.1layhid3 -1.447113e+02
## ShotsOffTarget.to.1layhid3 -1.258127e+02
## ShotsOnTarget.to.1layhid3 8.177245e+01
## BlockedShots.to.1layhid3 -5.703170e+02
## Corners.to.1layhid3 2.974772e+02
## PassSuccPerc.to.1layhid3 2.113029e+01
## AerialsWon.to.1layhid3 -2.978337e+02
## Fouls.to.1layhid3 -1.240649e+02
## YellowCards.to.1layhid3 4.268320e+01
## RedCards.to.1layhid3 2.869099e+02
## Intercept.to.1layhid4 -2.526956e+01
## GoalsHT.to.1layhid4 -7.189838e+02
## GoalsFT.to.1layhid4 1.140839e+01
## BallPos.to.1layhid4 2.046329e+02
## ShotsOffTarget.to.1layhid4 3.353699e+02
## ShotsOnTarget.to.1layhid4 4.884052e+01
## BlockedShots.to.1layhid4 -3.264203e+02
## Corners.to.1layhid4 1.875049e+02
## PassSuccPerc.to.1layhid4 3.764271e+01
## AerialsWon.to.1layhid4 1.427801e+02
## Fouls.to.1layhid4 -9.190623e+01
## YellowCards.to.1layhid4 -2.984296e+02
## RedCards.to.1layhid4 -5.654870e+01
## Intercept.to.1layhid5 6.546110e+01
## GoalsHT.to.1layhid5 4.445189e+02
## GoalsFT.to.1layhid5 1.395439e+02
## BallPos.to.1layhid5 -1.294861e+01
## ShotsOffTarget.to.1layhid5 -1.029155e+02
## ShotsOnTarget.to.1layhid5 1.356225e+00
## BlockedShots.to.1layhid5 -1.708471e+02
## Corners.to.1layhid5 1.982213e+02
## PassSuccPerc.to.1layhid5 2.528641e+02
## AerialsWon.to.1layhid5 -2.754776e+02
## Fouls.to.1layhid5 -4.098509e+01
## YellowCards.to.1layhid5 9.375047e+01
## RedCards.to.1layhid5 -1.303938e+02
## Intercept.to.1layhid6 5.978276e+00
## GoalsHT.to.1layhid6 -9.942881e+00
## GoalsFT.to.1layhid6 1.916425e+00
## BallPos.to.1layhid6 3.548889e-01
## ShotsOffTarget.to.1layhid6 -1.287723e+00
## ShotsOnTarget.to.1layhid6 5.520516e+00
## BlockedShots.to.1layhid6 -9.598037e+00
## Corners.to.1layhid6 -7.998171e+00
## PassSuccPerc.to.1layhid6 1.710620e+00
## AerialsWon.to.1layhid6 1.627903e-01
## Fouls.to.1layhid6 -8.670247e+00
## YellowCards.to.1layhid6 1.737480e+00
## RedCards.to.1layhid6 -8.048826e-01
## Intercept.to.1layhid7 5.043153e+01
## GoalsHT.to.1layhid7 1.728859e+01
## GoalsFT.to.1layhid7 2.779677e+02
## BallPos.to.1layhid7 -5.165693e+02
## ShotsOffTarget.to.1layhid7 2.205352e+02
## ShotsOnTarget.to.1layhid7 7.241200e+01
## BlockedShots.to.1layhid7 -5.592941e+01
## Corners.to.1layhid7 3.535524e+01
## PassSuccPerc.to.1layhid7 9.388156e+01
## AerialsWon.to.1layhid7 -1.484819e+02
## Fouls.to.1layhid7 1.477469e+02
## YellowCards.to.1layhid7 -2.616297e+02
## RedCards.to.1layhid7 4.133430e+01
## Intercept.to.1layhid8 8.879912e+01
## GoalsHT.to.1layhid8 -3.321490e+01
## GoalsFT.to.1layhid8 -2.418978e+01
## BallPos.to.1layhid8 -8.322425e+01
## ShotsOffTarget.to.1layhid8 -8.661314e+01
## ShotsOnTarget.to.1layhid8 -3.653862e+02
## BlockedShots.to.1layhid8 -3.842196e+01
## Corners.to.1layhid8 -9.108920e+01
## PassSuccPerc.to.1layhid8 1.364117e+02
## AerialsWon.to.1layhid8 4.068274e+02
## Fouls.to.1layhid8 3.341251e+02
## YellowCards.to.1layhid8 -1.928375e+02
## RedCards.to.1layhid8 -1.869374e+00
## Intercept.to.1layhid9 8.342168e+00
## GoalsHT.to.1layhid9 1.748306e+01
## GoalsFT.to.1layhid9 -1.855226e+01
## BallPos.to.1layhid9 2.165517e+01
## ShotsOffTarget.to.1layhid9 -3.549926e+00
## ShotsOnTarget.to.1layhid9 1.040237e+01
## BlockedShots.to.1layhid9 2.676200e+00
## Corners.to.1layhid9 -8.214676e+00
## PassSuccPerc.to.1layhid9 -1.235792e+01
## AerialsWon.to.1layhid9 -3.037610e+00
## Fouls.to.1layhid9 -2.474463e+01
## YellowCards.to.1layhid9 2.042643e+01
## RedCards.to.1layhid9 1.426404e+00
## Intercept.to.1layhid10 9.603258e+01
## GoalsHT.to.1layhid10 1.559129e+02
## GoalsFT.to.1layhid10 1.231473e+03
## BallPos.to.1layhid10 -2.197492e+02
## ShotsOffTarget.to.1layhid10 5.524626e+01
## ShotsOnTarget.to.1layhid10 2.376378e+02
## BlockedShots.to.1layhid10 -2.601905e+01
## Corners.to.1layhid10 2.060538e+02
## PassSuccPerc.to.1layhid10 1.287520e+02
## AerialsWon.to.1layhid10 -2.857632e+02
## Fouls.to.1layhid10 -9.967303e+01
## YellowCards.to.1layhid10 9.754119e+01
## RedCards.to.1layhid10 1.245566e+01
## Intercept.to.home -5.396550e-01
## 1layhid1.to.home -1.948927e+00
## 1layhid2.to.home -2.554309e+00
## 1layhid3.to.home -2.065735e+00
## 1layhid4.to.home 9.011349e-01
## 1layhid5.to.home -1.879732e+00
## 1layhid6.to.home -1.205966e+00
## 1layhid7.to.home 9.453114e-01
## 1layhid8.to.home -1.146854e+00
## 1layhid9.to.home 2.536890e+00
## 1layhid10.to.home 2.778417e+00
plot(matches.nn, rep = "best", arrow.length = 0.15, information = T)
# note: the network was trained on the scaled data (matches.train.nn);
# passing the unscaled matches.train here feeds the covariates on a
# different scale, which likely degrades the reported performance
pred.nn.train <- predict(matches.nn,
                         matches.train,
                         type = "response")
nn.train.confMat = confusionMatrix(data = as.factor(ifelse(pred.nn.train > 0.5, 1, 0)), # predictions
                                   # actual values
                                   reference = as.factor(matches.train.nn$home),
                                   # definition of the "success" label
                                   positive = "1")
nn.train.confMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1261 1044
## 1 601 818
##
## Accuracy : 0.5583
## 95% CI : (0.5421, 0.5743)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 6.087e-13
##
## Kappa : 0.1165
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4393
## Specificity : 0.6772
## Pos Pred Value : 0.5765
## Neg Pred Value : 0.5471
## Prevalence : 0.5000
## Detection Rate : 0.2197
## Detection Prevalence : 0.3810
## Balanced Accuracy : 0.5583
##
## 'Positive' Class : 1
##
# same caveat: matches.test holds the unscaled covariates
pred.nn.test <- predict(matches.nn,
                        matches.test,
                        type = "response")
confusionMatrix.nn.test = confusionMatrix(data = as.factor(ifelse(pred.nn.test > 0.5, 1, 0)), # predictions
                                          # actual values
                                          reference = as.factor(matches.test.nn$home),
                                          # definition of the "success" label
                                          positive = "1")
confusionMatrix.nn.test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 544 445
## 1 254 353
##
## Accuracy : 0.562
## 95% CI : (0.5373, 0.5866)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 3.963e-07
##
## Kappa : 0.1241
##
## Mcnemar's Test P-Value : 6.649e-13
##
## Sensitivity : 0.4424
## Specificity : 0.6817
## Pos Pred Value : 0.5815
## Neg Pred Value : 0.5501
## Prevalence : 0.5000
## Detection Rate : 0.2212
## Detection Prevalence : 0.3803
## Balanced Accuracy : 0.5620
##
## 'Positive' Class : 1
##
The obtained accuracy is 0.558 for the training set and 0.562 for the test set – slightly higher on test data; for the test sample this is (544 + 353) / 1596 ≈ 0.562.
Let's check a bigger neural network.
Note: the threshold – neuralnet's stopping criterion on the partial derivatives of the error function – had to be raised from 0.01 to 0.1, because convergence could not be achieved otherwise.
set.seed(12345678)
matches.nn.bigger = neuralnet(model1.formula,
data = as.data.frame(matches.train.nn),
hidden = c(10, 5, 3), # number of neurons in hidden layers
linear.output = FALSE, # T for regression, F for classification
learningrate.limit = NULL,
learningrate.factor = list(minus = 0.5, plus = 1.2),
algorithm = "rprop+",
threshold = 0.1,
stepmax = 200000,
lifesign = 'full',
rep = 1)
## hidden: 10, 5, 3 thresh: 0.1 rep: 1/1 steps: 1000 min thresh: 0.258841055656181
## ... (intermediate progress lines omitted; min thresh plateaus at 0.1734 for roughly 40,000 steps before falling) ...
## 83000 min thresh: 0.101844255864522
## 83447 error: 368.78614 time: 4.04 mins
matches.nn.bigger$result.matrix
## [,1]
## error 3.687861e+02
## reached.threshold 9.953540e-02
## steps 8.344700e+04
## Intercept.to.1layhid1 -7.040357e+00
## GoalsHT.to.1layhid1 -6.963351e+01
## GoalsFT.to.1layhid1 2.325536e+00
## BallPos.to.1layhid1 -6.860489e+01
## ShotsOffTarget.to.1layhid1 1.576957e+01
## ShotsOnTarget.to.1layhid1 -5.792615e+01
## BlockedShots.to.1layhid1 -1.476103e+01
## Corners.to.1layhid1 2.706482e+01
## PassSuccPerc.to.1layhid1 8.196845e+00
## AerialsWon.to.1layhid1 3.732609e+02
## Fouls.to.1layhid1 -2.471538e+01
## YellowCards.to.1layhid1 1.350629e+02
## RedCards.to.1layhid1 1.503123e+01
## Intercept.to.1layhid2 -1.718273e+00
## GoalsHT.to.1layhid2 2.380330e-01
## GoalsFT.to.1layhid2 1.305701e+01
## BallPos.to.1layhid2 -3.207011e+00
## ShotsOffTarget.to.1layhid2 9.811692e+00
## ShotsOnTarget.to.1layhid2 4.050073e+00
## BlockedShots.to.1layhid2 -1.245726e+01
## Corners.to.1layhid2 4.833888e-01
## PassSuccPerc.to.1layhid2 -1.064574e+00
## AerialsWon.to.1layhid2 4.370854e+00
## Fouls.to.1layhid2 -9.626796e+00
## YellowCards.to.1layhid2 -7.587365e+00
## RedCards.to.1layhid2 8.543156e+00
## Intercept.to.1layhid3 -7.383724e+00
## GoalsHT.to.1layhid3 -2.134369e+01
## GoalsFT.to.1layhid3 3.124885e+01
## BallPos.to.1layhid3 -6.214812e+00
## ShotsOffTarget.to.1layhid3 2.762798e+01
## ShotsOnTarget.to.1layhid3 -9.615544e-01
## BlockedShots.to.1layhid3 -1.168879e+01
## Corners.to.1layhid3 -9.688361e+00
## PassSuccPerc.to.1layhid3 -9.267291e+00
## AerialsWon.to.1layhid3 1.957771e+01
## Fouls.to.1layhid3 1.227819e+00
## YellowCards.to.1layhid3 7.131152e+01
## RedCards.to.1layhid3 -8.885766e-01
## Intercept.to.1layhid4 1.492267e+00
## GoalsHT.to.1layhid4 -1.125044e-01
## GoalsFT.to.1layhid4 -3.991955e+00
## BallPos.to.1layhid4 2.878284e-01
## ShotsOffTarget.to.1layhid4 -2.737531e+00
## ShotsOnTarget.to.1layhid4 -9.853406e+00
## BlockedShots.to.1layhid4 -1.018777e+01
## Corners.to.1layhid4 5.830624e-01
## PassSuccPerc.to.1layhid4 1.770311e+00
## AerialsWon.to.1layhid4 5.256519e-01
## Fouls.to.1layhid4 2.259209e+00
## YellowCards.to.1layhid4 -1.608331e+00
## RedCards.to.1layhid4 1.187943e+00
## Intercept.to.1layhid5 3.495855e-01
## GoalsHT.to.1layhid5 -1.370682e+01
## GoalsFT.to.1layhid5 -1.538569e+00
## BallPos.to.1layhid5 -4.472841e+00
## ShotsOffTarget.to.1layhid5 -2.152538e+00
## ShotsOnTarget.to.1layhid5 9.965417e-01
## BlockedShots.to.1layhid5 1.204617e+00
## Corners.to.1layhid5 -3.935971e+00
## PassSuccPerc.to.1layhid5 5.866326e-01
## AerialsWon.to.1layhid5 6.040828e+00
## Fouls.to.1layhid5 1.078237e+01
## YellowCards.to.1layhid5 -1.243631e+01
## RedCards.to.1layhid5 -4.771623e+00
## Intercept.to.1layhid6 1.412137e+00
## GoalsHT.to.1layhid6 -5.605170e+00
## GoalsFT.to.1layhid6 1.435732e+00
## BallPos.to.1layhid6 5.326573e+00
## ShotsOffTarget.to.1layhid6 1.035111e+00
## ShotsOnTarget.to.1layhid6 -5.951718e-01
## BlockedShots.to.1layhid6 -1.034811e+01
## Corners.to.1layhid6 -9.892569e+00
## PassSuccPerc.to.1layhid6 -1.922237e+00
## AerialsWon.to.1layhid6 3.907944e+00
## Fouls.to.1layhid6 -5.408726e-01
## YellowCards.to.1layhid6 6.273727e+00
## RedCards.to.1layhid6 3.955135e+01
## Intercept.to.1layhid7 7.376309e-01
## GoalsHT.to.1layhid7 -1.170902e+01
## GoalsFT.to.1layhid7 -3.268167e+00
## BallPos.to.1layhid7 -7.383349e-01
## ShotsOffTarget.to.1layhid7 1.110020e+00
## ShotsOnTarget.to.1layhid7 5.281076e+00
## BlockedShots.to.1layhid7 -6.003796e+00
## Corners.to.1layhid7 -3.137632e+00
## PassSuccPerc.to.1layhid7 2.333979e+00
## AerialsWon.to.1layhid7 -2.003091e+00
## Fouls.to.1layhid7 1.035093e+00
## YellowCards.to.1layhid7 -8.841806e+00
## RedCards.to.1layhid7 -1.255774e+00
## Intercept.to.1layhid8 -4.900037e-01
## GoalsHT.to.1layhid8 3.783375e-01
## GoalsFT.to.1layhid8 1.193727e+00
## BallPos.to.1layhid8 4.331776e+00
## ShotsOffTarget.to.1layhid8 3.349571e-01
## ShotsOnTarget.to.1layhid8 -2.043312e+00
## BlockedShots.to.1layhid8 -1.311085e+00
## Corners.to.1layhid8 -1.197097e+00
## PassSuccPerc.to.1layhid8 -4.886832e-01
## AerialsWon.to.1layhid8 -8.516111e+00
## Fouls.to.1layhid8 -1.947599e+00
## YellowCards.to.1layhid8 -1.890971e+00
## RedCards.to.1layhid8 1.227204e+00
## Intercept.to.1layhid9 -6.237794e+00
## GoalsHT.to.1layhid9 2.796424e+01
## GoalsFT.to.1layhid9 3.057003e+01
## BallPos.to.1layhid9 -3.989111e+00
## ShotsOffTarget.to.1layhid9 3.780770e+01
## ShotsOnTarget.to.1layhid9 -2.902337e+01
## BlockedShots.to.1layhid9 2.402594e+01
## Corners.to.1layhid9 3.644041e+00
## PassSuccPerc.to.1layhid9 1.383473e+00
## AerialsWon.to.1layhid9 1.467286e+01
## Fouls.to.1layhid9 6.093975e+00
## YellowCards.to.1layhid9 -1.492150e+01
## RedCards.to.1layhid9 -1.495188e+01
## Intercept.to.1layhid10 7.781067e-01
## GoalsHT.to.1layhid10 2.224023e+00
## GoalsFT.to.1layhid10 5.646698e+00
## BallPos.to.1layhid10 -5.813911e-01
## ShotsOffTarget.to.1layhid10 -3.594223e-01
## ShotsOnTarget.to.1layhid10 1.799787e-01
## BlockedShots.to.1layhid10 -5.041433e+00
## Corners.to.1layhid10 -1.808461e+00
## PassSuccPerc.to.1layhid10 -1.353021e+00
## AerialsWon.to.1layhid10 1.455983e+00
## Fouls.to.1layhid10 -4.087517e+00
## YellowCards.to.1layhid10 5.003128e+00
## RedCards.to.1layhid10 8.952499e-01
## Intercept.to.2layhid1 -1.343696e+00
## 1layhid1.to.2layhid1 -4.668598e+00
## 1layhid2.to.2layhid1 -1.000518e+02
## 1layhid3.to.2layhid1 -3.469063e+01
## 1layhid4.to.2layhid1 2.591603e+02
## 1layhid5.to.2layhid1 3.629753e+01
## 1layhid6.to.2layhid1 -3.549011e+01
## 1layhid7.to.2layhid1 -6.470533e+01
## 1layhid8.to.2layhid1 2.127251e+02
## 1layhid9.to.2layhid1 1.433246e+00
## 1layhid10.to.2layhid1 3.464958e+02
## Intercept.to.2layhid2 2.325881e+01
## 1layhid1.to.2layhid2 7.278032e+01
## 1layhid2.to.2layhid2 8.895268e+01
## 1layhid3.to.2layhid2 4.490044e+01
## 1layhid4.to.2layhid2 -4.755002e+02
## 1layhid5.to.2layhid2 1.943281e-01
## 1layhid6.to.2layhid2 3.215772e+01
## 1layhid7.to.2layhid2 -7.916807e+02
## 1layhid8.to.2layhid2 3.216913e+02
## 1layhid9.to.2layhid2 2.371741e+02
## 1layhid10.to.2layhid2 -1.667384e+02
## Intercept.to.2layhid3 -1.231244e+00
## 1layhid1.to.2layhid3 1.695450e-01
## 1layhid2.to.2layhid3 2.599956e+00
## 1layhid3.to.2layhid3 -6.177247e+00
## 1layhid4.to.2layhid3 -2.047378e+00
## 1layhid5.to.2layhid3 3.948157e+00
## 1layhid6.to.2layhid3 3.691152e+00
## 1layhid7.to.2layhid3 -3.349184e+01
## 1layhid8.to.2layhid3 6.717702e+00
## 1layhid9.to.2layhid3 -2.767796e+00
## 1layhid10.to.2layhid3 5.228598e+00
## Intercept.to.2layhid4 -4.819153e-01
## 1layhid1.to.2layhid4 1.411761e+00
## 1layhid2.to.2layhid4 -1.990714e+00
## 1layhid3.to.2layhid4 6.502826e-01
## 1layhid4.to.2layhid4 -2.550957e+00
## 1layhid5.to.2layhid4 4.441440e-02
## 1layhid6.to.2layhid4 -8.066907e-01
## 1layhid7.to.2layhid4 1.106835e+01
## 1layhid8.to.2layhid4 1.769043e+00
## 1layhid9.to.2layhid4 1.519993e-01
## 1layhid10.to.2layhid4 8.674581e-01
## Intercept.to.2layhid5 1.901063e+00
## 1layhid1.to.2layhid5 -8.410928e-01
## 1layhid2.to.2layhid5 6.400484e-01
## 1layhid3.to.2layhid5 -2.136497e+00
## 1layhid4.to.2layhid5 -2.571489e+00
## 1layhid5.to.2layhid5 -4.516643e-02
## 1layhid6.to.2layhid5 -9.829948e-01
## 1layhid7.to.2layhid5 1.067412e+01
## 1layhid8.to.2layhid5 -2.612776e+00
## 1layhid9.to.2layhid5 7.231247e-01
## 1layhid10.to.2layhid5 2.853550e+00
## Intercept.to.3layhid1 2.158957e+00
## 2layhid1.to.3layhid1 -4.442133e+00
## 2layhid2.to.3layhid1 -2.839498e-01
## 2layhid3.to.3layhid1 3.229194e-01
## 2layhid4.to.3layhid1 4.696510e-01
## 2layhid5.to.3layhid1 -4.038692e-01
## Intercept.to.3layhid2 1.489810e+01
## 2layhid1.to.3layhid2 1.163838e+01
## 2layhid2.to.3layhid2 -7.459596e+00
## 2layhid3.to.3layhid2 -6.775711e+03
## 2layhid4.to.3layhid2 -9.338674e+01
## 2layhid5.to.3layhid2 4.379107e+02
## Intercept.to.3layhid3 3.198636e+00
## 2layhid1.to.3layhid3 -7.765468e+00
## 2layhid2.to.3layhid3 1.406738e+00
## 2layhid3.to.3layhid3 -2.478482e+00
## 2layhid4.to.3layhid3 -7.652450e+00
## 2layhid5.to.3layhid3 7.378313e-01
## Intercept.to.home 3.347748e+00
## 3layhid1.to.home -5.611594e+01
## 3layhid2.to.home 1.431989e+00
## 3layhid3.to.home 2.027931e+03
plot(matches.nn.bigger, rep = "best", arrow.length = 0.15, information = T)
pred.nn.bigger.train <- predict(matches.nn.bigger,
matches.train,
type = "response")
nn.bigger.train.confMat = confusionMatrix(data = as.factor(ifelse(pred.nn.bigger.train > 0.5, 1, 0)), # predictions
# actual values
reference = as.factor(matches.train.nn$home),
# definitions of the "success" label
positive = "1")
nn.bigger.train.confMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 210 197
## 1 1652 1665
##
## Accuracy : 0.5035
## 95% CI : (0.4873, 0.5197)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.341
##
## Kappa : 0.007
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8942
## Specificity : 0.1128
## Pos Pred Value : 0.5020
## Neg Pred Value : 0.5160
## Prevalence : 0.5000
## Detection Rate : 0.4471
## Detection Prevalence : 0.8907
## Balanced Accuracy : 0.5035
##
## 'Positive' Class : 1
##
pred.nn.bigger.test <- predict(matches.nn.bigger,
matches.test,
type = "response")
nn.bigger.test.confMat = confusionMatrix(data = as.factor(ifelse(pred.nn.bigger.test > 0.5, 1, 0)), # predictions
# actual values
reference = as.factor(matches.test.nn$home),
# definitions of the "success" label
positive = "1")
nn.bigger.test.confMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 106 87
## 1 692 711
##
## Accuracy : 0.5119
## 95% CI : (0.4871, 0.5367)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.1772
##
## Kappa : 0.0238
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8910
## Specificity : 0.1328
## Pos Pred Value : 0.5068
## Neg Pred Value : 0.5492
## Prevalence : 0.5000
## Detection Rate : 0.4455
## Detection Prevalence : 0.8791
## Balanced Accuracy : 0.5119
##
## 'Positive' Class : 1
##
Raising the threshold and enlarging the network resulted in worse accuracy (about 0.50 on the training set and 0.51 on the test set) than for the simpler neural network; the bigger model also degenerates towards predicting the 'home' class for almost every observation (detection prevalence of about 0.88–0.89).
ROC.nn.train <-
pROC::roc(as.numeric(matches.train$home == "yes"),
predict(matches.nn, matches.train.nn[, -13]))
## Setting levels: control = 0, case = 1
## Warning in roc.default(as.numeric(matches.train$home == "yes"),
## predict(matches.nn, : Deprecated use a matrix as predictor. Unexpected results
## may be produced, please pass a numeric vector.
## Setting direction: controls < cases
ROC.nn.test <-
pROC::roc(as.numeric(matches.test$home == "yes"),
predict(matches.nn, matches.test.nn[, -13]))
## Setting levels: control = 0, case = 1
## Warning in roc.default(as.numeric(matches.test$home == "yes"),
## predict(matches.nn, : Deprecated use a matrix as predictor. Unexpected results
## may be produced, please pass a numeric vector.
## Setting direction: controls < cases
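The deprecation warnings appear because `predict()` on a neuralnet object returns a one-column matrix, while pROC expects a plain numeric vector. A sketch of the warning-free call (the same ROC curve, just a flattened predictor):
# flattening the one-column prediction matrix avoids the pROC deprecation
# warning; the resulting ROC object is identical
ROC.nn.test <- pROC::roc(as.numeric(matches.test$home == "yes"),
                         as.numeric(predict(matches.nn, matches.test.nn[, -13])))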
list(
ROC.nn.train = ROC.nn.train,
ROC.nn.test = ROC.nn.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"nn = ",
round(100*(2 * auc(ROC.nn.train) - 1), 1), "%, ",
"Gini TEST: ",
"nn = ",
round(100*(2 * auc(ROC.nn.test) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
The Gini index values obtained for the test data set are much worse than for the previously presented methods.
The last model is a neural network built with the keras package.
The network consists of three dense layers with 256, 128, and 1 neurons, respectively.
library(keras)
library(tensorflow)
##
## Attaching package: 'tensorflow'
## The following object is masked from 'package:caret':
##
## train
model <- keras_model_sequential()
## Loaded Tensorflow version 2.8.0
model %>%
layer_dense(units = 256, activation = "sigmoid",
input_shape = 12) %>%
layer_dense(units = 128, activation = "sigmoid") %>%
layer_dense(units = 1, activation = "sigmoid")
model %>%
compile(
optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = tf$keras$metrics$AUC()
# metric = "AUC"
)
model %>%
fit(as.matrix(matches.train.nn[, -13]),
matches.train.nn[, 13],
epochs = 500,
batch_size = 1024,
verbose = 0)
result <- evaluate(model, as.matrix(matches.test.nn[, -13]), matches.test.nn[, 13])
result
## loss auc
## 0.6616802 0.6400108
ROC.keras.train <-
pROC::roc(matches.train.nn[, 13] == 1,
predict(model, as.matrix(matches.train.nn[, -13])))
## Setting levels: control = FALSE, case = TRUE
## Warning in roc.default(matches.train.nn[, 13] == 1, predict(model,
## as.matrix(matches.train.nn[, : Deprecated use a matrix as predictor. Unexpected
## results may be produced, please pass a numeric vector.
## Setting direction: controls < cases
ROC.keras.test <-
pROC::roc(as.numeric(matches.test.nn[, 13] == 1),
model %>% predict(as.matrix(matches.test.nn[, -13])) %>% as.numeric())
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
matches.pred.train.keras <- predict(model, as.matrix(matches.train.nn[, -13]))
matches.pred.test.keras <- predict(model, as.matrix(matches.test.nn[, -13]))
# note: no `positive =` argument is given here, so caret takes the first
# factor level ("0") as the positive class - see the output below
(confusionMatrix.train.keras <-
   confusionMatrix(as.numeric((matches.pred.train.keras > 0.5)) %>% as.factor(),
                   as.factor(ifelse(matches.train$home == "yes", 1, 0))))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1248 839
## 1 614 1023
##
## Accuracy : 0.6098
## 95% CI : (0.5939, 0.6255)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2197
##
## Mcnemar's Test P-Value : 4.192e-09
##
## Sensitivity : 0.6702
## Specificity : 0.5494
## Pos Pred Value : 0.5980
## Neg Pred Value : 0.6249
## Prevalence : 0.5000
## Detection Rate : 0.3351
## Detection Prevalence : 0.5604
## Balanced Accuracy : 0.6098
##
## 'Positive' Class : 0
##
(confusionMatrix.keras.test <-
confusionMatrix(as.numeric((matches.pred.test.keras > 0.5)) %>% as.factor(),
as.factor(ifelse(matches.test$home == "yes", 1, 0))) )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 530 369
## 1 268 429
##
## Accuracy : 0.6009
## 95% CI : (0.5764, 0.625)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 3.749e-16
##
## Kappa : 0.2018
##
## Mcnemar's Test P-Value : 7.428e-05
##
## Sensitivity : 0.6642
## Specificity : 0.5376
## Pos Pred Value : 0.5895
## Neg Pred Value : 0.6155
## Prevalence : 0.5000
## Detection Rate : 0.3321
## Detection Prevalence : 0.5633
## Balanced Accuracy : 0.6009
##
## 'Positive' Class : 0
##
# confusionMatrix.keras$overall[['Accuracy']]
Accuracy is 0.610 for the train sample and 0.601 for the test sample. Note that in these two confusion matrices caret took 0 as the positive class, so the reported sensitivity and specificity refer to the 'away' label; accuracy itself is unaffected.
list(
ROC.keras.train = ROC.keras.train,
ROC.keras.test = ROC.keras.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"keras = ",
round(100*(2 * auc(ROC.keras.train) - 1), 1), "%, ",
"Gini TEST: ",
"keras = ",
round(100*(2 * auc(ROC.keras.test) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
The obtained Gini coefficient – 28.1% for the test data – is clearly better than for the neuralnet models above.
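As a quick sanity check, the Gini value follows directly from the test AUC reported by `evaluate()` above via Gini = 2 · AUC − 1:
# Gini = 2 * AUC - 1, using the keras test AUC from evaluate() above
2 * 0.6400108 - 1   # = 0.2800216, i.e. roughly the 28% quoted for the test set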
list(
ROC.logit.train = ROC.logit.train,
ROC.logit.test = ROC.logit.test,
ROC.tree.pruned.train = ROC.tree.pruned.train,
ROC.tree.pruned.test = ROC.tree.pruned.test,
ROC.rf2.train = ROC.rf2.train,
ROC.rf2.test = ROC.rf2.test,
ROC.xgb6.train = ROC.xgb6.train,
ROC.xgb6.test = ROC.xgb6.test,
ROC.nn.train = ROC.nn.train,
ROC.nn.test = ROC.nn.test,
ROC.keras.train = ROC.keras.train,
ROC.keras.test = ROC.keras.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"logit = ",
round(100*(2 * auc(ROC.logit.train) - 1), 1), "%, ",
"tree = ",
round(100*(2 * auc(ROC.tree.pruned.train) - 1), 1), "%, ",
"RF = ",
round(100*(2 * auc(ROC.rf2.train) - 1), 1), "%, ",
"XGB = ",
round(100*(2 * auc(ROC.xgb6.train) - 1), 1), "%, ",
"nn = ",
round(100*(2 * auc(ROC.nn.train) - 1), 1), "%, ",
"keras = ",
round(100*(2 * auc(ROC.keras.train) - 1), 1), "%, ",
"\nGini TEST: ",
"logit = ",
round(100*(2 * auc(ROC.logit.test) - 1), 1), "%, ",
"tree = ",
round(100*(2 * auc(ROC.tree.pruned.test) - 1), 1), "%, ",
"RF = ",
round(100*(2 * auc(ROC.rf2.test) - 1), 1), "%, ",
"XGB = ",
round(100*(2 * auc(ROC.xgb6.test) - 1), 1), "%, ",
"nn = ",
round(100*(2 * auc(ROC.nn.test) - 1), 1), "% ",
"keras = ",
round(100*(2 * auc(ROC.keras.test) - 1), 1), "% "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
# round(100*(2 * auc(ROC.logit.train) - 1), 1)
# round(100*(2 * auc(ROC.tree.pruned.train) - 1), 1)
# round(100*(2 * auc(ROC.rf2.train) - 1), 1)
# round(100*(2 * auc(ROC.xgb6.train) - 1), 1)
# round(100*(2 * auc(ROC.nn.train) - 1), 1)
# round(100*(2 * auc(ROC.keras.train) - 1), 1)
# round(100*(2 * auc(ROC.logit.test) - 1), 1)
# round(100*(2 * auc(ROC.tree.pruned.test) - 1), 1)
# round(100*(2 * auc(ROC.rf2.test) - 1), 1)
# round(100*(2 * auc(ROC.xgb6.test) - 1), 1)
# round(100*(2 * auc(ROC.nn.test) - 1), 1)
# round(100*(2 * auc(ROC.keras.test) - 1), 1)
Several ML methods and a classical econometric method (logistic regression) were used to predict whether a team played its match as the host or the guest. As the graph above shows, the best classification results in terms of the Gini coefficient were obtained with the logistic regression; second in line is the neural network built with the keras package.
rNames = c('logit', 'pruned tree', 'RF', 'XGB', 'nn', 'keras')
accuracies = c(confusionMatrix.logit.test$overall[['Accuracy']],
confusionMatrix.tree.pruned.test$overall[['Accuracy']],
confusionMatrix.rf2.test$overall[['Accuracy']],
confusionMatrix.xgb6.test$overall[['Accuracy']],
confusionMatrix.nn.test$overall[['Accuracy']],
confusionMatrix.keras.test$overall[['Accuracy']])
sensitivities = c(confusionMatrix.logit.test$byClass[['Sensitivity']],
confusionMatrix.tree.pruned.test$byClass[['Sensitivity']],
confusionMatrix.rf2.test$byClass[['Sensitivity']],
confusionMatrix.xgb6.test$byClass[['Sensitivity']],
confusionMatrix.nn.test$byClass[['Sensitivity']],
confusionMatrix.keras.test$byClass[['Sensitivity']])
specificities = c(confusionMatrix.logit.test$byClass[['Specificity']],
confusionMatrix.tree.pruned.test$byClass[['Specificity']],
confusionMatrix.rf2.test$byClass[['Specificity']],
confusionMatrix.xgb6.test$byClass[['Specificity']],
confusionMatrix.nn.test$byClass[['Specificity']],
confusionMatrix.keras.test$byClass[['Specificity']])
balancedAccuracies = c(confusionMatrix.logit.test$byClass[['Balanced Accuracy']],
confusionMatrix.tree.pruned.test$byClass[['Balanced Accuracy']],
confusionMatrix.rf2.test$byClass[['Balanced Accuracy']],
confusionMatrix.xgb6.test$byClass[['Balanced Accuracy']],
confusionMatrix.nn.test$byClass[['Balanced Accuracy']],
confusionMatrix.keras.test$byClass[['Balanced Accuracy']])
F1s = c(confusionMatrix.logit.test$byClass[['F1']],
confusionMatrix.tree.pruned.test$byClass[['F1']],
confusionMatrix.rf2.test$byClass[['F1']],
confusionMatrix.xgb6.test$byClass[['F1']],
confusionMatrix.nn.test$byClass[['F1']],
confusionMatrix.keras.test$byClass[['F1']])
# collect the test-set metrics into one comparison table
# (named metrics.summary to avoid shadowing base::sum)
metrics.summary = data.frame(Accuracy = accuracies, Sensitivity = sensitivities,
                             Specificity = specificities, 'Balanced Accuracy' = balancedAccuracies,
                             F1 = F1s,
                             row.names = rNames)
metrics.summary
## Accuracy Sensitivity Specificity Balanced.Accuracy F1
## logit 0.6040100 0.5576441 0.6503759 0.6040100 0.5847569
## pruned tree 0.5795739 0.6115288 0.5476190 0.5795739 0.5925926
## RF 0.5946115 0.5551378 0.6340852 0.5946115 0.5779517
## XGB 0.6058897 0.5814536 0.6303258 0.6058897 0.5960180
## nn 0.5620301 0.4423559 0.6817043 0.5620301 0.5024911
## keras 0.6008772 0.6641604 0.5375940 0.6008772 0.6246317
Classification accuracy is quite similar across all methods. Sensitivity is best for the neural network built with the keras package and worst for the network built with the neuralnet package; specificity shows exactly the opposite pattern. A caveat applies here: as noted above, caret took 0 as the positive class in the keras confusion matrices, so the keras sensitivity and specificity in the table refer to the 'away' label and are effectively swapped relative to the other rows. Balanced accuracy is best for XGB, with logistic regression a close second. The F1 statistic is noticeably best for the keras network and worst for the neuralnet one. The sensitivity/specificity trade-off also depends on the fixed 0.5 cutoff used throughout; a sketch of an alternative cutoff choice follows.
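All of the confusion matrices above use a fixed 0.5 cutoff. A minimal sketch of how the cutoff could instead be chosen from the ROC curve, using pROC's coords on the existing ROC objects (an illustration only, not part of the original analysis):
# cutoff maximizing Youden's J (sensitivity + specificity - 1) on the
# keras test ROC; an illustration, not used in the results above
pROC::coords(ROC.keras.test, "best", best.method = "youden",
             ret = c("threshold", "sensitivity", "specificity"))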
The presented statistics show that the predictive power of our models is far from perfect. There is still room for improvement, for example by testing other neural network layouts, as sketched below.
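For instance, a hypothetical alternative keras layout with ReLU activations and dropout regularization could be tried; the sketch below was not fitted in this report:
# a smaller network with ReLU hidden layers and dropout regularization;
# a hypothetical layout shown only as an example of a possible next step
model.alt <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = 12) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid")
model.alt %>% compile(optimizer = "adam",
                      loss = "binary_crossentropy",
                      metrics = tf$keras$metrics$AUC())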