Classification Tree and Binary Logistic Regression Models
Home and Away Teams
Introduction
Previous research (Lago-Penas & Lago-Ballesteros, 2011) showed that, in the Spanish League of 2008-2009, the Home team won 62% of the games. When teams were divided into four quarters based on quality, Home team wins ranged from 53% when two top quarter teams met to 82% when a top quarter Home team met a second, third or fourth quarter team. The headline rate of 62% was the same as that of a previous worldwide analysis (Pollard & Pollard, 2005; Pollard, 2006).
Drawing on 600 games in the English Premier League, this study developed a model to classify teams as Home or Away. The categorical predictor variables were Win/Draw/Lose at half-time and at full-time. The numeric predictor variables were the differences in full-time goals, half-time goals, shots made, shots on target, fouls made, corners awarded, yellow cards shown and red cards shown. The importance of the variables to the model and the model’s overall classification accuracy and generalisability were assessed.
Drawing on 160 games in the English Premier League, the accuracy of the model’s classification of Home Teams and Away Teams was tested. Again, the variables’ importance to the model and its overall accuracy were assessed.
By comparing both results, a conclusion was drawn as to whether the classification model was over-fitting the training dataset.
Create and Visualise Classification Tree
Interpret Classification Tree
One Rule for predicting if a team is a Home Team
If the Team wins, i.e. its full time (ft) result is neither a Draw nor a Loss (wdl_ft = Draw, Lose - No), the Team is classified as a Home Team. There is a .62 probability that the Teams in this leaf are Home Teams, which does not indicate high purity.
One Rule for predicting if a team is an Away Team
If the Team draws or loses (wdl_ft = Draw, Lose), then the Team is classified as an Away Team. There is a .58 probability that teams in this leaf are Away teams, which does not indicate high purity.
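The two rules quoted above can be written as a small lookup. This is a minimal sketch of those two rules only, not of the full fitted tree; the function name is hypothetical, and the level names Win, Draw and Lose are assumed from the model output later in the report.

```python
from typing import Tuple

# Leaf probabilities quoted in the two rules above.
LEAF_PROBABILITY = {"Home": 0.62, "Away": 0.58}

def classify_by_full_time_result(wdl_ft: str) -> Tuple[str, float]:
    """Return (predicted class, leaf probability) for one team.

    Hypothetical helper: it encodes only the two quoted rules.
    """
    if wdl_ft not in {"Win", "Draw", "Lose"}:
        raise ValueError(f"unexpected wdl_ft level: {wdl_ft!r}")
    label = "Home" if wdl_ft == "Win" else "Away"
    return label, LEAF_PROBABILITY[label]

for result in ("Win", "Draw", "Lose"):
    print(result, classify_by_full_time_result(result))
```

Probabilities this close to 0.5 mean that either rule, taken alone, is only a little better than a coin toss.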
Variable Percentage Importance
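| Variable | c_diff | ftg_diff | wdl_ft | s_diff | st_diff | htg_diff | wdl_ht | f_diff | y_diff |
|---|---|---|---|---|---|---|---|---|---|
| Importance (%) | 20 | 18 | 17 | 13 | 9 | 8 | 8 | 4 | 2 |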
Each variable’s importance is shown as a percentage of total improvement to the starting 50/50 model. Corner difference (c_diff), Full Time goal difference (ftg_diff) and Win, Draw or Lose at full time (wdl_ft) accounted for 20%, 18% and 17%, respectively, of improvements. Red card difference (r_diff) contributed less than 1% to improvement and is not shown.
Model’s Confusion Matrix and Accuracy on Training Data
Predicted
Actual Away Home
Away 183 119
Home 91 207
Drawing on the training data, the Confusion Matrix shows that 390 (183 + 207) of the 600 games were correctly classified, which is equivalent to the 65% Accuracy shown in [1] below.
In terms of predicting Home Teams, the Classification Tree correctly identified 207 out of 298 (207 + 91) actual Home Teams, indicating a 69.5% accuracy of predicting a Home Team.
In terms of predicting Away Teams, the Classification correctly identified 183 out of 302 (183 + 119) indicating a 60.6% accuracy of predicting an Away Team.
[1] "65%"
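The arithmetic behind these figures can be reproduced directly from the confusion matrix. The sketch below is illustrative only: the function names are hypothetical, and the per-class accuracy reported in this study corresponds to what is usually called recall (correct predictions divided by the actual number in that class).

```python
# Training confusion matrix for the classification tree, as reported above.
# Keys are (actual, predicted).
train_cm = {
    ("Away", "Away"): 183, ("Away", "Home"): 119,
    ("Home", "Away"): 91, ("Home", "Home"): 207,
}

def overall_accuracy(cm):
    """Share of all cases on the diagonal of the confusion matrix."""
    correct = sum(n for (actual, predicted), n in cm.items() if actual == predicted)
    return correct / sum(cm.values())

def class_accuracy(cm, label):
    """Correct predictions for `label` divided by actual cases of `label`."""
    actual_total = sum(n for (actual, _), n in cm.items() if actual == label)
    return cm[(label, label)] / actual_total

print(f"overall: {overall_accuracy(train_cm):.1%}")
print(f"Home:    {class_accuracy(train_cm, 'Home'):.1%}")
print(f"Away:    {class_accuracy(train_cm, 'Away'):.1%}")
```

The same two functions apply unchanged to the testing matrix once its four counts are substituted.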
Model’s Confusion Matrix and Accuracy on Testing Data
Predicted
Actual Away Home
Away 43 35
Home 24 58
Drawing on the testing data, the Confusion Matrix shows that 101 (43 + 58) out of 160 test games were correctly classified by the model, which is equivalent to 63.1% overall Accuracy shown in [1] below.
In terms of predicting Home Teams the Classification correctly identified 58 out of 82 (58 + 24) indicating a 70.7% accuracy of predicting a Home Team with the Test Data.
In terms of predicting Away Teams, the Classification correctly identified 43 out of 78 (43 + 35) indicating a 55.1% accuracy of predicting an Away Team with the Test Data.
[1] "63.1%"
Conclusion
Each variable’s importance to the Classification Tree Model was based on its percentage contribution to the total improvement of the starting 50/50 model. Corner difference (c_diff), Full Time goal difference (ftg_diff), Win, Draw or Lose at full time (wdl_ft), Shot Difference (s_diff), contributed 20%, 18%, 17% and 13%, respectively, to improvement in the model, with three other variables contributing between 8% and 9%.
The overall accuracy of the Classification Model was 65% when we used the training data and 63% when we used the testing data. This suggests that the model would generalise well with new data.
When the training data was used, the Classification Model correctly identified Home Teams and Away Teams 70% and 61% of the time, respectively.
When testing data was used, the Classification Model correctly classified Home Teams and Away Teams 71% and 55% of the time, respectively.
The overall accuracy on the training (65%) and testing (63%) data was similar, suggesting that there was no evidence that the Classification Model overfitted the training data.
Similarly, when the accuracy of predicting the Home Team was compared on the Training (70%) and Testing (71%) data, there was little evidence of over-fitting of the training data.
However, when the accuracy of predicting an Away team on the Training data (61%) was compared to that of the Testing data (55%) there was some suggestion of overfitting of the training data. The Classification Tree model was a poor predictor of an Away Team.
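One way to make the over-fitting comparison explicit is to compute the gap between training and testing accuracy in percentage points. This is a sketch using the counts from the two confusion matrices above; the helper name is hypothetical, and no formal cut-off for what counts as over-fitting is implied.

```python
# Accuracies for the classification tree, from the two confusion matrices.
train = {"overall": 390 / 600, "Home": 207 / 298, "Away": 183 / 302}
test = {"overall": 101 / 160, "Home": 58 / 82, "Away": 43 / 78}

def gaps_in_points(train_scores, test_scores):
    """Training minus testing accuracy, in percentage points."""
    return {k: 100 * (train_scores[k] - test_scores[k]) for k in train_scores}

for name, gap in gaps_in_points(train, test).items():
    print(f"{name:8s} train - test = {gap:+.1f} points")
```

A positive gap means the model did better on the data it was fitted to; the larger the gap, the stronger the hint of over-fitting.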
Predict Home and Away Team - Binary Logistic Regression
Using the same data from the English Premier League, the accuracy of a Binary Logistic Regression Model in predicting Home Teams and Away Teams was tested. Again, the variables’ importance to the model and Model’s overall accuracy were assessed. By comparing results of the model with Training and Testing Data, a conclusion was drawn as to whether the Binary Logistic Regression model was over-fitting the training data.
Finally, the Classification Tree and Binary Logistic Regression Models were compared to assess their accuracy and compare the variables that were important to each.
Create predictor variables
The predictor variables consisted of two categorical variables (wdl_ft; wdl_ht) and eight numerical variables (ftg_diff; htg_diff; s_diff; st_diff; f_diff; c_diff; y_diff; r_diff).
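The report does not show how these variables were derived, so the following is only a sketch under stated assumptions: the record type and its field names are hypothetical, and each difference is assumed to be the team's own count minus its opponent's count. Only three of the ten predictors are shown.

```python
from dataclasses import dataclass

@dataclass
class TeamGame:
    """One team's view of one match (hypothetical field names)."""
    goals_for: int
    goals_against: int
    shots_for: int
    shots_against: int

def full_time_result(game: TeamGame) -> str:
    """Win, Draw or Lose from the team's own point of view."""
    if game.goals_for > game.goals_against:
        return "Win"
    if game.goals_for < game.goals_against:
        return "Lose"
    return "Draw"

def difference_features(game: TeamGame) -> dict:
    # Assumed convention: team's count minus opponent's count.
    return {
        "wdl_ft": full_time_result(game),
        "ftg_diff": game.goals_for - game.goals_against,
        "s_diff": game.shots_for - game.shots_against,
    }

# Made-up match: a 2-1 win with 14 shots against 9.
print(difference_features(TeamGame(2, 1, 14, 9)))
```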
Set up levels of response variable
The binary variable ‘home_or_away’ was the target variable. The first level of the target variable was ‘failure’ (‘Away’) and the second level was ‘success’ (‘Home’). The required order was specified in the code for the columns. The resulting order for the training and the testing data is shown on the two separate lines in [1] below.
[1] "Away" "Home"
[1] "Away" "Home"
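The effect of fixing the level order can be illustrated with a plain mapping. This is a sketch, not the code used in the study: the names are hypothetical and the sample labels are made up.

```python
# Explicit level order: the first level is the reference ('failure'),
# the second is the event being modelled ('success').
LEVELS = ["Away", "Home"]
CODE = {level: index for index, level in enumerate(LEVELS)}

def encode_target(labels):
    """Map 'Away'/'Home' strings to 0/1 using the fixed level order."""
    return [CODE[label] for label in labels]

sample = ["Home", "Away", "Away", "Home"]  # made-up labels for illustration
print(LEVELS)
print(encode_target(sample))
```

With this order, a fitted probability refers to the second level, so the model estimates the probability that a team is a Home Team.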
The Logistic Regression Equation
Formally, the Regression Equation reads: $y = \ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$, where $\pi$ is the probability that a team is a Home Team, $\beta_0$ is the intercept and $\beta_1, \dots, \beta_n$ are the variable coefficients. An Estimate of each is shown below with its P-Value and a Code for its Significance level.
Call:
NULL
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.145 0.189 -0.771 0.441
wdl_ftLose -0.425 0.301 -1.412 0.158
wdl_ftWin 0.563 0.301 1.873 0.061 .
wdl_htLose 0.091 0.321 0.285 0.776
wdl_htWin 0.178 0.326 0.548 0.584
ftg_diff -0.147 0.119 -1.237 0.216
htg_diff 0.097 0.184 0.525 0.599
s_diff 0.046 0.018 2.549 0.011 *
st_diff -0.020 0.041 -0.499 0.618
f_diff -0.020 0.019 -1.032 0.302
c_diff 0.019 0.025 0.742 0.458
y_diff -0.005 0.057 -0.087 0.931
r_diff 0.226 0.260 0.868 0.386
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 831.75 on 599 degrees of freedom
Residual deviance: 780.45 on 587 degrees of freedom
AIC: 806.45
Number of Fisher Scoring iterations: 4
Specifically, the Regression Equation reads: $y = \ln\left(\frac{\pi}{1-\pi}\right) = -0.145 - 0.425\,\text{wdl\_ftLose} + 0.563\,\text{wdl\_ftWin} + 0.091\,\text{wdl\_htLose} + 0.178\,\text{wdl\_htWin} - 0.147\,\text{ftg\_diff} + 0.097\,\text{htg\_diff} + 0.046\,\text{s\_diff} - 0.020\,\text{st\_diff} - 0.020\,\text{f\_diff} + 0.019\,\text{c\_diff} - 0.005\,\text{y\_diff} + 0.226\,\text{r\_diff}$.
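To show how the equation turns a team's match statistics into a probability, the sketch below evaluates the linear predictor with the coefficient estimates from the model summary and applies the inverse logit. The function names and the example team are made up for illustration; dummy variables that are omitted are treated as 0, which corresponds to the reference level Draw.

```python
import math

# Coefficient estimates copied from the model summary above.
COEF = {
    "(Intercept)": -0.145,
    "wdl_ftLose": -0.425, "wdl_ftWin": 0.563,
    "wdl_htLose": 0.091, "wdl_htWin": 0.178,
    "ftg_diff": -0.147, "htg_diff": 0.097,
    "s_diff": 0.046, "st_diff": -0.020, "f_diff": -0.020,
    "c_diff": 0.019, "y_diff": -0.005, "r_diff": 0.226,
}

def log_odds(row):
    """Linear predictor; terms missing from `row` count as 0."""
    return COEF["(Intercept)"] + sum(COEF[name] * value for name, value in row.items())

def probability_home(row):
    """Inverse logit of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-log_odds(row)))

# Made-up team: won at full time, level at half time,
# 2 goals and 5 shots ahead, all other differences 0.
example = {"wdl_ftWin": 1, "ftg_diff": 2, "s_diff": 5}
print(round(log_odds(example), 3), round(probability_home(example), 3))
```

A team would be classified as Home when this probability exceeds the chosen cut-off; a cut-off of 0.5 is the usual default, although the report does not state which was used.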
Important Predictor Variables
The P-Value of each predictor variable is shown in the last column of the Coefficients table (Pr(>|z|)), and the Significance Code beside it indicates its significance level. For each coefficient, the Null Hypothesis is that the coefficient equals 0, in which case the variable is not important in predicting the outcome. Two predictor variables stand out. The Shot Difference (s_diff) is significant at the 0.05 level (p = 0.011). The Win at Full Time (wdl_ftWin) is marginally significant: it meets the 0.10 level (p = 0.061) but narrowly misses the 0.05 level. Since the P-Values of the other coefficients exceed 0.10, they are considered unimportant predictors of ‘Home Team’ and ‘Away Team’.
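The P-Values in the summary follow from the z values, since each is a two-sided tail probability of the standard normal distribution. The sketch below recomputes a few of them and attaches the Significance Code from the legend in the model summary; the function names are hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal (Wald) z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def signif_code(p):
    """Significance code following the legend in the model summary."""
    for cutoff, code in ((0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, ".")):
        if p < cutoff:
            return code
    return " "

for name, z in (("s_diff", 2.549), ("wdl_ftWin", 1.873), ("c_diff", 0.742)):
    p = two_sided_p(z)
    print(f"{name:10s} z = {z:6.3f}  p = {p:.3f}  code = '{signif_code(p)}'")
```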
Impact of Important Predictor Variables on Odds of Team being classified as Home Team
A positive coefficient sign indicates that the variable in question contributes positively to the odds of a team being predicted (classified) as a Home Team. The coefficient estimates for both s_diff and wdl_ftWin are positive. Therefore, holding the other variables constant, for every unit increase in difference in shots made (s_diff), the odds of being a Home Team are multiplied by $e^{0.046} = 1.047$, i.e. the odds increase by 4.7%. If a team Wins at Full Time (wdl_ftWin), the odds of being a Home Team are multiplied by $e^{0.563} = 1.756$ relative to a Draw (the reference level), i.e. the odds increase by 75.6%.
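The conversion from a coefficient to a percentage change in the odds is a one-line calculation. The sketch below applies it to the two coefficients discussed above; the function names are hypothetical.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase."""
    return math.exp(coefficient)

def percent_change_in_odds(coefficient):
    """Percentage change in the odds for a one-unit increase."""
    return 100 * (odds_ratio(coefficient) - 1)

for name, beta in (("s_diff", 0.046), ("wdl_ftWin", 0.563)):
    print(f"{name:10s} OR = {odds_ratio(beta):.3f}  "
          f"change = {percent_change_in_odds(beta):+.1f}%")
```

Note that these are changes in the odds, not in the probability; the change in probability depends on the values of the other variables.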
Accuracy of Binary Logistic Regression on Training Data
Predicted
Actual Away Home
Away 194 108
Home 119 179
Drawing on the training data, the Confusion Matrix shows that 373 (194 + 179) out of the 600 games were correctly classified by the model, which is equivalent to 62.2% overall Accuracy shown in [1] below.
Drawing on the training data, the Binary Logistic Regression Model correctly classified 179 Home Teams out of an Actual total of Home teams of 298 (179 + 119) or 60.1% and 194 Away teams out of an Actual total of 302 Away Teams (194 + 108) or 64.2%.
[1] "62.2%"
Accuracy of Binary Logistic Regression on Testing Data
Predicted
Actual Away Home
Away 53 25
Home 30 52
Drawing on the test data, the Confusion Matrix shows that 105 (53 + 52) out of 160 games were correctly classified by the model, which is equivalent to 65.6% overall Accuracy as shown in [1] below.
Drawing on the test data, the Binary Logistic Regression Model correctly classified 52 Home Teams out of an Actual total of 82 (52 + 30) Home teams or 63.4%.
Drawing on the test data, the Binary Logistic Regression Model correctly predicted 53 Away Teams out of an Actual total of 78 (53 + 25) Away teams or 67.9%.
[1] "65.6%"
Conclusion
The overall accuracy of the Binary Logistic Regression Model was 62.2% when we used the training data and 65.6% when we used the testing data. This suggests that the model was more accurate when using testing data and that the training model did not overfit the data.
Drawing on the training data, the Binary Logistic Regression Model correctly predicted the Home Team and Away Team 60.1% and 64.2% of the time, respectively.
Drawing on the testing data, the Binary Logistic Regression Model correctly predicted the Home Team and Away Team 63.4% and 67.9% of the time, respectively. As neither rate was lower than its training counterpart, this again suggested that the model generalised well.
Compare and Contrast Classification Tree and Binary Logistic Regression Models
Model Accuracy
The performance of the two models, in terms of overall accuracy and accuracy in predicting Home and Away Teams on the Training and Testing data, is summarised in Table 1.
Table 1. Comparing Accuracy of Classification Tree and Binary Logistic Regression
| Model | Training | Testing | Training Classify Home Team | Training Classify Away Team | Testing Classify Home Team | Testing Classify Away Team |
|---|---|---|---|---|---|---|
| Classification Tree | 65% | 63% | 70% | 61% | 71% | 55% |
| Binary Logistic Regression | 62% | 66% | 60% | 64% | 63% | 68% |
The Classification Tree was three percentage points more accurate than the Binary Logistic Regression Model with the training data (65% v 62%) but three percentage points less accurate with the testing data (63% v 66%), suggesting that the Logistic Regression Model was more desirable. As there was little discrepancy between accuracy on the training and testing data, there was no evidence of over-fitting by the models on the Training data. The Classification Tree was eight percentage points more accurate (71% v 63%) in predicting the Home Team with the testing data but thirteen percentage points less accurate (55% v 68%) in predicting the Away Team on the testing data, thus indicating a more balanced performance by the Binary Logistic Regression Model.
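The entries in Table 1 can be recomputed from the four confusion matrices reported earlier. The following sketch does so in one pass; the data layout and function name are illustrative choices rather than the study's own code.

```python
# Confusion matrices reported earlier. Rows are actual (Away, Home),
# columns are predicted (Away, Home).
MATRICES = {
    ("Classification Tree", "Training"): ((183, 119), (91, 207)),
    ("Classification Tree", "Testing"): ((43, 35), (24, 58)),
    ("Binary Logistic Regression", "Training"): ((194, 108), (119, 179)),
    ("Binary Logistic Regression", "Testing"): ((53, 25), (30, 52)),
}

def summarise(matrix):
    """Overall accuracy plus the share of actual Home and Away teams identified."""
    (away_away, away_home), (home_away, home_home) = matrix
    total = away_away + away_home + home_away + home_home
    return {
        "overall": (away_away + home_home) / total,
        "Home": home_home / (home_away + home_home),
        "Away": away_away / (away_away + away_home),
    }

for (model, dataset), matrix in MATRICES.items():
    s = summarise(matrix)
    print(f"{model:27s} {dataset:8s} overall {s['overall']:.0%}  "
          f"Home {s['Home']:.0%}  Away {s['Away']:.0%}")
```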
The Binary Logistic Regression Model identified two important variables in the prediction of the Home Team: Shot Difference (s_diff, p = 0.011) and Win at Full Time (wdl_ftWin, p = 0.061).
Overall Conclusion
The Logistic Regression Model was superior to the Classification Tree Model in terms of Overall Accuracy with the Testing Data (66% v 63%) and in terms of correctly classifying Away Teams with the Testing Data (68% v 55%), although the Classification Tree classified Home Teams more accurately (71% v 63%).
While previous research identified that, on average, Home Teams win 62% of games, this study showed that the Home Team can be accurately classified 63% to 71% of the time, and the Away Team 55% to 68% of the time, depending on whether the Classification Tree or the Binary Logistic Regression Model is used.
Using the training and testing data, neither model indicated over-fitting, in terms of overall accuracy, with the Binary Logistic Regression model performing better with the testing data. The Binary Logistic Regression Model provided a more even accuracy (63% and 68%) when predicting Home Teams and Away Teams on the Testing data compared to the Classification Tree (71% and 55%).
In terms of parsimony, the Logistic Regression Model identified two important variables which contributed to the correct classification of Home Teams and Away Teams: s_diff, significant at the 0.05 level, and wdl_ftWin, significant at the 0.10 level. The Classification Tree used three variables of similar importance (between 17% and 20%), another whose importance was 13% and three whose importance was between 8% and 9%.
The Logistic Regression Model also allowed the ‘odds ratio’ impact of these two predictor variables to be calculated, that is, the percentage increase in the odds of a team being classified as a Home Team arising from a unit increase in each variable.
Overall, the Binary Logistic Regression Model displayed more desirable characteristics.
References
Lago-Penas, C. & Lago-Ballesteros, J. (2011). Game Location and Team Quality Effects on Performance Profiles in Professional Soccer. Journal of Sports Science and Medicine, 10, 465-471.
Pollard, R. & Pollard, G. (2005). Long-term Trends in Home Advantage in Professional Team Sport in North America and England (1876 - 2003). Journal of Sports Sciences, 23(4), 337-350.
Pollard, R. (2006). Home Advantage in Soccer: Variations in its Magnitude and a Literature Review of the Inter-Related Factors Associated with its Existence. Journal of Sport Behavior, 29, 169-189.