Classification Tree and Binary Logistic Regression Models

Home and Away Teams

Introduction

Previous research (Lago-Penas & Lago-Ballesteros, 2011) showed that, in the 2008-2009 Spanish League season, the Home team won 62% of the games. When teams were divided into four quarters based on quality, Home team wins ranged from 53%, when two top-quarter teams met, to 82%, when a top-quarter Home team met a second-, third- or fourth-quarter team. The headline rate of 62% was the same as that of a previous worldwide analysis (Pollard & Pollard, 2005; Pollard, 2006).

Drawing on 600 games in the English Premier League, this study developed a model to classify teams as Home or Away, based on two categorical variables (Win/Lose/Draw at half-time and at full-time) and eight numeric variables: difference in full-time goals and in half-time goals; difference in shots made and in shots on target; difference in fouls made and in corners awarded; and difference in yellow cards and in red cards shown. The importance of the variables to the model and the model’s overall classification accuracy and generalisability were assessed.

Drawing on 160 games in the English Premier League, the accuracy of the model’s classification of Home Teams and Away Teams was tested. Again, the variables’ importance to the model and its overall accuracy were assessed.

By comparing both results, a conclusion was drawn as to whether the classification model was over-fitting the training dataset.

Create and Visualise Classification Tree

Interpret Classification Tree

One Rule for predicting if a team is a Home Team

If the Team wins, i.e. its full-time (ft) result is neither a Draw nor a Loss (wdl_ft = Draw, Lose - No), the Team is classified as a Home Team. There is a .62 probability that the Teams in this leaf are Home Teams, which does not indicate high purity.

One Rule for predicting if a team is an Away Team

If the Team draws or loses (wdl_ft = Draw, Lose), then the Team is classified as an Away Team. There is a .58 probability that teams in this leaf are Away teams, which does not indicate high purity.

Variable Percentage Importance

  c_diff ftg_diff   wdl_ft   s_diff  st_diff htg_diff   wdl_ht   f_diff   y_diff
      20       18       17       13        9        8        8        4        2

Each variable’s importance is shown as a percentage of total improvement to the starting 50/50 model. Corner difference (c_diff), Full Time goal difference (ftg_diff) and Win, Draw or Lose at full time (wdl_ft) accounted for 20%, 18% and 17%, respectively, of improvements. Red card difference (r_diff) contributed less than 1% to improvement and is not shown.

Model’s Confusion Matrix and Accuracy on Training Data

      Predicted
Actual Away Home
  Away  183  119
  Home   91  207

Drawing on the training data, the Confusion Matrix shows that 390 (183 + 207) of the 600 games were correctly predicted, which is equivalent to the 65% Accuracy shown in [1] below.

In terms of predicting Home Teams, the Classification Tree correctly predicted 207 out of 298 (207 + 91), indicating a 69.5% accuracy of predicting a Home Team.

In terms of predicting Away Teams, the Classification Tree correctly identified 183 out of 302 (183 + 119), indicating a 60.6% accuracy of predicting an Away Team.

[1] "65%"
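
The accuracy figures quoted above follow directly from the cells of the Confusion Matrix. The following is a minimal Python sketch of that arithmetic, not the code used to produce this report; the dictionary layout and the function names `overall_accuracy` and `class_accuracy` are assumptions introduced purely for illustration.

```python
# Confusion matrix for the Classification Tree on the training data,
# laid out as in the table above: outer keys are actual classes,
# inner keys are predicted classes.
confusion = {
    "Away": {"Away": 183, "Home": 119},
    "Home": {"Away": 91, "Home": 207},
}


def overall_accuracy(cm):
    """Share of all games that lie on the diagonal (correctly classified)."""
    correct = sum(cm[label][label] for label in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total


def class_accuracy(cm, label):
    """Share of actual `label` teams that were also predicted as `label`."""
    return cm[label][label] / sum(cm[label].values())


print(f"Overall: {overall_accuracy(confusion):.1%}")
print(f"Home:    {class_accuracy(confusion, 'Home'):.1%}")
print(f"Away:    {class_accuracy(confusion, 'Away'):.1%}")
```

The same two functions can be applied to any of the other Confusion Matrices in this report by changing the four cell counts.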

Model’s Confusion Matrix and Accuracy on Testing Data

      Predicted
Actual Away Home
  Away   43   35
  Home   24   58

Drawing on the testing data, the Confusion Matrix shows that 101 (43 + 58) out of 160 test games were correctly classified by the model, which is equivalent to 63.1% overall Accuracy shown in [1] below.

In terms of predicting Home Teams, the Classification Tree correctly identified 58 out of 82 (58 + 24), indicating a 70.7% accuracy of predicting a Home Team with the Test Data.

In terms of predicting Away Teams, the Classification Tree correctly identified 43 out of 78 (43 + 35), indicating a 55.1% accuracy of predicting an Away Team with the Test Data.

[1] "63.1%"

Conclusion

Each variable’s importance to the Classification Tree Model was based on its percentage contribution to the total improvement of the starting 50/50 model. Corner difference (c_diff), Full Time goal difference (ftg_diff), Win, Draw or Lose at full time (wdl_ft), Shot Difference (s_diff), contributed 20%, 18%, 17% and 13%, respectively, to improvement in the model, with three other variables contributing between 8% and 9%.

The overall accuracy of the Classification Model was 65% when we used the training data and 63% when we used the testing data. This suggests that the model would generalise well with new data.

When the training data was used, the Classification Model correctly identified Home Teams and Away Teams 69% and 61% of the time, respectively.

When testing data was used, the Classification Model correctly classified Home Teams and Away Teams 71% and 55% of the time, respectively.

The overall accuracy on the training data (65%) and on the testing data (63%) was similar, so there was no evidence that the Classification Model overfitted the training data.

Similarly, when the accuracy of predicting the Home Team was compared on the Training (69%) and Testing (71%) data, there was little evidence of over-fitting of the training data.

However, when the accuracy of predicting an Away team on the Training data (61%) was compared to that of the Testing data (55%) there was some suggestion of overfitting of the training data. The Classification Tree model was a poor predictor of an Away Team.

Predict Home and Away Team - Binary Logistic Regression

Using the same data from the English Premier League, the accuracy of a Binary Logistic Regression Model in predicting Home Teams and Away Teams was tested. Again, the variables’ importance to the model and Model’s overall accuracy were assessed. By comparing results of the model with Training and Testing Data, a conclusion was drawn as to whether the Binary Logistic Regression model was over-fitting the training data.

Finally, the Classification Tree and Binary Logistic Regression Models were compared to assess their accuracy and compare the variables that were important to each.

Create predictor variables

The predictor variables consisted of two categorical variables (wdl_ft; wdl_ht) and eight numerical variables (ftg_diff; htg_diff; s_diff; st_diff; f_diff; c_diff; y_diff; r_diff).

Set up levels of response variable

The binary variable ‘home_or_away’ was the target variable. The first level of the target variable was ‘failure’ (‘Away’) and the second level was ‘success’ (‘Home’). The required order was specified in the code for the columns. The resulting order for the training and the testing data is shown on the two separate lines marked [1] below.

[1] "Away" "Home"

[1] "Away" "Home"

The Logistic Regression Equation

Formally, the Regression Equation reads: y = ln(π/(1 − π)) = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ, where π is the probability that a team is a Home Team, β₀ is the intercept and β₁, …, βₙ are the coefficients of the predictor variables X₁, …, Xₙ. An Estimate of each coefficient is shown below, together with its P-Value and a Code for its Significance level.


Call:
NULL

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   -0.145      0.189  -0.771    0.441  
wdl_ftLose    -0.425      0.301  -1.412    0.158  
wdl_ftWin      0.563      0.301   1.873    0.061 .
wdl_htLose     0.091      0.321   0.285    0.776  
wdl_htWin      0.178      0.326   0.548    0.584  
ftg_diff      -0.147      0.119  -1.237    0.216  
htg_diff       0.097      0.184   0.525    0.599  
s_diff         0.046      0.018   2.549    0.011 *
st_diff       -0.020      0.041  -0.499    0.618  
f_diff        -0.020      0.019  -1.032    0.302  
c_diff         0.019      0.025   0.742    0.458  
y_diff        -0.005      0.057  -0.087    0.931  
r_diff         0.226      0.260   0.868    0.386  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 831.75  on 599  degrees of freedom
Residual deviance: 780.45  on 587  degrees of freedom
AIC: 806.45

Number of Fisher Scoring iterations: 4

Specifically, the Regression Equation reads: y = ln(π/(1 − π)) = -0.145 - 0.425 (wdl_ftLose) + 0.563 (wdl_ftWin) + 0.091 (wdl_htLose) + 0.178 (wdl_htWin) - 0.147 (ftg_diff) + 0.097 (htg_diff) + 0.046 (s_diff) - 0.020 (st_diff) - 0.020 (f_diff) + 0.019 (c_diff) - 0.005 (y_diff) + 0.226 (r_diff).
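
To illustrate how the fitted equation turns a set of predictor values into a probability, the following Python sketch evaluates the log-odds and applies the inverse logit. It is an illustration only, not the code behind this report. The coefficient values are copied from the Coefficients table; the function names and the example team are hypothetical, and the sketch assumes that the two Win/Draw/Lose variables are dummy-coded with Draw as the reference level, as the coefficient names wdl_ftLose and wdl_ftWin imply.

```python
import math

# Coefficient estimates copied from the Coefficients table above.
COEFFICIENTS = {
    "(Intercept)": -0.145,
    "wdl_ftLose": -0.425,
    "wdl_ftWin": 0.563,
    "wdl_htLose": 0.091,
    "wdl_htWin": 0.178,
    "ftg_diff": -0.147,
    "htg_diff": 0.097,
    "s_diff": 0.046,
    "st_diff": -0.020,
    "f_diff": -0.020,
    "c_diff": 0.019,
    "y_diff": -0.005,
    "r_diff": 0.226,
}


def log_odds(features):
    """Linear predictor: intercept plus coefficient times value for each feature.

    Predictors missing from `features` are treated as 0, which for the dummy
    variables corresponds to the reference level (assumed to be Draw).
    """
    return COEFFICIENTS["(Intercept)"] + sum(
        COEFFICIENTS[name] * value for name, value in features.items()
    )


def home_probability(features):
    """Inverse logit of the linear predictor, i.e. the modelled P(Home)."""
    return 1.0 / (1.0 + math.exp(-log_odds(features)))


# Hypothetical team-game: led at half time (1-0), won 2-0,
# with 5 more shots and 2 more shots on target than the opponent.
example = {
    "wdl_ftWin": 1,
    "wdl_htWin": 1,
    "ftg_diff": 2,
    "htg_diff": 1,
    "s_diff": 5,
    "st_diff": 2,
}

print(f"log-odds = {log_odds(example):.3f}")
print(f"P(Home)  = {home_probability(example):.3f}")
```

A team would be classified as a Home Team when this probability exceeds the chosen cut-off; a cut-off of 0.5 is the usual default, although the report does not state which value was used.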

Important Predictor Variables

The P-Value of each coefficient is shown in the last column of the Coefficients table as Pr(>|z|), and the Significance Code beside it indicates the level at which the coefficient is significant. For each coefficient, the Null Hypothesis is that the coefficient equals 0, in which case the variable is not important in predicting the outcome. Two predictor variables stand out. The Shot Difference (s_diff) is significant at the 0.05 level (p = 0.011). The Win at Full Time (wdl_ftWin) falls just short of the 0.05 level (p = 0.061) and is significant only at the 0.1 level, so it is treated here as a marginally important predictor. Since the P-Values of all other coefficients exceed 0.1, those variables are considered unimportant predictors of ‘Home Team’ and ‘Away Team’.
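
The significance codes can be checked mechanically against the legend printed under the Coefficients table. The Python sketch below is an illustration of that mapping rather than part of the original analysis; the function name `significance_code` is hypothetical, and treating each cut-point as an inclusive upper bound is an assumption of the sketch.

```python
# p-values copied from the Pr(>|z|) column of the Coefficients table above.
P_VALUES = {
    "(Intercept)": 0.441,
    "wdl_ftLose": 0.158,
    "wdl_ftWin": 0.061,
    "wdl_htLose": 0.776,
    "wdl_htWin": 0.584,
    "ftg_diff": 0.216,
    "htg_diff": 0.599,
    "s_diff": 0.011,
    "st_diff": 0.618,
    "f_diff": 0.302,
    "c_diff": 0.458,
    "y_diff": 0.931,
    "r_diff": 0.386,
}


def significance_code(p_value):
    """Map a p-value to the symbol in the 'Signif. codes' legend."""
    # Cut-points follow the legend; inclusive upper bounds are an assumption.
    for cutoff, symbol in ((0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, ".")):
        if p_value <= cutoff:
            return symbol
    return " "


# Keep only the coefficients that receive a non-blank code.
flagged = {
    name: significance_code(p)
    for name, p in P_VALUES.items()
    if significance_code(p) != " "
}
print(flagged)
```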

Impact of Important Predictor Variables on Odds of Team being classified as Home Team

A positive coefficient indicates that the variable in question increases the odds of a team being predicted (classified) as a Home Team, holding the other variables constant. The coefficient estimates for s_diff and wdl_ftWin are both positive. For every unit increase in the difference in shots made (s_diff), the odds of being a Home Team are multiplied by e^0.046 = 1.047, an increase of 4.7%. If a team Wins at Full Time (wdl_ftWin), the odds of being a Home Team are multiplied by e^0.563 = 1.756, an increase of 75.6%, relative to a team that Draws (the reference level of wdl_ft).
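
The odds multipliers quoted above are simply the exponentials of the coefficient estimates. As a brief Python sketch of that conversion (the function names `odds_ratio` and `percent_change_in_odds` are hypothetical and used only for illustration):

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(coefficient)


def percent_change_in_odds(coefficient):
    """The same change expressed as a percentage increase or decrease."""
    return (math.exp(coefficient) - 1.0) * 100.0


# Estimates copied from the Coefficients table above.
for name, estimate in {"s_diff": 0.046, "wdl_ftWin": 0.563}.items():
    print(
        f"{name}: odds ratio {odds_ratio(estimate):.3f}, "
        f"change in odds {percent_change_in_odds(estimate):+.1f}%"
    )
```

Note that these percentages describe changes in the odds of being a Home Team, not changes in the probability or in classification accuracy.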

Accuracy of Binary Logistic Regression on Training Data

      Predicted
Actual Away Home
  Away  194  108
  Home  119  179

Drawing on the training data, the Confusion Matrix shows that 373 (194 + 179) out of the 600 games were correctly classified by the model, which is equivalent to 62.2% overall Accuracy shown in [1] below.

Drawing on the training data, the Binary Logistic Regression Model correctly classified 179 Home Teams out of an Actual total of Home teams of 298 (179 + 119) or 60.1% and 194 Away teams out of an Actual total of 302 Away Teams (194 + 108) or 64.2%.

[1] "62.2%"

Accuracy of Binary Logistic Regression on Testing Data

      Predicted
Actual Away Home
  Away   53   25
  Home   30   52

Drawing on the test data, the Confusion Matrix shows that 105 (53 + 52) out of 160 games were correctly classified by the model, which is equivalent to 65.6% overall Accuracy as shown in [1] below.

Drawing on the test data, the Binary Logistic Regression Model correctly classified 52 Home Teams out of an Actual total of 82 (52 + 30) Home teams or 63.4%.

Drawing on the test data, the Binary Logistic Regression Model correctly predicted 53 Away Teams out of an Actual total of 78 (53 + 25) Away teams or 67.9%.

[1] "65.6%"

Conclusion

The overall accuracy of the Binary Logistic Regression Model was 62.2% when we used the training data and 65.6% when we used the testing data. This suggests that the model was more accurate when using testing data and that the training model did not overfit the data.

Drawing on the training data, the Binary Logistic Regression Model correctly predicted the Home Team and Away Team 60.1% and 64.2% of the time, respectively.

Drawing on the testing data, the Binary Logistic Regression Model correctly predicted the Home Team and Away Team 63.4% and 67.9% of the time, respectively. As neither figure was lower than its training counterpart, this again suggested that the model generalised well.

Compare and Contrast Classification Tree and Binary Logistic Regression Models

Model Accuracy

The performance of the two models, in terms of overall accuracy and accuracy in predicting Home and Away Teams on the Training and Testing data, is summarised in Table 1.

Table 1. Comparing Accuracy of Classification Tree and Binary Logistic Regression

Model	Overall (Training)	Overall (Testing)	Home Team (Training)	Away Team (Training)	Home Team (Testing)	Away Team (Testing)
Classification Tree	65%	63%	69%	61%	71%	55%
Binary Logistic Regression	62%	66%	60%	64%	63%	68%

The Classification Tree was three percentage points more accurate than the Binary Logistic Regression Model with the training data (65% v 62%) but three percentage points less accurate with the testing data (63% v 66%), suggesting that the Logistic Regression Model was more desirable. As there was little discrepancy between overall accuracy on the training and testing data, there was no evidence of over-fitting by either model on the Training data. The Classification Tree was eight percentage points more accurate (71% v 63%) in predicting the Home Team with the testing data but thirteen percentage points less accurate (55% v 68%) in predicting the Away Team with the testing data, thus indicating a more balanced and better overall performance by the Binary Logistic Regression Model.
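
The entries of Table 1, and the gap between training and testing accuracy used to judge over-fitting, can be recomputed from the four Confusion Matrices reported earlier. The Python sketch below is one way to do so under stated assumptions; the tuple layout and the names `MATRICES`, `summarise` and `overall_gap` are hypothetical and not taken from the original analysis.

```python
# Cell counts from the four Confusion Matrices in this report, stored as
# (actual Away predicted Away, actual Away predicted Home,
#  actual Home predicted Away, actual Home predicted Home).
MATRICES = {
    ("Classification Tree", "Training"): (183, 119, 91, 207),
    ("Classification Tree", "Testing"): (43, 35, 24, 58),
    ("Binary Logistic Regression", "Training"): (194, 108, 119, 179),
    ("Binary Logistic Regression", "Testing"): (53, 25, 30, 52),
}


def summarise(matrix):
    """Return overall, Home and Away accuracy as percentages."""
    away_away, away_home, home_away, home_home = matrix
    overall = (away_away + home_home) / sum(matrix)
    home = home_home / (home_away + home_home)
    away = away_away / (away_away + away_home)
    return {"overall": 100 * overall, "home": 100 * home, "away": 100 * away}


def overall_gap(model):
    """Testing minus training overall accuracy, in percentage points."""
    train = summarise(MATRICES[(model, "Training")])["overall"]
    test = summarise(MATRICES[(model, "Testing")])["overall"]
    return test - train


for (model, dataset), matrix in MATRICES.items():
    s = summarise(matrix)
    print(
        f"{model:<27} {dataset:<9} overall {s['overall']:.1f}%  "
        f"Home {s['home']:.1f}%  Away {s['away']:.1f}%"
    )

for model in ("Classification Tree", "Binary Logistic Regression"):
    print(f"{model}: testing minus training = {overall_gap(model):+.1f} points")
```

A clearly negative gap would point towards over-fitting; a gap close to zero, or a positive one, would not.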

The Binary Logistic Regression Model identified two important variables in the prediction of the Home Team: Shot Difference (s_diff; p = 0.011) and Win at Full Time (wdl_ftWin; p = 0.061).

Overall Conclusion

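The Logistic Regression Model was superior to the Classification Tree Model in terms of Overall Accuracy with the Testing Data (66% v 63%) and in terms of correctly classifying Away Teams with the Testing Data (68% v 55%). The Classification Tree was better only at classifying Home Teams with the Testing Data (71% v 63%).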

While previous research identified that, on average, Home Teams win 62% of games, this study showed that, with the testing data, the Home Team can be correctly classified 63% to 71% of the time, and the Away Team 55% to 68% of the time, depending on whether the Classification Tree or the Binary Logistic Regression Model is used.

Using the training and testing data, neither model indicated over-fitting in terms of overall accuracy, with the Binary Logistic Regression Model performing better with the testing data. The Binary Logistic Regression Model provided a more even accuracy when predicting Home Teams and Away Teams on the Testing data (63% and 68%) than the Classification Tree (71% and 55%).

In terms of parsimony, the Logistic Regression Model identified two important variables (s_diff and wdl_ftWin) which contributed to the correct classification of Home Teams and Away Teams. The Classification Tree drew on three variables of broadly similar importance (17% to 20%), another whose importance was 13% and three whose importance was between 8% and 9%.

The Logistic Regression Model also allowed the ‘odds ratio’ impact of each of the two predictor variables to be calculated, that is, the percentage change in the odds of a team being classified as a Home Team arising from a unit increase in that variable.

Overall, the Binary Logistic Regression Model displayed more desirable characteristics.

References

Lago-Penas, C., & Lago-Ballesteros, J. (2011). Game Location and Team Quality Effects on Performance Profiles in Professional Soccer. Journal of Sports Science and Medicine, 10, 465-471.

Pollard, R., & Pollard, G. (2005). Long-term Trends in Home Advantage in Professional Team Sport in North America and England (1876-2003). Journal of Sports Sciences, 23(4), 337-350.

Pollard, R. (2006). Home Advantage in Soccer: Variations in its Magnitude and a Literature Review of the Inter-Related Factors Associated with its Existence. Journal of Sport Behavior, 29, 169-189.