Comparison of Classification and Clustering Techniques

1 Introduction

During the Postgraduate Certificate in Sports Analytics and insights, I have learned a variety of techniques used in modern data analytics, two of which are Classification and Clustering. The aim of this report is to investigate the differences between techniques used in both Classification and Clustering, and to determine which methods perform better when applied to sports datasets.

2 Classification

The first technique that we will be applying is classification. The data that I will be using for the Classification techniques is taken from the 2017-18 NBA season. The goal is to use the techniques to try to predict whether or not a player makes it into ESPN’s Top 50 Players list for the 2018-19 season. The dataset contains data from 2017-18 on all NBA players with at least 1500 minutes played, along with a variable called top_50 indicating whether or not the player was listed in the top 50 in the ESPN list. The variables included in the dataset are shown below.

No	Variable	Description
1	Player	Name of player
2	Pos	Position. C = centre, PF = power forward, PG = point guard, SF = small forward, SG = shooting guard
3	FG	Average field goals made per game
4	FGP	Field Goal Percentage = % of field goal attempts that were successful
5	THR	Average three-pointers made per game
6	THRP	Three-point percentage = % of three-point attempts that were successful
7	EFG	Effective Field Goal Percentage: adjusts field goal % to account for the fact that three-point field goals count for three points while other field goals count for two
8	TRB	Total rebound percentage = percentage of available rebounds a player grabbed while playing.
9	AST	Average assists per game
10	STL	Average steals per game
11	BLK	Average blocks per game
12	TOV	Average turnovers per game
13	PF	Average personal fouls per game
14	PTS	Average points scored per game
15	Top50	Binary variable indicating if player made the Top 50 or not.

There are two techniques of Classification that will be used during this report. Firstly we will use a Classification tree, followed by a Binary Logistic Regression. Both methods are widely used in predictive analytics.

2.1 Classification Tree

A Classification Tree is the graphical representation of a series of classification rules. Starting from a root node, which includes all cases, the tree branches into different nodes containing subgroups of cases. The splitting criterion, also known as branching criterion, is optimally determined after examining the values of all included predictor variables (Podgorelec et al, 2002). Below is a Classification Tree used to classify a player as being inside or outside the Top 50. Note that average points per game was not used in the model.

The two paths that end at the bottom right leaf and the bottom left leaf lead to the leaves with the highest purity. If a player made 7.1 or more field goals per game, they move to the bottom right leaf, where the probability of making the Top 50 is 81%. This leaf therefore has a purity of 0.81. On the other side of the tree, a player who made fewer than 7.1 field goals per game, fewer than 2.3 three-pointers per game, and had a rebound percentage below 6.4 ends up at the bottom left leaf. Their probability of making the Top 50 is 1%, so this leaf has a purity of 0.99. This suggests that the classification tree is very accurate at identifying players who will not make the Top 50.

Many variables appear in this classification tree, but which were the most important in splitting players? To determine this, I created the barplot below, which shows the variable importance value for each variable.

As we can see from the barplot above, Average field goals per game (fg) was the most important predictor variable, with an importance value of 16, which is 37% of the total improvement in the model. Average three-pointers made per game (thr) had an importance value of 6, or 14% of the total. Average turnovers per game (tov) had an importance value of 5, or 12% of the total.

Although we have identified the important variables involved in predicting if a player made the Top 50, we do not yet know if the Classification Tree is an accurate model for predicting future players' chances of making the Top 50 list. Below is a confusion matrix detailing the model’s accuracy on the testing dataset.

      Predicted
Actual  N  Y
     N 91  8
     Y  9 22

Observing the confusion matrix, we can see that the overall accuracy of the model was 87% ((22+91)/130 = 87%).

From the matrix, we can see that when the model predicted a player to be in the Top 50, it was right 22/30 times (73%).

When the model predicted a player would not make the Top 50, it was right 91/100 times (91%).

Where a player did make the Top 50, the model correctly identified 22 of the 31 players (71%).

Where a player did not make the top 50, the model correctly identified 91 of the 99 players (92%).
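The five figures quoted above can all be derived from the four cells of the confusion matrix. The following is a minimal sketch in Python rather than the code used for the original analysis; the function name confusion_metrics and the metric labels are illustrative assumptions.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Return summary metrics for a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),    # how often a predicted "Y" was right
        "npv": tn / (tn + fn),          # how often a predicted "N" was right
        "recall": tp / (tp + fn),       # share of actual "Y" players found
        "specificity": tn / (tn + fp),  # share of actual "N" players found
    }


# Classification tree on the testing dataset (rows = actual, columns = predicted).
tree_test = confusion_metrics(tn=91, fp=8, fn=9, tp=22)
for name, value in tree_test.items():
    print(f"{name}: {value:.1%}")
```

The same function can be applied to any of the other confusion matrices in this report by substituting the four cell counts.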

Overall, the model has a high level of accuracy, especially when predicting that a player would not make the Top 50. These results, however, are from the testing dataset. Would the model also be accurate when applied to the training dataset? The accuracy of the model on the training dataset is shown below in the same format.

      Predicted
Actual  N  Y
     N 48  4
     Y  3 11

We can see that the model had an 89% overall accuracy when applied to the training dataset ((48+11)/66 = 89%).

When the Classification Tree model predicted a player to be in top 50, it was right 11/15 times (73%).

When the model predicted that a player would not be in the top 50, it was right 48/51 times (94%).

Where a player did make the Top 50, the model correctly predicted 11 of the 14 players (79%).

Where a player did not make the Top 50, the model correctly predicted 48 of the 52 players (92%).

2.1.1 Summary

The Classification Tree model had a high level of accuracy when predicting whether or not a player would make the Top 50 list. The model was especially strong at identifying those who would not make the list. Accuracy on the training data was only slightly higher than on the testing data (89% vs 87%). This small gap suggests that the tree is not substantially overfitting, and would be a reasonably reliable model to use for future predictive analysis.

2.2 Binary Logistic Regression

The binary logistic regression model is part of a family of statistical models called generalised linear models. The main characteristic that differentiates binary logistic regression from other generalised linear models is the type of dependent (or outcome) variable (Harris 2020). A dependent variable in a binary logistic regression has two levels. In this study for example, the outcomes of a player being selected in the Top 50 list are Yes or No. Below is the summary of a Binary Logistic Regression model that was applied to the testing data.


=============================================
                      Dependent variable:    
                  ---------------------------
                            top_50           
---------------------------------------------
fg                           1.115           
                           p = 0.018         
                                             
fgp                          8.057           
                           p = 0.829         
                                             
thr                          1.856           
                           p = 0.179         
                                             
thrp                        -3.974           
                           p = 0.464         
                                             
efg                         13.352           
                           p = 0.720         
                                             
trb                          0.116           
                           p = 0.507         
                                             
ast                          0.356           
                           p = 0.272         
                                             
stl                          1.227           
                           p = 0.163         
                                             
blk                          1.437           
                           p = 0.129         
                                             
tov                         -1.297           
                           p = 0.180         
                                             
pf                           0.219           
                           p = 0.799         
                                             
Constant                    -22.150          
                           p = 0.002         
                                             
---------------------------------------------
Observations                  130            
Log Likelihood              -33.260          
Akaike Inf. Crit.           90.520           
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01

After determining the coefficients of each variable, we can now create a regression equation. The regression equation allows us to predict the probability of a new future case falling into one of the outcomes (Yes or No):

ln(π/(1-π)) = -22.150 + 1.115(fg) + 8.057(fgp) + 1.856(thr) - 3.974(thrp) + 13.352(efg) + 0.116(trb) + 0.356(ast) + 1.227(stl) + 1.437(blk) - 1.297(tov) + 0.219(pf)

The Binary Logistic Regression also shows whether or not a variable is significant in determining if a player makes the Top 50. We can assess this by observing the p value of each variable. Average field goals per game (fg) is identified as the most important predictor variable in determining if a player reaches the NBA Top 50. As we can see from the model, fg has a p value of 0.0178. This means that we reject H0 (p < 0.05), and confirm that fg is a significant variable for classifying a player as being in the Top 50 list.
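To show how the regression equation turns a player's statistics into a probability, here is a minimal Python sketch. The coefficients are copied from the model summary table above. The player values are invented for illustration and are not taken from the dataset, and the percentage variables are assumed to be recorded as proportions between 0 and 1.

```python
import math

# Coefficients copied from the model summary table.
COEFFICIENTS = {
    "fg": 1.115, "fgp": 8.057, "thr": 1.856, "thrp": -3.974,
    "efg": 13.352, "trb": 0.116, "ast": 0.356, "stl": 1.227,
    "blk": 1.437, "tov": -1.297, "pf": 0.219,
}
INTERCEPT = -22.150


def top50_probability(player):
    """Apply the inverse logit to the linear predictor for one player."""
    log_odds = INTERCEPT + sum(
        beta * player[name] for name, beta in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical player: every value here is an assumption for illustration.
example_player = {
    "fg": 8.0, "fgp": 0.48, "thr": 2.0, "thrp": 0.37, "efg": 0.54,
    "trb": 10.0, "ast": 5.0, "stl": 1.2, "blk": 0.6, "tov": 2.5, "pf": 2.2,
}
print(f"Estimated probability: {top50_probability(example_player):.3f}")
```

A player would typically be classified as "Y" when this probability exceeds a chosen threshold such as 0.5, although the threshold used in the original analysis is not stated.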

Although neither variable is significant in classifying a player as being in the Top 50 list, it is interesting to note that both Three-pointers percentage (thrp) and Average turnovers per game (tov) had a negative effect on the outcome. This means that as these numbers increased, the chances of a player making the Top 50 decreased. But by how much did fg improve a player's chances of making the Top 50, and by how much did thrp and tov worsen them?

Taking the exponential (exp) of each coefficient gives the factor by which a player's odds of making the Top 50 change when there is a 1-unit increase in that variable, holding the other variables constant. If a player increases their Average field goals per game (fg) by 1, their odds of making the Top 50 are multiplied by 3.05, which is an increase of 205% ((3.05 − 1) × 100 = +205%). This shows that increasing field goals can have a large impact on a player's chances of making the Top 50.

Interestingly, a 1-unit increase in Three-pointers percentage (thrp) reduces a player's odds of reaching the Top 50 by 98.1% ((0.019 − 1) × 100 = −98.1%). The size of the coefficients on the percentage variables suggests that they are recorded as proportions, in which case a 1-unit increase spans the whole range from 0 to 1, so this figure should not be read as the effect of a one-percentage-point change. Similarly, an increase of 1 in Average turnovers per game (tov) reduces the odds of a player reaching the Top 50 by 73% ((0.27 − 1) × 100 = −73%). It is important to note again that these variables were not significant in classifying a player as making the Top 50 or not; they are simply an interesting observation.
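The odds calculations for fg, thrp, and tov can be reproduced directly from the coefficients in the model summary. This is a small illustrative sketch; the helper name odds_change is an assumption and not part of the original analysis.

```python
import math

# Coefficients for the three variables discussed, from the model summary.
coefficients = {"fg": 1.115, "thrp": -3.974, "tov": -1.297}


def odds_change(beta):
    """Return the odds ratio and % change in odds for a 1-unit increase."""
    odds_ratio = math.exp(beta)
    return odds_ratio, (odds_ratio - 1) * 100


for name, beta in coefficients.items():
    ratio, pct = odds_change(beta)
    print(f"{name}: odds ratio {ratio:.3f}, change in odds {pct:+.1f}%")
```

Note that these are changes in the odds, π/(1-π), rather than in the probability itself.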

As was the case with the Classification Tree, we have identified the important variables involved in predicting if a player made the Top 50, but we do not yet know if the Binary Logistic Regression is an accurate technique for predicting future Top 50 players. Below is a confusion matrix detailing the model's accuracy on the testing dataset.

      Predicted
Actual  N  Y
     N 94  5
     Y  8 23

Observing the confusion matrix, we can see that the overall accuracy of the model was 90% ((94+23)/130 = 90%).

From the matrix, we can see that when the model predicted a player to be in the Top 50, it was right 23/28 times (82%).

When the model predicted a player would not make the Top 50, it was right 94/102 times (92%).

Where a player did make the top 50, the model correctly identified 23 of the 31 players (74%).

Where a player did not make the top 50, the model correctly identified 94 players out of 99 (95%).

Similar to the Classification Tree, this model produces high accuracy when applied to the testing dataset. Will it have the same accuracy on the training dataset?

      Predicted
Actual  N  Y
     N 52  0
     Y  4 10

We can see that the Binary Logistic Regression model had a 94% overall accuracy when applied to the training dataset ((10+52)/66 = 94%).

When the model predicted a player to be in top 50, it was right 10/10 times (100%).

When the model predicted that a player would not be in the top 50, it was right 52/56 times (92.9%).

Where a player did make the Top 50, the model correctly predicted 10 of the 14 players (71%).

Where a player did not make the Top 50, the model correctly predicted 52 of the 52 players (100%).

2.2.1 Summary

The Binary Logistic Regression model was highly accurate when used to classify whether players would make the NBA Top 50 list or not. Its accuracy was 90% when applied to the testing data and 94% when applied to the training data, suggesting it has strong predictive capabilities. It is worth noting that it identified only 74% of the players who did make the Top 50 in the testing data, and 71% in the training data, meaning it has only moderate recall.

2.3 Comparison of Classification Techniques

Both Classification techniques showed a high level of accuracy. The classification tree achieved an accuracy of 87% on the testing data and 89% on the training data, while the binary logistic regression achieved 90% and 94% respectively. This would suggest that the binary logistic regression model is a more accurate classification model for this dataset. As previously mentioned, both models were more accurate at predicting the negative outcome, when a player did not make the Top 50. Because far more players missed the Top 50 than made it, the high overall accuracy of both models is driven mainly by correct predictions of the negative outcome.
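One rough way to put these accuracy figures in context, which was not part of the original analysis, is to compare them with a rule that always predicts the majority class ("not in the Top 50"). The sketch below uses the class counts from the row totals of the confusion matrices; the function name majority_baseline is an illustrative assumption.

```python
def majority_baseline(n_negative, n_positive):
    """Accuracy obtained by always predicting the larger class."""
    return max(n_negative, n_positive) / (n_negative + n_positive)


# Row totals of the confusion matrices: 99 N and 31 Y, then 52 N and 14 Y.
baseline_130 = majority_baseline(n_negative=99, n_positive=31)
baseline_66 = majority_baseline(n_negative=52, n_positive=14)
print(f"130-player set: {baseline_130:.1%}")
print(f"66-player set: {baseline_66:.1%}")
```

Both models beat this baseline, but the comparison illustrates why precision and recall for the positive class are more informative than overall accuracy on an imbalanced dataset.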

According to the classification tree, Average field goals per game (37%), Average three-pointers made per game (14%), and Average turnovers per game (12%) were the most important predictor variables. This is evident in the tree, where Average field goals per game (fg) and Average three-pointers made per game (thr) appear as the two highest nodes. In the binary logistic regression, fg is the only variable that is significant in predicting a player's chances of reaching the Top 50. In that model, thr and tov were not found to be significant and, interestingly, tov was found to have a negative effect on the outcome. It can therefore be determined that Average field goals per game (fg) was the most important predictor variable for both models.

3 Clustering

A different dataset will be used for comparing our two Clustering techniques. The dataset contains data on 1000 soccer players based on information from FIFA, and includes their age, nationality, overall rating, club, value, and wage, as well as 34 different attributes. For the purpose of this report we will be clustering players based on the following attributes:

acceleration
ball_control
dribbling
shot_power
short_passing
sprint_speed

The two Clustering techniques that will be compared are Hierarchical Clustering, and Kmeans Clustering. Note that the data does not need to be scaled for either Clustering technique, as all variables are scored out of 100.

3.1 Hierarchical Clustering

Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters, in which clusters are merged or split step by step based on a chosen similarity measure (Oti et al., 2024). To carry out a Hierarchical clustering, the Euclidean distance between every pair of players must first be calculated. Once the Hierarchical clustering has been carried out, it is important to be able to visualise any clusters that might exist. Below is a dendrogram and a heatmap that I created to determine if there is any evidence of clustering among the players.
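The distance calculation that hierarchical clustering starts from can be sketched as follows. This is an illustration in Python, not the code used for the analysis, and the four players and their ratings are invented. In an agglomerative approach, the pair of players with the smallest distance is the first to be merged.

```python
import math
from itertools import combinations

ATTRIBUTES = ["acceleration", "ball_control", "dribbling",
              "shot_power", "short_passing", "sprint_speed"]

# Hypothetical ratings (invented for illustration), in ATTRIBUTES order.
players = {
    "A": [85, 84, 86, 78, 80, 87],
    "B": [83, 82, 85, 75, 79, 85],
    "C": [48, 25, 20, 30, 32, 50],
    "D": [58, 70, 59, 72, 71, 57],
}


def euclidean(u, v):
    """Euclidean distance between two equal-length rating vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Distance between every pair of players.
distances = {
    (p, q): euclidean(players[p], players[q])
    for p, q in combinations(players, 2)
}

# The closest pair is the first merge in an agglomerative clustering.
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 2))
```

How later merges are chosen depends on the linkage method, which is not stated in this report.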

If we observe the heatmap, we can see that the most obvious example of clustering is in the top left corner, where there is a block of light colours. These light colours would suggest that there are small distances between these groups of players, thus they are similar players. Observing the rest of the heatmap, there does not seem to be any other evidence of clusters along the diagonal.

A 4-cluster solution was identified as creating the best quality of clusters. Below is a summary of the Silhouette Scores for the dataset after a 4-cluster solution was applied.

Silhouette Summary
Cluster	Silhouette_Score
1	0.306
2	0.288
3	0.695
4	0.082
Overall	0.298

Looking at the mean value across the whole dataset, the cluster analysis has a silhouette score of 0.298. This indicates a weak structure, with no strong evidence of clustering. Looking at individual clusters, Cluster 3 has the highest silhouette score at 0.70, suggesting a reasonable to strong clustering structure. Clusters 1 and 2 have scores of 0.31 and 0.29 respectively, which suggests a weak clustering structure among these groups of players. Cluster 4 has a score of 0.08, which means no substantial clustering structure was found.

The interesting part of clustering is that we can determine the characteristics of each cluster, i.e. the attributes that define a player as belonging to a specific cluster. Below is a line graph and a radar chart detailing the attributes of each cluster.
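For readers unfamiliar with the silhouette score, the sketch below shows how it is computed for each point: a is the mean distance to the other points in the same cluster, b is the smallest mean distance to the points of any other cluster, and the score is (b - a) / max(a, b). The toy points and labels are invented for illustration, and the code assumes at least two clusters.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_scores(points, labels):
    """Return the silhouette width of every point (needs >= 2 clusters)."""
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # Mean distance from p to each cluster, excluding p itself.
        mean_dist = {}
        for label in set(labels):
            dists = [euclidean(p, q)
                     for j, (q, lab) in enumerate(zip(points, labels))
                     if lab == label and j != i]
            if dists:
                mean_dist[label] = sum(dists) / len(dists)
        if own_label not in mean_dist:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        a = mean_dist[own_label]
        b = min(d for label, d in mean_dist.items() if label != own_label)
        scores.append((b - a) / max(a, b))
    return scores


# Toy data: two tight, well-separated groups (hypothetical values).
toy_points = [(0, 0), (0, 1), (5, 0), (5, 1)]
toy_labels = [0, 0, 1, 1]
toy_scores = silhouette_scores(toy_points, toy_labels)
mean_score = sum(toy_scores) / len(toy_scores)
print(f"Mean silhouette score: {mean_score:.3f}")
```

Scores close to 1 indicate points that sit well inside their own cluster, while scores near 0 indicate points lying between clusters.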

As we can see from the above charts, the players in Cluster 1 are the highest performing players in all but one attribute, short passing. They score 80 and above in acceleration, sprint speed, dribbling, and ball control, outperforming the other clusters in these attributes. The players in Cluster 2 are above average in all attributes, but score much lower than Cluster 1 in acceleration and sprint speed. They score between 70 and 80 for dribbling, ball control, short passing, and shot power, making them the cluster with the second best mean attributes. Cluster 3 contains the players with the lowest mean attributes of all clusters. These players scored around 50 or just below for acceleration and sprint speed, but were rated much lower for dribbling, ball control, short passing, and shot power. Cluster 4 players were above average for all attributes. Their mean scores for acceleration, sprint speed, and dribbling were below 60, but they were stronger in ball control, short passing, and shot power.

It is clear to see that there is a difference in hierarchy between the clusters. Is this affected by any other factors? I was interested to compare each cluster by their age, wage, and value, to see if these were a reflection of how the clusters performed.

Comparison of Clusters
Cluster	Age	Value (€)	Wage (€)
1	26.31	21355081	79225.61
2	28.07	16098446	65772.02
3	29.10	14350935	51766.36
4	28.03	13387500	58259.62

Observing the table above, we can see that Cluster 1 contains the players that are valued the highest and have the highest wages. This is unsurprising, as we discovered earlier that they are the players with the highest average attributes. The same is true for the players in Cluster 2, who had the second highest average attributes, and this is reflected in their wage and value. The players in Cluster 3 had the lowest average attributes by a considerable amount. Interestingly, they are valued higher than Cluster 4. This might suggest that these players play in positions where the chosen attributes are not as important, for example goalkeeper or central defender.

3.1.1 Summary

The Hierarchical clustering produced a weak structure of clustering overall. When the players were assessed in a 4-cluster solution, there was a moderate to strong structure of clustering found in Cluster 3. The other individual Clusters consisted of weak structures, or no structures.

3.2 Kmeans Clustering

The second clustering technique that we will be applying is Kmeans clustering. The objective of Kmeans clustering is to minimise the sum of squared distances between data points and the centroid of their assigned cluster. Each data point is assigned to the cluster whose centroid is nearest. Because the Kmeans algorithm computes these distances as part of its calculations, I do not need to calculate the Euclidean distances myself beforehand. As with the Hierarchical clustering, we will be using the Silhouette Scores below to identify any evidence of clustering.
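The assign-then-update loop at the heart of Kmeans can be sketched in a few lines of Python. This is a simplified illustration (Lloyd's algorithm with fixed starting centroids), not the implementation used in the analysis, which would normally use random starts. The toy points and starting centroids are invented.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and updating centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy 2D data with two obvious groups (hypothetical values).
toy_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
toy_labels, toy_centroids = kmeans(toy_points, centroids=[(1, 1), (8, 8)])
print(toy_labels)
print(toy_centroids)
```

Because the result can depend on the starting centroids, practical implementations usually run the algorithm from several random starts and keep the solution with the lowest within-cluster sum of squares.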

Silhouette Summary
Cluster	Silhouette_Score
1	0.178
2	0.380
3	0.218
4	0.688
Overall	0.331

Observing the summary, the Mean Silhouette Score is 0.33, confirming that there is a weak clustering structure evident overall. If we observe the individual clusters, Cluster 4 has a Silhouette Score of 0.69, which suggests a moderate to strong clustering structure. There is a weak clustering structure evident in Cluster 2, with a score of 0.38, while Clusters 1 and 3 have no evidence of any clustering structure with scores of 0.18 and 0.22 respectively.

I was interested to see if the players in the clusters showed similar characteristics to the clusters identified in Hierarchical clustering, so I decided to create a similar line graph and radar chart.

If we observe the above charts, we can see that the players in Cluster 1 were above 50 for all attributes. Their mean scores for acceleration and dribbling were between 50 and 60, and their sprint speed, ball control, short passing, and shot power all had a mean value of 60 or higher. The players in Cluster 2 are the highest performing players in all but one attribute, short passing. They are above 80 in acceleration, sprint speed, dribbling, and ball control, outperforming the other clusters in these attributes. The players in Cluster 3 are above average in all attributes, but score much lower than Cluster 2 in acceleration and sprint speed. They score between 70 and 80 for dribbling, ball control, short passing, and shot power, making them the cluster with the second best attributes. Cluster 4 contains the players with the lowest mean attributes of all the clusters. These players were below 50 for all attributes. Their average scores were close to 50 for acceleration and sprint speed, but they were rated much lower for dribbling, ball control, short passing, and shot power.

I was once again interested to see if there was any difference between clusters in relation to their age, value, and wage, to determine if their attributes were a factor in these variables.

Comparison of Clusters
Cluster	Age	Value (€)	Wage (€)
1	27.73	13317241	58540.23
2	26.17	21964037	80707.66
3	28.12	15971626	65224.91
4	29.04	14475000	51971.70

Unsurprisingly, we can see that Cluster 2, which has the highest mean attributes overall, is valued the highest and has the highest wages. These players are also the youngest, with an average age of 26.17. There is a large drop-off in value and wage to Cluster 3, which is the closest cluster in terms of mean attributes. This drop-off might be viewed as unfair, as their mean attributes are close to those of Cluster 2 overall. Clusters 1 and 4 are quite balanced in terms of value and wage, which might feel unfair to Cluster 1 as they outperform Cluster 4 in every attribute. As mentioned earlier in the report, this might be due to the players in Cluster 4 requiring different attributes in their playing positions.

3.2.1 Summary

The Kmeans clustering technique discovered only one cluster of note. This was Cluster 4, where there was evidence of a moderate to strong clustering structure. Overall, there were weak clustering structures among these players for the chosen attributes.

3.3 Comparison of clustering techniques

The hierarchical clustering algorithm produced an overall mean silhouette score of 0.298, while the Kmeans clustering produced an overall score of 0.331. Kmeans therefore found slightly stronger evidence of clustering overall, although both scores indicate weak clustering structures. The highest individual cluster scores were very similar across the two techniques: 0.70 for hierarchical and 0.69 for Kmeans. Each technique therefore identified one cluster with a moderate to strong structure, but neither identified a strong clustering structure overall. The hierarchical clustering algorithm also found weak clustering structures in two of its clusters, while the Kmeans algorithm found a weak clustering structure in only one of its clusters. Despite the differing results of the two clustering techniques, it was interesting to observe the characteristics of the clusters through the graphs and charts. If we compare the line graphs and radar charts of both clustering techniques, it is clear that the clusters produced almost identical profiles across the two techniques.

4 Conclusion

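When analysing and evaluating data, it is important to use different techniques such as classification and clustering. In this report, the binary logistic regression was more accurate than the classification tree at predicting Top 50 players, while Kmeans clustering produced a slightly higher overall silhouette score than hierarchical clustering, although both clustering solutions were weak. With many different techniques and algorithms readily available for both classification and clustering, it is worth applying a variety of techniques if you would like to obtain the most reliable results for your data.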

5 References

Podgorelec, V., Kokol, P., Stiglic, B. and Rozman, I. (2002). Decision trees: An overview and their use in medicine. Journal of Medical Systems, 26, 445–463.
Harris, J.K. (2020). Statistics with R: Solving Problems Using Real-World Data. SAGE Publications.
Oti, E.U. and Olusola, M.O. (2024). Overview of Agglomerative Hierarchical Clustering Methods. British Journal of Computer, Networking and Information Technology, 7(2), 14–23.