Classification & Clustering Assignment
Part A - Classifying NBA Players
Classification Tree Interpretation
One Rule For Predicting If A Player Is Classified As Being In The Top 50
If a player’s average field goals per game is less than 7.1, their average three-pointers made per game is less than 2.3, their total rebound percentage is greater than 6.4 and they average more than 0.95 blocks per game, the player is predicted to be in the Top 50. This leaf node has a purity of 56%, meaning 56% of players in this group are in the Top 50, so the rule is relatively weak.
One Rule For Predicting If A Player Is Not Classified As Being In The Top 50
If a player’s average field goals per game is less than 7.1, their average three-pointers made per game is less than 2.3 and their total rebound percentage is less than 6.4, the player is predicted not to be in the Top 50. This leaf node is highly pure: 99% of players in this group are not in the Top 50.
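The two rules above can be written as a small function. This is a minimal sketch in Python rather than the fitted tree itself: the function name `predict_top50` is hypothetical, only the two paths described in this report are covered, and how the tree treats values exactly on a split point is not stated here, so strict inequalities are assumed.

```python
from typing import Optional


def predict_top50(fg: float, thr: float, trb: float, blk: float) -> Optional[str]:
    """Apply the two tree paths described above; return None for any other path."""
    if fg < 7.1 and thr < 2.3:
        if trb < 6.4:
            return "N"  # leaf where 99% of players are outside the Top 50
        if trb > 6.4 and blk > 0.95:
            return "Y"  # leaf where 56% of players are in the Top 50
    return None  # path not described in this report


# Hypothetical player values, for illustration only
print(predict_top50(fg=6.0, thr=1.0, trb=9.5, blk=1.2))
print(predict_top50(fg=6.0, thr=1.0, trb=4.0, blk=0.3))
```

Returning `None` for every other path keeps the sketch honest about the parts of the tree that are not reported.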
Most Important Variables In Predicting If A Player Is/Is Not In The Top 50
The three most important variables for predicting if a player is inside or outside the Top 50 are FG (average field goals per game), THR (average three-pointers made per game) and TOV (average turnovers per game). FG has the highest variable importance value at 38. This is followed by THR at 14 and TOV at 13.
Accuracy of Classification Tree
Confusion Matrix: Actual vs Predicted Top 50 Status Training Data
Predicted
Actual N Y
N 91 8
Y 9 22
From the above confusion matrix, we know the following:
- The overall model accuracy is (91+22)/130 = 0.87 or 87%.
- Of all players the model predicted would not be in the Top 50, it got 91/100 or 91% correct.
- Of all players the model predicted would be in the Top 50, it got 22/30 or 73% correct.
- Of all players who were actually not in the Top 50, the model correctly identified 91/99 or 92%.
- Of all players who were actually in the Top 50, the model correctly identified 22/31 or 71%.
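The five figures in the list above all come from the same four cell counts. The sketch below, in Python, shows one way to derive them; the function name `confusion_metrics` and the metric labels are assumptions made for illustration, with "Y" (in the Top 50) treated as the positive class.

```python
def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Summary rates from a 2x2 confusion matrix, with 'Y' as the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,     # all correct predictions
        "correct_when_predicted_N": tn / (tn + fn),
        "correct_when_predicted_Y": tp / (tp + fp),
        "actual_N_identified": tn / (tn + fp),
        "actual_Y_identified": tp / (tp + fn),
    }


# Classification tree, training data (rows = actual, columns = predicted)
train_tree = confusion_metrics(tn=91, fp=8, fn=9, tp=22)
for name, value in train_tree.items():
    print(f"{name}: {value:.2f}")
```

The same function can be reused for the other confusion matrices in this report by swapping in their cell counts.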
Confusion Matrix: Actual vs Predicted Top 50 Status Testing Data
Predicted
Actual N Y
N 48 4
Y 3 11
From the above confusion matrix, we know the following:
- The overall accuracy for the testing data set is (48+11)/66 = 0.89 or 89%.
- Of all players the model predicted would not be in the Top 50, it got 48/51 or 94% correct.
- Of all players the model predicted would be in the Top 50, it got 11/15 or 73% correct.
- Of all players who were actually not in the Top 50, the model correctly identified 48/52 or 92%.
- Of all players who were actually in the Top 50, the model correctly identified 11/14 or 79%.
Is The Classification Tree Overfitting The Training Dataset?
No, the classification tree does not seem to be overfitting the training data set. The training accuracy is 87%, while the testing accuracy is slightly higher at 89%. Because performance does not drop on unseen data, the model appears to generalise well.
Binary Logistic Regression
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.1497205 6.9686447 -3.1784833 0.001480478
FG 1.1145993 0.4702424 2.3702655 0.017775316
FGP 8.0568953 37.1401756 0.2169321 0.828261282
THR 1.8564663 1.3790134 1.3462279 0.178229024
THRP -3.9736297 5.4204592 -0.7330799 0.463509696
EFG 13.3516816 37.2100634 0.3588191 0.719730420
TRB 0.1160246 0.1745557 0.6646853 0.506251822
AST 0.3559453 0.3236127 1.0999114 0.271370720
STL 1.2273097 0.8778264 1.3981235 0.162075997
BLK 1.4372892 0.9467220 1.5181745 0.128970412
TOV -1.2970867 0.9666922 -1.3417783 0.179667888
PF 0.2190881 0.8587229 0.2551324 0.798620805
Regression Equation
ln(π/(1 - π)) = -22.150 + 1.115 × FG + 8.057 × FGP + 1.856 × THR - 3.974 × THRP + 13.352 × EFG + 0.116 × TRB + 0.356 × AST + 1.227 × STL + 1.437 × BLK - 1.297 × TOV + 0.219 × PF
where π is the probability that a player is in the Top 50.
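To show how the regression equation turns a player's statistics into a probability, here is a minimal Python sketch using the coefficients from the table above. The function name `top50_probability` and the example player are hypothetical, and the example assumes the percentage variables (FGP, THRP, EFG) are recorded as proportions, which this report does not state.

```python
import math

INTERCEPT = -22.1497205
COEFFICIENTS = {
    "FG": 1.1145993, "FGP": 8.0568953, "THR": 1.8564663, "THRP": -3.9736297,
    "EFG": 13.3516816, "TRB": 0.1160246, "AST": 0.3559453, "STL": 1.2273097,
    "BLK": 1.4372892, "TOV": -1.2970867, "PF": 0.2190881,
}


def top50_probability(player: dict) -> float:
    """Convert the linear predictor (log-odds) into a probability."""
    log_odds = INTERCEPT + sum(COEFFICIENTS[k] * player[k] for k in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical player, for illustration only (not taken from the dataset)
example = {"FG": 8.0, "FGP": 0.48, "THR": 2.0, "THRP": 0.37, "EFG": 0.54,
           "TRB": 8.0, "AST": 5.0, "STL": 1.2, "BLK": 0.6, "TOV": 2.5, "PF": 2.2}
print(f"Predicted probability of Top 50: {top50_probability(example):.3f}")
```

A player would then be classified as "Y" when the probability exceeds the chosen cut-off, which is commonly 0.5.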
Important/Not Important Variables In Predicting If A Player Is In The Top 50
H0: the coefficient for the predictor variable = 0, so the variable is not important in predicting if a player is/is not in the Top 50.
Ha: the coefficient for the predictor variable ≠ 0, so the variable is important in predicting if a player is/is not in the Top 50.
The most important predictor variable for classifying if a player is in the Top 50 is field goals per game (FG), which has a p-value of 0.018. This is less than the significance level of 0.05, so H0 is rejected and FG is considered statistically significant.
All other variables (FGP, THR, THRP, EFG, TRB, AST, STL, BLK, TOV and PF) have p-values greater than 0.05, which means they are not statistically significant predictors in this model.
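The significance check can be repeated mechanically from the p-values in the coefficient table. A short Python sketch, where the names `P_VALUES` and `ALPHA` are chosen for illustration and the intercept is left out because it is not a predictor:

```python
ALPHA = 0.05  # significance level used in this report

# Pr(>|z|) values copied from the coefficient table
P_VALUES = {
    "FG": 0.017775316, "FGP": 0.828261282, "THR": 0.178229024,
    "THRP": 0.463509696, "EFG": 0.719730420, "TRB": 0.506251822,
    "AST": 0.271370720, "STL": 0.162075997, "BLK": 0.128970412,
    "TOV": 0.179667888, "PF": 0.798620805,
}

significant = [name for name, p in P_VALUES.items() if p < ALPHA]
not_significant = [name for name, p in P_VALUES.items() if p >= ALPHA]

print("Significant at 0.05:", significant)
print("Not significant:", not_significant)
```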
Calculate & State The Impact That Variable Has On The Odds Of A Player Being In The Top 50
e^coefficient for FG = e^1.1145993 = 3.05
Holding the other variables constant, each additional field goal per game multiplies the odds of a player being in the Top 50 by 3.05.
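The odds calculation above can be reproduced in a few lines of Python. This is only a worked check of the arithmetic; the variable names are illustrative.

```python
import math

fg_coefficient = 1.1145993  # FG estimate from the coefficient table

odds_ratio = math.exp(fg_coefficient)    # multiplier on the odds per extra field goal
percent_change = (odds_ratio - 1) * 100  # the same effect as a percentage increase

print(f"Odds ratio for FG: {odds_ratio:.2f}")
print(f"Increase in odds per extra field goal: {percent_change:.0f}%")

# The effect compounds: two extra field goals multiply the odds by odds_ratio ** 2
print(f"Odds multiplier for two extra field goals: {odds_ratio ** 2:.2f}")
```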
Accuracy of the Binary Logistic Regression Model
Confusion Matrix: Actual vs Predicted Top 50 Status Training Data
Predicted
Actual N Y
N 94 5
Y 8 23
From the above confusion matrix, we know the following:
- The overall model accuracy is (94+23)/130 = 0.9 or 90%.
- Of all the players the model predicted would not be in the Top 50, it got 94/102 or 92% correct.
- Of all the players the model predicted would be in the Top 50, it got 23/28 or 82% correct.
- Of all players who were actually not in the Top 50, the model correctly identified 94/99 or 95%.
- Of all players who were actually in the Top 50, the model correctly identified 23/31 or 74%.
Confusion Matrix: Actual vs Predicted Top 50 Status Testing Data
Predicted
Actual N Y
N 52 0
Y 4 10
From the above confusion matrix, we know the following:
- The overall model accuracy is (52+10)/66 = 0.94 or 94%.
- Of all the players the model predicted would not be in the Top 50, it got 52/56 or 93% correct.
- Of all the players the model predicted would be in the Top 50, it got 10/10 or 100% correct.
- Of all players who were actually not in the Top 50, the model correctly identified 52/52 or 100%.
- Of all players who were actually in the Top 50, the model correctly identified 10/14 or 71%.
This model appears to work quite well for the testing dataset.
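Accuracy of Models
[1] 0.8692308 (classification tree, training data)
[1] 0.8939394 (classification tree, testing data)
[1] 0.9 (binary logistic regression, training data)
[1] 0.9393939 (binary logistic regression, testing data)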
Which Model Is More Accurate?
The binary logistic regression model is more accurate than the classification tree on both the training data (90% vs 87%) and the testing data (94% vs 89%). This suggests it provides a better overall fit and generalises better to unseen data.
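The four accuracy values can be recomputed directly from the confusion matrices reported earlier. The Python sketch below does this; the dictionary layout and the helper name `accuracy` are choices made for illustration.

```python
def accuracy(matrix) -> float:
    """Share of correct predictions; rows are actual (N, Y), columns are predicted (N, Y)."""
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)


# Confusion matrices copied from the sections above
MATRICES = {
    ("tree", "train"): ((91, 8), (9, 22)),
    ("tree", "test"): ((48, 4), (3, 11)),
    ("logistic", "train"): ((94, 5), (8, 23)),
    ("logistic", "test"): ((52, 0), (4, 10)),
}

results = {key: accuracy(matrix) for key, matrix in MATRICES.items()}
for (model, split), value in results.items():
    print(f"{model:<9}{split:<6}{value:.7f}")

# A large drop from training to testing accuracy would hint at overfitting
for model in ("tree", "logistic"):
    gap = results[(model, "train")] - results[(model, "test")]
    print(f"{model}: train minus test accuracy = {gap:+.3f}")
```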
How Do The Models Compare In Terms Of Variables That Are Considered Important Predictors Of The Target Variable?
The classification tree model considered FG (average field goals per game), THR (average three-pointers made per game) and TOV (average turnovers per game) as important predictors of the target variable (Top 50). The binary logistic regression model only considered FG as an important predictor of the target variable (Top 50).
Part B - Clustering Soccer Players
Hierarchical Clustering
Does The Data Need To Be Scaled?
No, the data does not need to be scaled because all variables are measured on the same scale of 1-100.
Does The Heatmap Provide Evidence Of Any Clustering Structure Within The Dataset?
If clusters are present in the data, we should see light coloured blocks around the diagonal. In the above heatmap, we can see an obvious clustering block in the top left corner. There is another block visible in the bottom right corner of the heatmap. This shows that there may be some clustering within the data.
Create A 4-Cluster Solution & Assess The Quality Of This Solution
Silhouette of 1000 units in 4 clusters from silhouette.default(x = clusters1, dist = fifa1) :
Cluster sizes and average silhouette widths:
492 193 107 208
0.30592133 0.28830631 0.69464229 0.08230691
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3653 0.1461 0.3160 0.2976 0.4576 0.7958
The overall cluster analysis has a mean silhouette score of 0.2976, which means the analysis has uncovered only weak structure:
- Cluster 4 has the weakest score of 0.08, meaning that this cluster is weak.
- Cluster 2 has a score of 0.29, which is also considered weak.
- Cluster 1 has a score of 0.31, which would also be considered weak.
- Cluster 3 has the strongest score of 0.69, meaning that this is the best-defined cluster.
How Do The Clusters Differ On Their Average Performance For The Following 6 Attributes?
# A tibble: 4 × 7
Cluster acceleration ball_control dribbling shot_power short_passing
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1 80.8 81.1 80.4 77.0 77.9
2 C2 65.2 79.0 73.9 77.2 79.4
3 C3 48.5 23.7 16.1 25.1 33.0
4 C4 57.0 66.3 55.8 63.3 70.3
# ℹ 1 more variable: sprint_speed <dbl>
Each of the clusters shows differences in average scores across the attributes. Cluster 1 has the highest values for acceleration, ball control and dribbling, indicating the strongest performance overall. Cluster 2 has above-average scores and is marginally the highest for shot power and short passing. Cluster 4 has below-average values, indicating weaker performances. Cluster 3 has the lowest values across all attributes shown, indicating the weakest players on these measures. Together the clusters form a gradient from high-performing to low-performing players.
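A quick way to check which cluster leads on each attribute, and how the clusters rank overall, is to work from the table of means directly. The Python sketch below uses only the five attributes printed above (sprint_speed is truncated in the output, so it is left out); the names `PROFILE`, `leaders` and `ranking` are illustrative.

```python
# Cluster means copied from the tibble above (sprint_speed is not shown there)
PROFILE = {
    "C1": {"acceleration": 80.8, "ball_control": 81.1, "dribbling": 80.4,
           "shot_power": 77.0, "short_passing": 77.9},
    "C2": {"acceleration": 65.2, "ball_control": 79.0, "dribbling": 73.9,
           "shot_power": 77.2, "short_passing": 79.4},
    "C3": {"acceleration": 48.5, "ball_control": 23.7, "dribbling": 16.1,
           "shot_power": 25.1, "short_passing": 33.0},
    "C4": {"acceleration": 57.0, "ball_control": 66.3, "dribbling": 55.8,
           "shot_power": 63.3, "short_passing": 70.3},
}

attributes = list(PROFILE["C1"])

# Cluster with the highest mean on each attribute
leaders = {a: max(PROFILE, key=lambda c: PROFILE[c][a]) for a in attributes}

# Clusters ordered from strongest to weakest by their average across attributes
overall = {c: sum(v.values()) / len(v) for c, v in PROFILE.items()}
ranking = sorted(overall, key=overall.get, reverse=True)

for attribute, cluster in leaders.items():
    print(f"{attribute}: highest in {cluster}")
print("Overall ranking:", ranking)
```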
How Do The Clusters Differ On Age, Club, Value And Wage?
# A tibble: 4 × 4
Cluster age value wage
<chr> <dbl> <dbl> <dbl>
1 C1 26.3 21355081. 79226.
2 C2 28.1 16098446. 65772.
3 C3 29.1 14350935. 51766.
4 C4 28.0 13387500 58260.
Cluster 1 contains the youngest players on average (26.3 years) and also has the highest average value and wage, indicating higher-quality players. Cluster 2 has slightly older players with moderately high value and wages. Clusters 3 and 4 have lower average values and wages: Cluster 3 has the oldest players (29.1 years) and the lowest average wage, while Cluster 4 has the lowest average value. This suggests a pattern where younger players tend to have higher market value and earnings than older players in this dataset.
Age
Each cluster has one main peak, with the values gradually increasing from C1 to C4. C1 has the lowest ages, with C4 recording the highest. Also, C3 and C4 have more spread-out values, which means these clusters are more varied in age.
Wage
Most values in this distribution are clustered very close to zero, with a few much larger values visible, particularly in C2 and C4. C3 appears to have the largest concentration close to zero, with C2 showing the largest spread of values overall.
Value
The boxplot shows C1 generally has the highest player values while C3 appears to have the lowest. C2 and C4 sit in between. All groups have a large number of outliers, meaning there are some very large values compared to the rest. This is particularly obvious in C1, where there are large values visible and a wider spread. C3 is the most tightly grouped with lower values overall.
Clubs
# A tibble: 12 × 3
# Groups: Cluster [4]
Cluster club n
<chr> <chr> <int>
1 C1 FC Barcelona 15
2 C1 FC Bayern Munich 14
3 C1 FC Porto 14
4 C2 Arsenal 6
5 C2 Sevilla FC 6
6 C2 Southampton 6
7 C3 RCD Espanyol 3
8 C3 Ajax 2
9 C3 Arsenal 2
10 C4 Manchester United 6
11 C4 Real Sociedad 6
12 C4 Juventus 5
The clusters differ in terms of club representation. Cluster 1 is dominated by top clubs: FC Barcelona (15 players), FC Bayern Munich (14) and FC Porto (14). Cluster 2 is spread more evenly, with Arsenal, Sevilla FC and Southampton each contributing 6 players. Cluster 4 is led by Manchester United (6), Real Sociedad (6) and Juventus (5). Finally, Cluster 3 has the fewest repeated clubs, with RCD Espanyol (3), Ajax (2) and Arsenal (2) appearing in small numbers, which suggests this cluster is more dispersed and not strongly associated with any particular club.
K-Means Clustering
Carry Out A K-means Clustering That Will Produce 4 Clusters and Assess The Quality of the Clustering Solution
Silhouette of 1000 units in 4 clusters from silhouette.default(x = kmeans1$cluster, dist = fifa1) :
Cluster sizes and average silhouette widths:
174 431 289 106
0.1777123 0.3802751 0.2177741 0.6883372
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1060 0.1807 0.3313 0.3307 0.4788 0.7898
The overall cluster analysis has a mean silhouette score of 0.3307, which means the analysis has uncovered only weak structure:
- Cluster 1 has the weakest score of 0.18, meaning that this cluster is weak.
- Cluster 3 and Cluster 2 have scores of 0.22 and 0.38 respectively, which are also considered weak.
- Cluster 4 has the strongest score of 0.69, which would be considered a reasonable clustering structure.
# A tibble: 4 × 7
Cluster acceleration ball_control dribbling shot_power short_passing
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1 58.2 64.5 53.3 60.3 68.7
2 C2 82.1 81.3 81.1 77.1 77.9
3 C3 65.0 78.6 73.6 76.9 79.0
4 C4 48.3 23.4 15.9 25.1 32.8
# ℹ 1 more variable: sprint_speed <dbl>
Each of the clusters shows clear differences in average scores across the attributes. Cluster 2 has the highest values on every attribute shown except short passing, indicating the strongest performance overall. Cluster 3 has above-average scores and the highest short passing. Cluster 1 has below-average values, indicating weaker performances. Cluster 4 has the lowest values across all attributes shown, indicating the weakest players on these measures. Together the clusters form a gradient from high-performing to low-performing players.
Age
Each cluster has one main peak. C2 appears to have the lowest ages, with C4 recording the highest. Also, C1, C3 and C4 seem to have more spread-out values than C2, which means these clusters are more varied in age.
Wage
Most values in this distribution are clustered very close to zero, with much larger values also visible, particularly in C2 and C3. C1 appears to have the largest cluster closest to zero, with C2 showing the largest spread of values overall.
Value
The boxplot shows C2 generally has the highest player values while C1 appears to have the lowest. C3 and C4 sit in between. All groups have a large number of outliers, meaning there are some very large values compared to the rest. This is particularly obvious in C2, where there are large values visible and a wider spread. C1 is the most tightly grouped with lower values overall.
Clubs
# A tibble: 12 × 3
# Groups: Cluster [4]
Cluster club n
<chr> <chr> <int>
1 C1 Juventus 5
2 C1 Everton 4
3 C1 Lazio 4
4 C2 FC Barcelona 13
5 C2 Atlético Madrid 11
6 C2 Borussia Dortmund 11
7 C3 Sevilla FC 8
8 C3 Arsenal 7
9 C3 Borussia Dortmund 7
10 C4 RCD Espanyol 3
11 C4 Ajax 2
12 C4 Arsenal 2
The clusters differ in terms of club representation. Cluster 1 is led by Juventus (5 players), Everton (4) and Lazio (4), with no single club contributing many players. Cluster 2 is dominated by top teams: FC Barcelona (13), Atlético Madrid (11) and Borussia Dortmund (11). Cluster 3 is led by Sevilla FC (8), Arsenal (7) and Borussia Dortmund (7). Finally, Cluster 4 has the fewest repeated clubs, with RCD Espanyol (3), Ajax (2) and Arsenal (2) appearing in small numbers, which suggests this cluster is more dispersed and not strongly associated with any particular club.
Which Algorithm Produced The Highest Quality Clusters?
K-means clustering produced slightly higher-quality clusters than hierarchical clustering. Its mean silhouette width is 0.33 compared with 0.30 for hierarchical clustering, and its weakest cluster scores 0.18 compared with 0.08. Its lowest individual silhouette width (-0.11) is also less negative than that of hierarchical clustering (-0.37), while the best-defined cluster scores 0.69 under both methods. Both mean values still indicate weak structure, so the two techniques produced reasonably similar results overall.
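One way to put the two solutions side by side is to recompute the mean silhouette width of each from the cluster sizes and per-cluster average widths printed in the two silhouette summaries. The Python sketch below does that; the helper name `weighted_mean` and the `SOLUTIONS` layout are chosen for illustration.

```python
def weighted_mean(sizes, widths) -> float:
    """Size-weighted average of the per-cluster silhouette widths."""
    return sum(n * w for n, w in zip(sizes, widths)) / sum(sizes)


# Cluster sizes and average silhouette widths copied from the two summaries
SOLUTIONS = {
    "hierarchical": ([492, 193, 107, 208],
                     [0.30592133, 0.28830631, 0.69464229, 0.08230691]),
    "kmeans": ([174, 431, 289, 106],
               [0.1777123, 0.3802751, 0.2177741, 0.6883372]),
}

summary = {
    name: {
        "mean": weighted_mean(sizes, widths),
        "weakest": min(widths),
        "strongest": max(widths),
    }
    for name, (sizes, widths) in SOLUTIONS.items()
}

for name, stats in summary.items():
    print(f"{name:<13} mean={stats['mean']:.4f} "
          f"weakest={stats['weakest']:.2f} strongest={stats['strongest']:.2f}")
```

The weighted means match the "Mean" values in the individual silhouette width summaries, which makes this a useful check that the right numbers are being compared.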
Did Both Algorithms Produce Clusters With A Similar Profile? Or Are There Any Noticeable Differences?
Hierarchical clustering and K-means clustering produced reasonably similar cluster profiles, with both identifying high-performing, moderate-performing and lower-performing players. However, there are noticeable differences in how players are assigned: the cluster sizes are 492, 193, 107 and 208 for hierarchical clustering but 174, 431, 289 and 106 for K-means. The overall pattern is therefore similar, but some players are grouped differently.
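Because the cluster labels are arbitrary, the same kind of group can carry a different number in each solution. A simple way to line them up is to match each hierarchical cluster to the K-means cluster with the closest attribute means. The Python sketch below does this with Euclidean distance over the five attributes printed in the two tables; the matching approach and the names used are assumptions for illustration, not part of the original analysis.

```python
import math

# Means for acceleration, ball_control, dribbling, shot_power, short_passing
HIERARCHICAL = {
    "C1": [80.8, 81.1, 80.4, 77.0, 77.9],
    "C2": [65.2, 79.0, 73.9, 77.2, 79.4],
    "C3": [48.5, 23.7, 16.1, 25.1, 33.0],
    "C4": [57.0, 66.3, 55.8, 63.3, 70.3],
}
KMEANS = {
    "C1": [58.2, 64.5, 53.3, 60.3, 68.7],
    "C2": [82.1, 81.3, 81.1, 77.1, 77.9],
    "C3": [65.0, 78.6, 73.6, 76.9, 79.0],
    "C4": [48.3, 23.4, 15.9, 25.1, 32.8],
}


def closest_match(profile, candidates) -> str:
    """Name of the candidate cluster whose means are nearest to the given profile."""
    return min(candidates, key=lambda name: math.dist(profile, candidates[name]))


matches = {h: closest_match(profile, KMEANS) for h, profile in HIERARCHICAL.items()}
for h_cluster, k_cluster in matches.items():
    print(f"Hierarchical {h_cluster} is closest to K-means {k_cluster}")
```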