Assignment 2

Assignment 2 - Sports Analytics & Insights - Kian Clancy

Part A - NBA Players

Question 2.A

Creating classification tree for classifying a player in the top 50 players in the NBA

      thr       trb       tov       ast       fgp       pos      thrp       blk 
11.569930  7.645455  6.808067  5.040502  3.312402  3.113392  3.029091  2.731120 
      efg       stl 
 1.964545  1.584301

[1] 0.8923077

[1] 0.8636364

Question 2.B

2.B(i)

A player is classified as top 50 if they average at least 2.3 three-pointers made per game and fewer than 1.7 turnovers per game. The node has high purity, with 92% of the players in it being in the top 50.

2.B(ii)

A player is classified as not top 50 if they average fewer than 2.3 three-pointers made per game and fewer than 7.3 rebounds per game. The node has high purity, with 93% of its players correctly classified as not being in the top 50.

2.B(iii)

Three-pointers made (thr) was the most important variable, with an importance value of 11.57, followed by rebounds (trb) at 7.65 and turnovers (tov) at 6.81. These are the top three variables for deciding whether a player is in the top 50.

Question 2.C

This has already been done above: training accuracy is 89.23% and testing accuracy is 86.36%.

2.C(i)

There are no signs of overfitting: training accuracy (89.23%) is only about 2.9 percentage points higher than testing accuracy (86.36%). This means the model performs well on both the training and testing datasets.

Question 3.A


Call:
glm(formula = top_50 ~ efg + fgp + thrp + ast + thr + trb + stl + 
    blk + tov + pos, family = binomial, data = nbatraining)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -13.98330    5.66266  -2.469  0.01353 *  
efg         -77.75734   29.50009  -2.636  0.00839 ** 
fgp          92.81359   28.42844   3.265  0.00110 ** 
thrp         -0.25498    4.51583  -0.056  0.95497    
ast           0.52938    0.36975   1.432  0.15223    
thr           4.90228    1.20359   4.073 4.64e-05 ***
trb           0.27080    0.23333   1.161  0.24580    
stl           1.73245    1.07118   1.617  0.10581    
blk           1.27239    0.88599   1.436  0.15096    
tov          -1.13680    0.87360  -1.301  0.19317    
posPF        -1.65320    1.20805  -1.368  0.17116    
posPG        -2.00017    1.97774  -1.011  0.31185    
posSF        -0.08615    1.64064  -0.053  0.95812    
posSG        -0.10205    1.69594  -0.060  0.95202    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.818  on 129  degrees of freedom
Residual deviance:  68.548  on 116  degrees of freedom
AIC: 96.548

Number of Fisher Scoring iterations: 7

3.A(ii)

\[ \log\left(\frac{p}{1-p}\right) = -13.98 - 77.76(\text{efg}) + 92.81(\text{fgp}) - 0.25(\text{thrp}) + 0.53(\text{ast}) + 4.90(\text{thr}) + 0.27(\text{trb}) + 1.73(\text{stl}) + 1.27(\text{blk}) - 1.14(\text{tov}) - 1.65(\text{posPF}) - 2.00(\text{posPG}) - 0.09(\text{posSF}) - 0.10(\text{posSG}) \]
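The fitted equation gives the log-odds, so it has to be inverted to obtain a predicted probability. Writing \(\eta\) for the whole right-hand side (a symbol introduced here only for convenience), the standard inverse-logit transformation is:

\[ p = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} \]

For example, \(\eta = 0\) gives \(p = 0.5\), positive values of \(\eta\) give \(p > 0.5\), and negative values give \(p < 0.5\). The report does not state the cut-off used to turn probabilities into Y/N predictions; 0.5 is a common default and is only an assumption here.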

3.A(iii)

The important predictor variables are three-pointers made (thr), field goal percentage (fgp) and effective field goal percentage (efg), as their p-values are all below 0.05: thr p < 0.001, fgp p = 0.0011, efg p = 0.0084. These play a key role in deciding whether a player is in the top 50.

3.A(iv)

thr has a coefficient of 4.90228, fgp has a coefficient of 92.81359, and efg has a coefficient of -77.75734.

exp(4.90228)

[1] 134.5963

exp(92.81359)

[1] 2.03437e+40

exp(-77.75734)

[1] 1.699872e-34

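thr has an odds ratio of 134.6: each additional three-pointer made per game multiplies the odds of being in the top 50 by about 134.6, holding the other variables in the model constant. This is an odds ratio, not a 134.6% chance, and it is estimated with many other metrics in the model. fgp is similar, with an extremely large odds ratio (2.03e+40), which indicates a positive relationship with being in the top 50. In contrast, the odds ratio for efg is almost 0 (1.70e-34), which indicates a negative relationship. The fgp and efg odds ratios are this extreme because they describe a one-unit change, which would span the entire range of the variable if fgp and efg are recorded as proportions between 0 and 1, as the size of the coefficients suggests.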
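Odds ratios for a full one-unit change can be hard to read when a predictor only varies over a small range. The sketch below recomputes the three odds ratios from the coefficients in the summary and also rescales fgp and efg to a 0.01 change. It assumes that fgp and efg are stored as proportions between 0 and 1, which the report does not confirm; the object names are illustrative only.

```r
# Coefficients copied from the glm summary in Question 3.A
coefs <- c(thr = 4.90228, fgp = 92.81359, efg = -77.75734)

# Odds ratio for a one-unit increase in each predictor
odds_ratio_unit <- exp(coefs)

# Odds ratio for a 0.01 increase in fgp and efg.
# ASSUMPTION: fgp and efg are proportions (0 to 1), so 0.01 is one
# percentage point. This is not confirmed by the report.
odds_ratio_point <- exp(coefs[c("fgp", "efg")] * 0.01)

print(odds_ratio_unit)
print(odds_ratio_point)
```

Under that assumption, the rescaled values describe the change in odds for one extra percentage point of shooting accuracy, which is a more realistic step than a full unit.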

Question 3.B

      Predicted
Actual  N  Y
     N 94  5
     Y  9 22

[1] 0.8923077

      Predicted
Actual  N  Y
     N 50  2
     Y  5  9

[1] 0.8939394

The model does very well, with no major difference between the training and testing results. There were very few misclassifications: 14 of 130 players in the training dataset and 7 of 66 in the testing dataset. The model achieved 89.23% accuracy on the training dataset and 89.39% on the testing dataset.
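The accuracy figures can be reproduced directly from the two confusion matrices. The sketch below re-enters the counts by hand and uses two small helper functions (accuracy and sensitivity are names chosen here, not functions from the original analysis). Sensitivity for the top 50 class is included as an extra check because that class is much smaller than the other.

```r
# Confusion matrices re-entered from the output above (rows = actual, cols = predicted)
labels <- list(Actual = c("N", "Y"), Predicted = c("N", "Y"))
train_cm <- matrix(c(94, 5, 9, 22), nrow = 2, byrow = TRUE, dimnames = labels)
test_cm  <- matrix(c(50, 2, 5, 9),  nrow = 2, byrow = TRUE, dimnames = labels)

# Accuracy: correctly classified players divided by all players
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Sensitivity: share of actual top 50 players that were predicted as top 50
sensitivity <- function(cm) cm["Y", "Y"] / sum(cm["Y", ])

print(c(train = accuracy(train_cm), test = accuracy(test_cm)))
print(c(train = sensitivity(train_cm), test = sensitivity(test_cm)))
```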

Question 4.A

Comparing the classification tree and the binary logistic regression model on accuracy gave interesting results. The classification tree achieved 89.23% on the training dataset and 86.36% on the testing dataset. The binary logistic regression model matched it on the training dataset (89.23%) and performed slightly better on the testing dataset (89.39%).

Question 4.B

The classification tree identified three-pointers made, rebounds and turnovers as the key variables for predicting the top 50. The logistic regression model identified three-pointers made, field goal percentage and effective field goal percentage as the significant variables for predicting the top 50.

Part B - Soccer Players

Part B Question 2

Yes, the data needs to be scaled before clustering. This ensures that all of the variables contribute equally to the distance calculations in the clustering process and that none of them dominates.
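A minimal sketch of the scaling step, using base R's scale() on made-up values. The three rows below are hypothetical and are not taken from the assignment data; only the column names follow the clustering variables.

```r
# Hypothetical attribute values for three players (NOT from the assignment data)
players <- data.frame(
  acceleration = c(80, 45, 65),
  ball_control = c(82, 24, 68),
  sprint_speed = c(78, 49, 66)
)

# scale() centres each column to mean 0 and rescales it to standard deviation 1
scaled <- scale(players)

print(round(colMeans(scaled), 10))  # column means after scaling
print(apply(scaled, 2, sd))         # column standard deviations after scaling
```

After scaling, every column is measured in standard deviations from its own mean, so a variable with a wider raw spread no longer has more influence on the Euclidean distances.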

Part B Question 3.A

Part B Question 3.B


Call:
hclust(d = dist_matrix, method = "complete")

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 1000
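The call above can be illustrated on toy data. The sketch below follows the same steps (scale, Euclidean distance, complete linkage, then cutting the tree into groups). The data is simulated and the object names other than dist_matrix are assumptions; it is not the code that produced the output above.

```r
set.seed(42)

# Simulated data: two groups of 20 points each (NOT the soccer data)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2)
)

# Same pipeline as the call above: scale, Euclidean distances, complete linkage
toy_scaled  <- scale(toy)
dist_matrix <- dist(toy_scaled, method = "euclidean")
hc          <- hclust(dist_matrix, method = "complete")

# Cut the dendrogram into k groups and count the members of each
clusters <- cutree(hc, k = 2)
print(table(clusters))
```

Calling cutree() with k = 4 on the real tree would give four groups, which is the number of clusters reported in Question 3.D.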

Part B Question 3.C

3.C(i)

Yes, the heatmap does show clustering, as groups of players can be seen through blocks of similar colours across the six variables. Some groups of players have higher values (red) and other groups have lower values (blue). Lighter blue shows the weaker players, red shows the stronger all-round players, and a mix of dark blue and red shows players with a mix of abilities.

Part B Question 3.D


  1   2   3   4 
641 107 160  92

# A tibble: 4 × 7
  cluster acceleration ball_control dribbling shot_power short_passing
  <fct>          <dbl>        <dbl>     <dbl>      <dbl>         <dbl>
1 1               77.6         80.7      79.2       77.1          78.4
2 2               48.5         23.7      16.1       25.1          33.0
3 3               64.1         68.1      59.5       66.2          70.8
4 4               46.0         68.3      56.2       64.5          72.8
# ℹ 1 more variable: sprint_speed <dbl>

Cluster 1 - Players who are strong across the board, with the highest average on every attribute compared to the other clusters.

Cluster 2 - Players who are technically poor on the ball and are low across the board.

Cluster 3 - Players who are balanced across all metrics, with little difference between any of their stats.

Cluster 4 - Players who are slow in acceleration but have decent technical attributes across the board.

Part B Question 3.E (i)

Cluster Profiles Based on Player Attributes
Cluster	Acceleration	Ball Control	Dribbling	Shot Power	Short Passing	Sprint Speed
1	77.6	80.7	79.2	77.1	78.4	77.0
2	48.5	23.7	16.1	25.1	33.0	49.2
3	64.1	68.1	59.5	66.2	70.8	68.0
4	46.0	68.3	56.2	64.5	72.8	48.1

Part B Question 3.E (ii)

Cluster Comparison: Age, Club Value and Wage
cluster	Age	Club Value (€)	Wage (€)
1	26.7	20,385,959	77,239
2	29.1	14,350,935	51,766
3	27.1	14,617,500	61,725
4	29.8	10,783,696	47,880

Cluster 1 has the youngest average age, the highest club value and the highest wage.

Cluster 2 has the second highest average age and ranks third in both club value and wage.

Cluster 3 has the second youngest average age, along with the second highest club value and the second highest wage.

Cluster 4 has the oldest average age, along with the lowest club value and wage.

This shows that the youngest cluster also has the highest club value and wage, while the oldest cluster has the lowest.

Part B Question 4 (A)


  1   2   3   4 
127 381 384 108

Part B Question 4 (B)

[1] 0.3307156

The score of 0.3307 indicates that there is moderate clustering structure in the K-means solution. Some overlap exists between the clusters, but they are still reasonably distinct.
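For reference, the average silhouette width can be computed in base R without extra packages. The sketch below defines a hypothetical helper, mean_silhouette(), and applies it to simulated data; it is not the code that produced the 0.3307 above. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster, and the silhouette width is (b - a) / max(a, b).

```r
# Hypothetical helper: average silhouette width for data x and cluster labels cl
mean_silhouette <- function(x, cl) {
  stopifnot(length(unique(cl)) >= 2)  # needs at least two clusters
  d <- as.matrix(dist(x))             # pairwise Euclidean distances
  n <- nrow(d)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- cl == cl[i]
    own[i] <- FALSE                   # exclude the point itself
    if (!any(own)) next               # singleton cluster: width stays 0
    a <- mean(d[i, own])              # mean distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

set.seed(1)
# Simulated data: two tight, well-separated groups (NOT the soccer data)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)
km <- kmeans(toy, centers = 2, nstart = 10)
score <- mean_silhouette(toy, km$cluster)
print(score)
```

Values close to 1 mean that points sit much nearer to their own cluster than to any other, values near 0 mean the clusters overlap, and negative values suggest points assigned to the wrong cluster.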

Part B Question 4 (C)

K-means Cluster Profiles: Player Attributes
Cluster	Acceleration	Ball Control	Dribbling	Shot Power	Short Passing	Sprint Speed
1	49.6	66.9	55.0	63.3	71.0	52.3
2	68.2	76.5	71.3	73.9	77.1	69.0
3	82.9	81.4	81.3	77.3	77.7	82.4
4	48.7	23.9	16.3	25.3	33.1	49.4

Cluster 3 has the best all-round averages across the attributes.

Cluster 2 also has decent attributes across the board, just not quite as good as cluster 3.

Cluster 1 has low pace (acceleration 49.6, sprint speed 52.3) but average values on the technical attributes.

Cluster 4 is the weakest of the four clusters, with poor technical abilities.

Part B Question 4 C (ii)

K-means Clusters: Age, Club Value and Wage
Cluster	Age	Club Value (€)	Wage (€)
1	29.1	10,957,480	49,449
2	27.4	16,885,564	69,008
3	26.1	22,302,865	81,154
4	29.1	14,301,389	51,806

Cluster 3 has the youngest players and the highest club value and wages, similar to cluster 1 in the hierarchical clustering.

Cluster 2 is the second youngest and ranks second in club value and wages.

Cluster 4 has the third highest club value and wage; its average age of 29.1 is tied with cluster 1 to one decimal place.

Cluster 1 is tied with cluster 4 for the oldest average age (29.1) and has the lowest club value and wage.

Part B Question 5.A

K-means clustering can be said to provide the higher-quality clusters in this analysis. The reasoning is that it comes with a silhouette score, whereas the hierarchical clustering was assessed only with a heatmap and a dendrogram. Reading a heatmap or dendrogram is subjective, while the silhouette score is a quantitative measure.

Part B Question 5.B

Hierarchical clustering and K-means clustering produced clusters with similar profiles. Each produced one very strong cluster, one strong cluster, one average cluster and one weak cluster. There were no major differences between the cluster profiles.