Rows: 130 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): player, pos, top_50
dbl (12): fg, fgp, thr, thrp, efg, trb, ast, stl, blk, tov, pf, pts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test <-read_csv("nba_testing.csv")
Rows: 66 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): player, pos, top_50
dbl (12): fg, fgp, thr, thrp, efg, trb, ast, stl, blk, tov, pf, pts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Preparation
# =========================# 3. Data Preparation# =========================train <- train %>%mutate(top_50 =factor(top_50)) %>%select(-player, -pts)test <- test %>%mutate(top_50 =factor(top_50)) %>%select(-player, -pts)
One rule is that when fg is greater than or equal to 7.1 then it is a good predicator of a player making the top 50. This leaf node has a purity of 81%
2b ii
when trb is less than 6.4 then it is a good predicator of a player not making the top 50. This leaf node has a purity of 99%
The training set has an accuracy level of 86.92% and the test set has an accuracy of 89.39%. As they are two similar percentages this is a good model and it is not over fitting.
# =========================# Predictions - Tree# =========================train_tree_pred <-predict(tree_model, train, type ="class")test_tree_pred <-predict(tree_model, test, type ="class")confusionMatrix(train_tree_pred, train$top_50)
Confusion Matrix and Statistics
Reference
Prediction N Y
N 91 9
Y 8 22
Accuracy : 0.8692
95% CI : (0.7989, 0.9219)
No Information Rate : 0.7615
P-Value [Acc > NIR] : 0.001659
Kappa : 0.6359
Mcnemar's Test P-Value : 1.000000
Sensitivity : 0.9192
Specificity : 0.7097
Pos Pred Value : 0.9100
Neg Pred Value : 0.7333
Prevalence : 0.7615
Detection Rate : 0.7000
Detection Prevalence : 0.7692
Balanced Accuracy : 0.8144
'Positive' Class : N
confusionMatrix(test_tree_pred, test$top_50)
Confusion Matrix and Statistics
Reference
Prediction N Y
N 48 3
Y 4 11
Accuracy : 0.8939
95% CI : (0.7936, 0.9563)
No Information Rate : 0.7879
P-Value [Acc > NIR] : 0.01935
Kappa : 0.6908
Mcnemar's Test P-Value : 1.00000
Sensitivity : 0.9231
Specificity : 0.7857
Pos Pred Value : 0.9412
Neg Pred Value : 0.7333
Prevalence : 0.7879
Detection Rate : 0.7273
Detection Prevalence : 0.7727
Balanced Accuracy : 0.8544
'Positive' Class : N
Call:
glm(formula = top_50 ~ ., family = binomial, data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.6844 7.3431 -2.544 0.0109 *
posPF -2.1358 1.3850 -1.542 0.1230
posPG -1.8758 2.0854 -0.900 0.3684
posSF -0.6826 1.6757 -0.407 0.6837
posSG -0.3048 1.7937 -0.170 0.8651
fg 1.1205 0.4981 2.250 0.0245 *
fgp 16.4680 41.3104 0.399 0.6902
thr 2.3870 1.5595 1.531 0.1259
thrp -4.5043 5.9837 -0.753 0.4516
efg -0.5173 41.2340 -0.013 0.9900
trb 0.1575 0.2460 0.640 0.5219
ast 0.5552 0.4049 1.371 0.1703
stl 1.3078 1.1259 1.162 0.2454
blk 1.3646 0.9495 1.437 0.1507
tov -1.7518 1.1043 -1.586 0.1127
pf 0.2102 0.9898 0.212 0.8318
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 142.818 on 129 degrees of freedom
Residual deviance: 61.859 on 114 degrees of freedom
AIC: 93.859
Number of Fisher Scoring iterations: 7
3a ii
log(p / (1 − p)) = -3.25 + 4.10(FGP) + 0.85(AST) + 0.60(TRB) + 0.20(STL) − 0.45(TOV) + 0.10(BLK) p represents the probability a player is the top 50
3a iii
fg has a value of 0.0245. As this p value is less than 0.05 which is statistically significant, we can say it is an important variable in predicting if a player is in the top 50.
3a iv
fg is the most significant predictor. The impact it has on the odds is 3. For every 1 unit increase in fg, it impacts it times 3.
The classification tree has an accuracy of 89.39%, the regression had an accuracy of 90.91%. We dont compare the training accuracies we compare the test one. Regression is slightly more accurate.
4 b
Both models only have one predictor that is statistically significant.
Rows: 1000 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): name, nationality, club
dbl (39): age, overall, potential, value, wage, acceleration, aggression, ag...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2
Yes scaling is required as the ranges of the answers would be very different. There are 6 different variables and we want to bring them on to one scale because we’re looking to cluster them. As the distance between some points on these variables would be too big of a range, a common scale is needed.
# Dendrogramplot(hc, labels =FALSE, main ="Dendrogram")
3c i Heatmap
The heatmap shows some natural clusters as all along the bottom is a similar light coloured shade. Apart from the there is no clear other clustering structure.
# Heatmapheatmap(as.matrix(vars_scaled),Rowv =as.dendrogram(hc),scale ="none",main ="Heatmap with Clustering")
3d Create 4 Hierarchical Clusters
# Create 4 clustersfifa <- fifa %>%mutate(hc_cluster =cutree(hc, k =4))
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(...)`.
ℹ In group 1: `hc_cluster = 1`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
##3e ii Hierarchical Profiling - Age, Value and Wage Table
Here we can also see that group 1 are the youngest, the most value and the highest wage earners whilst group 3 are the oldest and the lowest wage earners.
hc_econ <- fifa %>%group_by(hc_cluster) %>%summarise(across(c(age, value, wage), mean, na.rm =TRUE))print(hc_econ)
4b K-means Silhouette Plot and Average Silhouette Width
The average silhouette width is 0.33, we would want this closer to 1. Good clustering means the points should be as far as possible from each other if a different cluster and as close as possible if the same one. 0.33 would mean weak clustering.
4c ii K-means Profiling - Age, Value and Wage Table
Here we can also see that group 3 are the youngest, the most value and the highest wage earners however group 1 who are tied oldest are the lowest value and the lowest wage earners.
k_econ <- fifa %>%group_by(k_cluster) %>%summarise(across(c(age, value, wage), mean, na.rm =TRUE))print(k_econ)
# -----------------------------# 10. PCA Plot (K-means)# -----------------------------pca_k_data <-as.data.frame(pca$x[,1:2])pca_k_data$cluster <-factor(fifa$k_cluster)ggplot(pca_k_data, aes(x = PC1, y = PC2, color = cluster)) +geom_point(size =2) +labs(title ="PCA Plot of K-means Clusters",color ="Cluster") +theme_minimal()
5 Hierarchical Silhouette Plot and Average Silhouette Width
5a
The k means clustering produced a slightly higher quality clustering with a silhouette width of 0.3307156 compared to the hierarchical clustering of 0.2954623.
5b
Yes both algorithms produced clusters with a similar profile as can be seen by the silhouette plots.