data <- read.csv("BasketballData.csv", header = TRUE, sep = ",", dec = ".")
head(data, 10)
## player_name team_abbreviation country gp pts reb ast usg_pct ts_pct
## 1 Quentin Grimes NYK USA 71 11.3 3.2 2.1 0.142 0.619
## 2 Quenton Jackson WAS USA 9 6.2 0.9 1.7 0.164 0.542
## 3 Pat Connaughton MIL USA 61 7.6 4.6 1.3 0.133 0.531
## 4 RJ Barrett NYK Canada 73 19.6 5.0 2.8 0.256 0.531
## 5 Precious Achiuwa TOR Nigeria 55 9.2 6.0 0.9 0.190 0.554
## 6 RaiQuan Gray BKN USA 1 16.0 9.0 7.0 0.205 0.621
## 7 R.J. Hampton DET USA 47 6.4 1.9 1.1 0.175 0.558
## 8 Peyton Watson DEN USA 23 3.3 1.6 0.5 0.174 0.553
## 9 Patrick Beverley CHI USA 67 6.2 3.7 2.9 0.107 0.534
## 10 Paul Reed PHI USA 69 4.2 3.8 0.4 0.152 0.620
This dataset contains 200 observations (basketball players) and 9 variables.
Dataset retrieved from https://www.kaggle.com/
summary(data[c(4:9)])
## gp pts reb ast
## Min. : 1.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:29.00 1st Qu.: 3.975 1st Qu.: 1.700 1st Qu.: 0.700
## Median :56.00 Median : 6.700 Median : 2.900 Median : 1.300
## Mean :48.13 Mean : 8.630 Mean : 3.403 Mean : 2.026
## 3rd Qu.:68.25 3rd Qu.:10.700 3rd Qu.: 4.425 3rd Qu.: 2.700
## Max. :83.00 Max. :31.400 Max. :12.300 Max. :10.400
## usg_pct ts_pct
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1410 1st Qu.:0.5222
## Median :0.1740 Median :0.5660
## Mean :0.1781 Mean :0.5553
## 3rd Qu.:0.2015 3rd Qu.:0.6085
## Max. :0.5000 Max. :0.9000
The summary above covers the numerical variables in the dataset. Research question: How can basketball players be grouped based on key performance metrics, and what distinguishes these clusters?
data_clu_std <- as.data.frame(scale(data[c(5:9)]))
The scale() function standardises the cluster variables, so that each has mean 0 and standard deviation 1. I use all numerical variables except gp for clustering.
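As a minimal sketch of what scale() does, the snippet below standardises a small made-up data frame and checks one column against the manual formula (x - mean) / sd. The object `toy` and its values are illustrative assumptions, not rows from the dataset.

```r
# Toy data: three made-up players, two variables on very different scales
toy <- data.frame(pts = c(5, 10, 15), ts_pct = c(0.50, 0.55, 0.63))

# scale() centres each column on its mean and divides by its standard deviation
toy_std <- as.data.frame(scale(toy))

# The same result computed by hand for one column
pts_manual <- (toy$pts - mean(toy$pts)) / sd(toy$pts)

print(toy_std)
print(all.equal(toy_std$pts, pts_manual))
```

Without this step, pts (range roughly 0 to 31) would dominate the Euclidean distances over usg_pct and ts_pct (range 0 to 1).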
data$Dissimilarity = sqrt(data_clu_std$pts^2 + data_clu_std$reb^2 + data_clu_std$ast^2 + data_clu_std$usg_pct^2 + data_clu_std$ts_pct^2)
head(data[order(-data$Dissimilarity), c("player_name", "team_abbreviation", "country", "Dissimilarity" )], 10)
## player_name team_abbreviation country Dissimilarity
## 165 Michael Foster Jr. PHI USA 6.345683
## 77 Nikola Jokic DEN Serbia 6.084864
## 131 Stanley Umude DET USA 5.993388
## 129 Sterling Brown LAL USA 5.705554
## 169 Deonte Burton SAC USA 5.583670
## 148 Trae Young ATL USA 5.328453
## 181 Domantas Sabonis SAC Lithuania 5.016857
## 138 Shai Gilgeous-Alexander OKC Canada 4.646465
## 91 Tyrese Haliburton IND USA 4.612262
## 130 Stephen Curry GSW USA 4.558166
I examined whether the standardised data contain any outliers. The largest distances (about 6.3) decrease gradually down the list, with no player clearly separated from the rest, so I did not identify any noticeable outliers.
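The Dissimilarity column is each player's Euclidean distance from the origin of the standardised space, that is, from a hypothetical player who is average on every variable. A minimal sketch on simulated data (the matrix `x` is a stand-in, not the real dataset) shows that the term-by-term formula is equivalent to a shorter rowSums() expression:

```r
set.seed(1)
# Simulated stand-in for the five clustering variables (not the real data)
x <- matrix(rnorm(50 * 5), ncol = 5,
            dimnames = list(NULL, c("pts", "reb", "ast", "usg_pct", "ts_pct")))
z <- scale(x)

# Distance of every row from the column means (the origin after scaling)
d <- sqrt(rowSums(z^2))

# Term-by-term version, as written in the analysis
d_long <- sqrt(z[, "pts"]^2 + z[, "reb"]^2 + z[, "ast"]^2 +
               z[, "usg_pct"]^2 + z[, "ts_pct"]^2)

print(all.equal(d, d_long))
# Rows with the largest distances are the first candidates to inspect
print(head(sort(d, decreasing = TRUE), 5))
```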
#install.packages("factoextra")
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distance <- get_dist(data_clu_std,
method = "euclidean")
fviz_dist(Distance,
gradient = list(low= "darkred",
mid = "grey95",
high = "white"))
The colour is most intense on the diagonal because each player is compared with himself there, so the distance is zero. Judging from this dissimilarity matrix, it seems possible to form two clusters.
library(factoextra)
get_clust_tendency(data_clu_std,
n= nrow(data_clu_std)- 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8370501
##
## $plot
## NULL
The Hopkins statistic is 0.84, well above 0.5, which suggests that the data are clusterable and that cluster analysis is a reasonable approach for this dataset.
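To illustrate the idea behind the statistic, here is a rough base-R sketch. It is not factoextra's implementation: the function name `hopkins_sketch`, the sample size `m` and the simulated data are all assumptions made for illustration. It follows the convention in which values near 1 indicate clusterable data and values near 0.5 indicate random data; some packages and older versions report the complement, so the direction is worth checking in the documentation of the version in use.

```r
set.seed(42)
# Simulated stand-in data: two well separated groups in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

hopkins_sketch <- function(x, m = 20) {
  n <- nrow(x)
  idx <- sample(n, m)

  # w: distance from each sampled real point to its nearest other real point
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # u: distance from each uniform random point to its nearest real point
  rng <- apply(x, 2, range)
  unif <- sapply(seq_len(ncol(x)), function(j) runif(m, rng[1, j], rng[2, j]))
  u <- apply(unif, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  # Clustered data: real points sit close together, so w is small and H is near 1
  sum(u) / (sum(u) + sum(w))
}

h <- hopkins_sketch(x)
print(h)
```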
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(factoextra)
library(magrittr)
WARD <- data_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 200
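I start with hierarchical clustering, using Euclidean distances and Ward's method (ward.D2) as the linkage criterion. At each step, Ward's method merges the two clusters whose union gives the smallest increase in within-cluster variance.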
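A minimal base-R sketch of the same procedure on simulated data (two made-up groups, not the basketball data) shows how the tree is built and then cut into a chosen number of groups:

```r
set.seed(7)
# Simulated stand-in data: two groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

# Euclidean distances, then Ward linkage (ward.D2 takes unsquared distances)
hc <- hclust(dist(x, method = "euclidean"), method = "ward.D2")

# Cut the tree into two groups and compare with the known group membership
groups <- cutree(hc, k = 2)
print(table(groups, truth = rep(1:2, each = 20)))
```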
library(factoextra)
# install.packages("ggplot2")
library(ggplot2)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Based on this dendrogram, I would choose 2 clusters, because that is where the largest jump in merge height occurs.
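The "largest jump" rule can also be read off the merge heights stored in the hclust object instead of judging it by eye. The sketch below uses simulated two-group data and the illustrative name `k_suggested`; it is one simple heuristic, not a replacement for the other criteria used in this analysis.

```r
set.seed(7)
# Simulated stand-in data: two groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
hc <- hclust(dist(x), method = "ward.D2")

# Merge heights in increasing order; the last ones form the top of the tree
heights <- hc$height
n <- nrow(x)

# Gap between consecutive merge heights
jumps <- diff(heights)

# After merge i there are n - i clusters, so cutting in the gap
# that follows merge i keeps n - i clusters
k_suggested <- n - which.max(jumps)

print(tail(heights, 5))
print(k_suggested)
```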
library(factoextra)
#install.packages("NbClust")
library(NbClust)
fviz_nbclust(data_clu_std, kmeans, method = "wss") +
labs (subtitle = "Elbow method")
Based on the Elbow Method, k = 4 or 5 seems to be the optimal number of clusters.
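For reference, the quantity on the vertical axis of the elbow plot is the total within-cluster sum of squares from k-means. A short sketch with simulated three-group data (an assumption for illustration) computes it directly for a range of k:

```r
set.seed(123)
# Simulated stand-in data with three groups
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2),
           cbind(rnorm(30, mean = 0), rnorm(30, mean = 5)))

# Total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is the k after which additional clusters reduce wss only slightly
print(data.frame(k = 1:8, wss = round(wss, 1)))
```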
fviz_nbclust(data_clu_std, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")
Silhouette analysis suggests 2 clusters.
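The average silhouette width behind this plot can be computed with silhouette() from the cluster package. The sketch below runs on simulated two-group data, so the k it selects says nothing about the basketball data; it only shows the mechanics.

```r
library(cluster)  # recommended package, shipped with standard R installations

set.seed(123)
# Simulated stand-in data: two groups of 30 points each
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))
d <- dist(x)

# Average silhouette width for k = 2, ..., 6 (it is undefined for k = 1)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

print(round(avg_sil, 3))
# The k with the highest average width is the one the method suggests
print(names(which.max(avg_sil)))
```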
library(NbClust)
NbClust(data_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 7 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 3.6553 108.9938 46.5754 -2.2984 254.7501 44951530374 16036.571 641.7394
## 3 1.0301 90.1463 35.3706 -2.9830 422.8800 43635250503 10854.635 519.5306
## 4 0.9317 82.2583 30.7938 -2.9180 573.9847 36441427910 10105.471 440.4496
## 5 4.9483 78.6815 12.3097 -2.0526 704.3271 29674288462 6323.810 380.6458
## 6 0.1300 69.0248 32.6389 -2.9932 779.1770 29390596300 6050.202 358.0436
## 7 4.4469 72.2635 15.1951 -0.5627 908.1274 20993738426 4572.313 306.4808
## 8 0.5850 68.6300 17.3598 -0.4000 1019.8521 15684348923 3709.065 284.1124
## 9 0.2954 67.2984 39.6333 0.2756 1106.1313 12894917619 2847.781 260.5542
## 10 5.1639 76.2364 13.6001 3.9560 1279.0229 6706628908 2021.468 215.7792
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 6.0467 1.5505 0.2002 1.2113 0.4088 0.7208 57.7049 1.2040 0.3897
## 3 8.0283 1.9152 0.1657 1.3541 0.3334 1.1289 -13.3598 -0.3499 0.3934
## 4 10.4963 2.2591 0.2156 1.2433 0.2782 1.0096 -1.3303 -0.0295 0.3698
## 5 12.0465 2.6140 0.2101 1.2202 0.2920 1.1362 -14.6281 -0.3720 0.3495
## 6 13.9831 2.7790 0.1976 1.2750 0.2207 2.1332 -56.8406 -1.6418 0.3248
## 7 15.7553 3.2465 0.1793 1.1743 0.2570 1.4455 -26.1978 -0.9534 0.3130
## 8 19.6170 3.5021 0.1760 1.3562 0.2175 1.0732 -5.7970 -0.2107 0.2976
## 9 21.8128 3.8188 0.1878 1.2672 0.2264 1.4533 -19.6502 -0.9581 0.2858
## 10 27.9244 4.6112 0.2362 1.1890 0.2493 1.6579 -21.0324 -1.2138 0.2796
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 320.8697 0.5218 0.7875 0.3292 0.0569 0.0021 2.5316 1.5196 1.0001
## 3 173.1769 0.5602 -7.7574 0.6128 0.0442 0.0024 2.9399 1.3834 1.1216
## 4 110.1124 0.5153 -0.0688 0.7318 0.0511 0.0027 2.5605 1.3041 0.6000
## 5 76.1292 0.5678 3.0305 0.7661 0.0490 0.0030 2.6007 1.2230 0.6919
## 6 59.6739 0.4634 -0.3691 1.3666 0.0408 0.0030 2.7303 1.1764 0.5920
## 7 43.7830 0.5129 4.0476 1.2274 0.0490 0.0032 2.5829 1.1045 0.5051
## 8 35.5140 0.4641 2.8566 1.5709 0.0427 0.0034 3.2307 1.0634 0.6042
## 9 28.9505 0.3853 -0.4350 2.4500 0.0403 0.0035 3.2427 1.0143 0.5041
## 10 21.5779 0.4243 1.2628 2.0977 0.0831 0.0035 3.1843 0.9470 0.3771
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.7219 57.3950 0.3054
## 3 0.6080 75.4265 1.0000
## 4 0.6999 60.0218 1.0000
## 5 0.6992 52.4932 1.0000
## 6 0.6668 53.4593 1.0000
## 7 0.6741 41.1014 1.0000
## 8 0.6629 43.2216 1.0000
## 9 0.6229 38.1436 1.0000
## 10 0.5965 35.8498 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 10.0000 2.0000 10.0000 10.000 10.0000 5 3.000
## Value_Index 5.1639 108.9938 26.0332 3.956 172.8916 6483447287 5181.937
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 10.0000 7.0000 3.0000 7.0000 2.0000 3.0000
## Value_Index 43.1278 6.1116 -0.2119 0.1657 1.1743 0.4088 1.1289
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 3.0000 2.000 3.0000 3.0000 5.0000 1 2.0000
## Value_Index -13.3598 1.204 0.3934 147.6928 0.5678 NA 0.3292
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 10.0000 0 2.0000 0 10.0000
## Value_Index 0.0831 0 2.5316 0 0.3771
##
## $Best.partition
## [1] 2 2 2 1 2 1 2 2 2 2 1 2 1 2 2 2 2 2 1 2 2 1 2 2 3 3 2 1 2 2 2 2 2 2 2 2 3
## [38] 2 3 2 2 2 2 2 2 1 2 2 2 2 1 2 2 3 2 1 2 1 2 1 3 2 2 2 2 2 2 2 2 2 2 1 1 2
## [75] 1 2 1 2 2 1 2 2 1 3 2 3 2 1 1 3 1 1 2 3 2 2 2 2 3 3 2 3 3 2 1 1 2 2 1 1 2
## [112] 2 2 3 2 2 2 2 3 1 2 3 2 3 2 1 2 1 3 1 3 2 1 1 2 1 2 1 2 2 2 3 2 2 2 2 1 1
## [149] 2 2 1 2 2 2 2 3 2 2 1 2 2 2 2 1 3 2 2 2 3 3 2 3 1 1 1 1 3 2 2 1 1 2 1 3 2
## [186] 2 1 2 2 1 2 2 2 2 2 1 3 2 2 2
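According to the NbClust majority rule, the best number of clusters is 3 (7 indices proposed 3, 7 proposed 10 and 5 proposed 2), so I continue with k = 3.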
Clustering <- kmeans(data_clu_std,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 29, 128, 43
##
## Cluster means:
## pts reb ast usg_pct ts_pct
## 1 -1.0051548 -1.0087016 -0.6375133 -0.02157004 -1.6756620
## 2 -0.2963039 -0.1315719 -0.3035475 -0.32566257 0.2676287
## 3 1.5599160 1.0719430 1.3335340 0.98396141 0.3334355
##
## Clustering vector:
## [1] 2 2 2 3 2 3 2 2 2 2 3 2 3 2 2 2 2 2 3 2 2 3 2 2 1 1 2 3 2 2 2 2 2 2 2 2 1
## [38] 2 1 2 2 2 2 2 2 3 2 2 2 2 2 2 2 1 2 3 2 3 2 3 1 2 2 2 2 2 2 2 2 2 2 3 3 2
## [75] 3 2 3 2 2 3 2 2 3 1 2 1 2 2 3 1 3 3 2 1 2 2 2 2 1 1 2 1 1 2 3 3 2 2 3 3 2
## [112] 2 2 1 2 2 2 2 1 3 2 1 2 1 2 2 2 3 1 3 1 2 3 3 2 3 2 3 2 2 2 1 2 2 2 2 3 3
## [149] 2 2 3 2 2 2 2 1 2 2 3 2 2 2 2 3 1 2 2 2 1 1 2 1 2 3 3 3 1 2 2 2 3 2 3 1 2
## [186] 2 2 2 2 3 2 2 2 2 2 3 1 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 132.0950 205.0398 180.9174
## (between_SS / total_SS = 47.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
rownames(data_clu_std) <- data$player_name
library(factoextra)
library(ggrepel)
fviz_cluster(Clustering,
data = data_clu_std,
palette = "Set1", # Color palette
repel = TRUE, # Avoid overlapping labels
ggtheme = theme_bw(),
geom = "point", # Show points instead of text labels
ellipse = TRUE, # Add cluster boundaries
show.clust.cent = TRUE) + # Show cluster centers
geom_text_repel(aes(label = rownames(data_clu_std)),
size = 3 ,
alpha = 0.8,
max.overlaps = 50) + # Tidy labels
theme(legend.position = "bottom")
## Warning: ggrepel: 75 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The plot shows the three clusters projected onto the first two principal components. The aim is to find what players within a cluster have in common and what distinguishes the clusters from each other.
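As far as I understand the factoextra documentation, fviz_cluster() draws the observations on the first two principal components when there are more than two clustering variables. A base-R sketch of that projection, on simulated data rather than the real players, also shows how much of the total variance the two plotted axes capture:

```r
set.seed(1)
# Simulated stand-in for five standardised clustering variables
x <- scale(matrix(rnorm(100 * 5), ncol = 5))
km <- kmeans(x, centers = 3, nstart = 25)

# Principal components of the data; keep the first two as plotting coordinates
pca <- prcomp(x)
scores <- pca$x[, 1:2]

# Share of the total variance captured by the two plotted axes
var_share <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
print(round(var_share, 3))

# Coordinates together with the cluster label of each observation
print(head(data.frame(scores, cluster = km$cluster)))
```

Calling plot(scores, col = km$cluster) would draw a bare-bones version of the cluster plot. When the variance share is low, clusters that overlap in the plot may still be separated in the full five-dimensional space.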
Averages <- Clustering$centers
Averages
## pts reb ast usg_pct ts_pct
## 1 -1.0051548 -1.0087016 -0.6375133 -0.02157004 -1.6756620
## 2 -0.2963039 -0.1315719 -0.3035475 -0.32566257 0.2676287
## 3 1.5599160 1.0719430 1.3335340 0.98396141 0.3334355
Figure <- as.data.frame(Averages)
Figure$player_name <- 1:nrow(Figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
Figure <- pivot_longer(Figure, cols = c( "pts", "reb", "ast", "usg_pct", "ts_pct"))
Figure$Group <- factor(Figure$player_name,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$ImeF <- factor(Figure$name,
levels = c("pts", "reb", "ast", "usg_pct", "ts_pct"),
labels = c("pts", "reb", "ast", "usg_pct", "ts_pct"))
str(data)
## 'data.frame': 200 obs. of 10 variables:
## $ player_name : chr "Quentin Grimes" "Quenton Jackson" "Pat Connaughton" "RJ Barrett" ...
## $ team_abbreviation: chr "NYK" "WAS" "MIL" "NYK" ...
## $ country : chr "USA" "USA" "USA" "Canada" ...
## $ gp : int 71 9 61 73 55 1 47 23 67 69 ...
## $ pts : num 11.3 6.2 7.6 19.6 9.2 16 6.4 3.3 6.2 4.2 ...
## $ reb : num 3.2 0.9 4.6 5 6 9 1.9 1.6 3.7 3.8 ...
## $ ast : num 2.1 1.7 1.3 2.8 0.9 7 1.1 0.5 2.9 0.4 ...
## $ usg_pct : num 0.142 0.164 0.133 0.256 0.19 0.205 0.175 0.174 0.107 0.152 ...
## $ ts_pct : num 0.619 0.542 0.531 0.531 0.554 0.621 0.558 0.553 0.534 0.62 ...
## $ Dissimilarity : num 0.954 1.183 1.042 2.304 1.265 ...
library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = player_name), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
scale_color_brewer(palette="Set1") +
ylim(-2.2, 2.4) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
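Cluster 3 (green) has the highest standardised averages on all five clustering variables, so the players in this cluster perform better than those in the other two clusters. Cluster 1 has the lowest averages on pts, reb, ast and ts_pct, and cluster 2 lies in between on those four variables.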
str(Clustering)
## List of 9
## $ cluster : int [1:200] 2 2 2 3 2 3 2 2 2 2 ...
## $ centers : num [1:3, 1:5] -1.005 -0.296 1.56 -1.009 -0.132 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:5] "pts" "reb" "ast" "usg_pct" ...
## $ totss : num 995
## $ withinss : num [1:3] 132 205 181
## $ tot.withinss: num 518
## $ betweenss : num 477
## $ size : int [1:3] 29 128 43
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
data$Group <-Clustering$cluster
fit <- aov(cbind( pts,reb, ast, usg_pct, ts_pct) ~ as.factor(Group),
data=data)
summary(fit)
## Response pts :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 6277.6 3138.79 265.64 < 2.2e-16 ***
## Residuals 197 2327.7 11.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response reb :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 436.96 218.482 67.801 < 2.2e-16 ***
## Residuals 197 634.81 3.222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response ast :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 428.01 214.007 99.591 < 2.2e-16 ***
## Residuals 197 423.33 2.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response usg_pct :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.18026 0.090132 37.83 1.248e-14 ***
## Residuals 197 0.46936 0.002383
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response ts_pct :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.1432 0.57162 90.66 < 2.2e-16 ***
## Residuals 197 1.2421 0.00631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aggregate(data$gp,
by = list(data$Group),
FUN = mean)
## Group.1 x
## 1 1 15.72414
## 2 2 51.01562
## 3 3 61.41860
I chose gp to assess criterion validity because it was not used for clustering.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(data$gp, as.factor(data$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 5.5526 0.004509 **
## 197
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I used Levene's test to check homogeneity of variance. Since p = 0.005 < 0.05, we reject the null hypothesis and conclude that the variance of gp is not equal across the three clusters.
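With center = median, Levene's test amounts to a one-way ANOVA on the absolute deviations of each value from its group median. The sketch below reproduces that logic in base R on simulated groups with deliberately different spreads (the data frame `toy` is an assumption for illustration):

```r
set.seed(10)
# Simulated stand-in: three groups with the same mean but different spreads
toy <- data.frame(
  gp    = c(rnorm(30, 40, 5), rnorm(30, 40, 15), rnorm(30, 40, 25)),
  Group = factor(rep(1:3, each = 30))
)

# Absolute deviation of each value from its own group's median
toy$abs_dev <- abs(toy$gp - ave(toy$gp, toy$Group, FUN = median))

# One-way ANOVA on those deviations: a large F points to unequal spread
levene_fit <- anova(lm(abs_dev ~ Group, data = toy))
print(levene_fit)
```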
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
data %>%
group_by(Group) %>%
shapiro_test(gp)
## # A tibble: 3 × 4
## Group variable statistic p
## <int> <chr> <dbl> <dbl>
## 1 1 gp 0.857 0.00104
## 2 2 gp 0.927 0.00000349
## 3 3 gp 0.770 0.000000864
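The Shapiro-Wilk test checks normality within each cluster. All three p-values are below 0.05 (for example, p < 0.001 for cluster 2), so we reject the null hypothesis: gp is not normally distributed in any cluster. A Kruskal-Wallis test is therefore needed.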
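The same per-group check can be written in base R with split() and shapiro.test(). The sketch uses simulated data in which group A is drawn from a normal distribution and group B from a skewed one; both groups and their names are assumptions for illustration.

```r
set.seed(3)
# Simulated stand-in: group A is normal, group B is right-skewed
toy <- data.frame(
  gp    = c(rnorm(40, 50, 10), rexp(40, rate = 1 / 20)),
  Group = rep(c("A", "B"), each = 40)
)

# Shapiro-Wilk test within each group; small p-values reject normality
p_values <- sapply(split(toy$gp, toy$Group),
                   function(v) shapiro.test(v)$p.value)
print(round(p_values, 4))
```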
fit <- aov(gp ~ as.factor(Group),
data = data)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 39113 19557 44.22 <2e-16 ***
## Residuals 197 87124 442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
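A one-way ANOVA shows a significant difference in mean gp across clusters, F(2, 197) = 44.22, p < 0.001, indicating that at least one cluster differs. Because the normality and equal-variance assumptions are violated, this result is reported for comparison only.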
kruskal.test(gp ~ Group,
data =data)
##
## Kruskal-Wallis rank sum test
##
## data: gp by Group
## Kruskal-Wallis chi-squared = 53.227, df = 2, p-value = 2.767e-12
The Kruskal-Wallis test is a non-parametric alternative to ANOVA that does not require normally distributed data. Given that the Shapiro-Wilk test rejected normality and Levene's test indicated unequal variances, Kruskal-Wallis was appropriate for comparing gp across clusters. The test result (χ²(2) = 53.227, p < 0.001) confirms a significant difference in gp distributions among the groups.
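The test statistic is built from ranks rather than raw values. As a sketch on simulated data (continuous values, so there are no ties and no tie correction is needed), the H statistic can be computed by hand and compared with kruskal.test():

```r
set.seed(5)
# Simulated stand-in: three groups with different typical values
toy <- data.frame(
  gp    = c(rnorm(20, 20, 8), rnorm(20, 50, 8), rnorm(20, 60, 8)),
  Group = factor(rep(1:3, each = 20))
)

# Rank all observations together, then sum the ranks within each group
r   <- rank(toy$gp)
N   <- nrow(toy)
R_j <- tapply(r, toy$Group, sum)
n_j <- tapply(r, toy$Group, length)

# H statistic: large when the rank sums differ more than chance would allow
H <- 12 / (N * (N + 1)) * sum(R_j^2 / n_j) - 3 * (N + 1)

# Compare with the built-in test
kw <- kruskal.test(gp ~ Group, data = toy)
print(c(manual = H, builtin = unname(kw$statistic)))
```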
kruskal_effsize(gp ~ Group,
data = data)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 gp 200 0.260 eta2[H] large
The effect size (η² = 0.26) is large, meaning that the differences in gp between the clusters are substantial. Because gp was not used to form the clusters, this supports the criterion validity of the three-cluster solution.
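The reported effect size follows from the Kruskal-Wallis statistic through the formula documented for kruskal_effsize(), eta2[H] = (H - k + 1) / (n - k), where k is the number of groups and n the number of observations. Plugging in the values from the output above:

```r
# Values reported in the output above
H <- 53.227  # Kruskal-Wallis chi-squared
k <- 3       # number of clusters
n <- 200     # number of players

# Eta-squared based on the H statistic
eta2_H <- (H - k + 1) / (n - k)
print(round(eta2_H, 3))
```

The result matches the 0.260 in the table; under the thresholds given in the rstatix documentation, values of 0.14 or more are labelled large.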
CONCLUSION
Basketball players were grouped into three distinct clusters based on key performance metrics. Cluster 3 (43 players) consists of high-impact players with well above-average scoring, rebounding, assists and usage. Cluster 2 (128 players) represents role players who are slightly below average in scoring, rebounding, assists and usage but shoot with above-average efficiency. Cluster 1 (29 players) includes players with low scoring and rebounding, poor shooting efficiency and few games played (about 16 on average). These findings highlight meaningful differences in player profiles based on performance metrics.