Homework 2

This data set contains 200 observations (basketball players) and 9 variables.

Description:

player_name: Name of the basketball player
team_abbreviation: Abbreviation of the player's NBA team
country: Player’s country of origin
gp: Games Played
pts: Points scored
reb: Rebounds
ast: Assists
usg_pct: Usage Percentage (percentage of team plays used by a player while on the floor)
ts_pct: True Shooting Percentage (measures shooting efficiency, including 2-pointers, 3-pointers, and free throws)

Dataset retrieved from https://www.kaggle.com/

summary(data[c(4:9)])

##        gp             pts              reb              ast        
##  Min.   : 1.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:29.00   1st Qu.: 3.975   1st Qu.: 1.700   1st Qu.: 0.700  
##  Median :56.00   Median : 6.700   Median : 2.900   Median : 1.300  
##  Mean   :48.13   Mean   : 8.630   Mean   : 3.403   Mean   : 2.026  
##  3rd Qu.:68.25   3rd Qu.:10.700   3rd Qu.: 4.425   3rd Qu.: 2.700  
##  Max.   :83.00   Max.   :31.400   Max.   :12.300   Max.   :10.400  
##     usg_pct           ts_pct      
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1410   1st Qu.:0.5222  
##  Median :0.1740   Median :0.5660  
##  Mean   :0.1781   Mean   :0.5553  
##  3rd Qu.:0.2015   3rd Qu.:0.6085  
##  Max.   :0.5000   Max.   :0.9000

The summary above covers the numerical variables in the dataset.

Research question: How can basketball players be grouped based on key performance metrics, and what distinguishes these clusters?

data_clu_std <- as.data.frame(scale(data[c(5:9)]))

The scale() function standardises the cluster variables. I use all numerical variables except gp (that is, pts, reb, ast, usg_pct, and ts_pct) for clustering.
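As a minimal sketch of what scale() does, the toy data frame below (hypothetical values, used only for illustration) is standardised by hand and compared with scale(). Each column ends up with mean 0 and standard deviation 1, so variables measured on different scales, such as pts and ts_pct, contribute comparably to Euclidean distances.

```r
# Toy data (hypothetical values, only for illustration)
toy <- data.frame(pts    = c(11.3, 6.2, 7.6, 19.6, 9.2),
                  ts_pct = c(0.619, 0.542, 0.531, 0.531, 0.554))

# Manual z-scores: subtract the column mean, divide by the sample SD
toy_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() does the same with its defaults (center = TRUE, scale = TRUE)
toy_scaled <- as.data.frame(scale(toy))

# Compare the two results and check the column means and SDs
all.equal(as.matrix(toy_manual), as.matrix(toy_scaled), check.attributes = FALSE)
round(colMeans(toy_scaled), 10)
sapply(toy_scaled, sd)
```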

data$Dissimilarity = sqrt(data_clu_std$pts^2 + data_clu_std$reb^2 + data_clu_std$ast^2 + data_clu_std$usg_pct^2 + data_clu_std$ts_pct^2)

head(data[order(-data$Dissimilarity), c("player_name", "team_abbreviation", "country", "Dissimilarity" )], 10)

##                 player_name team_abbreviation   country Dissimilarity
## 165      Michael Foster Jr.               PHI       USA      6.345683
## 77             Nikola Jokic               DEN    Serbia      6.084864
## 131           Stanley Umude               DET       USA      5.993388
## 129          Sterling Brown               LAL       USA      5.705554
## 169           Deonte Burton               SAC       USA      5.583670
## 148              Trae Young               ATL       USA      5.328453
## 181        Domantas Sabonis               SAC Lithuania      5.016857
## 138 Shai Gilgeous-Alexander               OKC    Canada      4.646465
## 91        Tyrese Haliburton               IND       USA      4.612262
## 130           Stephen Curry               GSW       USA      4.558166

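I checked for outliers after standardisation using each player's Euclidean distance from the centre of the standardised data. The ten largest distances decrease gradually (from 6.35 to 4.56) without a clear gap, so I have not identified any noticeable outliers.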
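The Dissimilarity column is the distance of each row from the origin of the standardised variables. A more compact way to write the same calculation is sketched below; the data frame z holds hypothetical values and only stands in for data_clu_std.

```r
# Hypothetical standardised values standing in for data_clu_std
z <- data.frame(pts = c(-0.5, 2.1,  0.3),
                reb = c( 0.2, 1.8, -0.4),
                ast = c(-0.1, 2.5,  0.0))

# Distance of each row from the origin: square, sum across columns, take the root
dissimilarity <- sqrt(rowSums(z^2))

# Same calculation with every column written out explicitly
explicit <- sqrt(z$pts^2 + z$reb^2 + z$ast^2)
all.equal(unname(dissimilarity), explicit)

# Row indices sorted from the most to the least unusual observation
order(-dissimilarity)
```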

#install.packages("factoextra")
library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Distance <- get_dist(data_clu_std,
                     method = "euclidean")

fviz_dist(Distance, 
          gradient = list(low= "darkred",
                          mid = "grey95",
                          high = "white"))

The diagonal has the most intense colour because each cell there compares a player with himself, so the distance is zero. Based on this dissimilarity matrix, I would say it is possible to form two clusters.

library(factoextra)
get_clust_tendency(data_clu_std,
                   n = nrow(data_clu_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.8370501
## 
## $plot
## NULL

The Hopkins statistic is 0.84, well above 0.5, which suggests that the data have a cluster structure and that clustering is a reasonable approach for analyzing the dataset.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(factoextra)
library(magrittr)
WARD <- data_clu_std %>% 
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")
WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 200

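I start with hierarchical clustering. The distances between players are Euclidean, and Ward's method (ward.D2) is the linkage criterion: at each step it merges the two clusters whose merger gives the smallest increase in within-cluster variance.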
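A minimal sketch of the same steps in base R, without the pipe and without factoextra, is shown below on simulated data (the two groups are an assumption made only for illustration). With "ward.D2", hclust() expects ordinary (unsquared) Euclidean distances, which is what dist() returns.

```r
set.seed(1)  # reproducible toy data

# Two hypothetical groups of 10 points each, in 2 dimensions
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

d  <- dist(scale(toy), method = "euclidean")  # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")           # Ward linkage on those distances

# Cut the tree into 2 groups and count the members of each
groups <- cutree(hc, k = 2)
table(groups)
```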

library(factoextra)

# install.packages("ggplot2")
library(ggplot2)

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on this dendrogram, I would choose 2 clusters, because that is where the largest vertical gap between successive merges (the horizontal lines) occurs.
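The largest jump can also be read from the merge heights stored in the hclust object instead of being judged by eye. The sketch below uses simulated data (an assumption for illustration); the same lines applied to the WARD object would list the heights of its last merges.

```r
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
hc <- hclust(dist(toy), method = "ward.D2")

# Heights of the last five merges, from the top of the tree downwards
top_heights <- rev(tail(hc$height, 5))

# Vertical gaps between consecutive merges;
# cutting the tree inside gaps[i] gives i + 1 clusters
gaps <- -diff(top_heights)
gaps

# Number of clusters suggested by the largest gap
which.max(gaps) + 1
```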

library(factoextra)
#install.packages("NbClust")
library(NbClust)



fviz_nbclust(data_clu_std, kmeans, method = "wss") + 
  labs(subtitle = "Elbow method")

Based on the elbow method, k = 4 or 5 seems to be the optimal number of clusters.
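The elbow plot shows the total within-cluster sum of squares for each k. A minimal base-R sketch of that calculation is given below on simulated data with three hypothetical groups; it does not use fviz_nbclust() and is not meant to reproduce the plot above.

```r
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2),
             matrix(rnorm(40, mean = 8), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The elbow is the point where the curve stops dropping steeply
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```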

fviz_nbclust(data_clu_std, kmeans, method = "silhouette") + 
  labs(subtitle = "Silhouette analysis")

Silhouette analysis suggests 2 clusters.
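The silhouette plot reports the average silhouette width for each k. A rough sketch of that calculation follows, on simulated data with two hypothetical groups. It assumes the cluster package is installed and uses its silhouette() function.

```r
library(cluster)  # assumed to be installed; provides silhouette()

set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
d <- dist(toy)

# Average silhouette width for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

avg_sil
names(which.max(avg_sil))  # k with the highest average silhouette width
```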

library(NbClust)
NbClust(data_clu_std,
        distance = "euclidean",
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 7 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## $All.index
##        KL       CH Hartigan     CCC     Scott     Marriot    TrCovW   TraceW
## 2  3.6553 108.9938  46.5754 -2.2984  254.7501 44951530374 16036.571 641.7394
## 3  1.0301  90.1463  35.3706 -2.9830  422.8800 43635250503 10854.635 519.5306
## 4  0.9317  82.2583  30.7938 -2.9180  573.9847 36441427910 10105.471 440.4496
## 5  4.9483  78.6815  12.3097 -2.0526  704.3271 29674288462  6323.810 380.6458
## 6  0.1300  69.0248  32.6389 -2.9932  779.1770 29390596300  6050.202 358.0436
## 7  4.4469  72.2635  15.1951 -0.5627  908.1274 20993738426  4572.313 306.4808
## 8  0.5850  68.6300  17.3598 -0.4000 1019.8521 15684348923  3709.065 284.1124
## 9  0.2954  67.2984  39.6333  0.2756 1106.1313 12894917619  2847.781 260.5542
## 10 5.1639  76.2364  13.6001  3.9560 1279.0229  6706628908  2021.468 215.7792
##    Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2    6.0467 1.5505 0.2002 1.2113     0.4088 0.7208  57.7049  1.2040    0.3897
## 3    8.0283 1.9152 0.1657 1.3541     0.3334 1.1289 -13.3598 -0.3499    0.3934
## 4   10.4963 2.2591 0.2156 1.2433     0.2782 1.0096  -1.3303 -0.0295    0.3698
## 5   12.0465 2.6140 0.2101 1.2202     0.2920 1.1362 -14.6281 -0.3720    0.3495
## 6   13.9831 2.7790 0.1976 1.2750     0.2207 2.1332 -56.8406 -1.6418    0.3248
## 7   15.7553 3.2465 0.1793 1.1743     0.2570 1.4455 -26.1978 -0.9534    0.3130
## 8   19.6170 3.5021 0.1760 1.3562     0.2175 1.0732  -5.7970 -0.2107    0.2976
## 9   21.8128 3.8188 0.1878 1.2672     0.2264 1.4533 -19.6502 -0.9581    0.2858
## 10  27.9244 4.6112 0.2362 1.1890     0.2493 1.6579 -21.0324 -1.2138    0.2796
##        Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  320.8697     0.5218  0.7875  0.3292 0.0569 0.0021  2.5316 1.5196 1.0001
## 3  173.1769     0.5602 -7.7574  0.6128 0.0442 0.0024  2.9399 1.3834 1.1216
## 4  110.1124     0.5153 -0.0688  0.7318 0.0511 0.0027  2.5605 1.3041 0.6000
## 5   76.1292     0.5678  3.0305  0.7661 0.0490 0.0030  2.6007 1.2230 0.6919
## 6   59.6739     0.4634 -0.3691  1.3666 0.0408 0.0030  2.7303 1.1764 0.5920
## 7   43.7830     0.5129  4.0476  1.2274 0.0490 0.0032  2.5829 1.1045 0.5051
## 8   35.5140     0.4641  2.8566  1.5709 0.0427 0.0034  3.2307 1.0634 0.6042
## 9   28.9505     0.3853 -0.4350  2.4500 0.0403 0.0035  3.2427 1.0143 0.5041
## 10  21.5779     0.4243  1.2628  2.0977 0.0831 0.0035  3.1843 0.9470 0.3771
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.7219            57.3950       0.3054
## 3          0.6080            75.4265       1.0000
## 4          0.6999            60.0218       1.0000
## 5          0.6992            52.4932       1.0000
## 6          0.6668            53.4593       1.0000
## 7          0.6741            41.1014       1.0000
## 8          0.6629            43.2216       1.0000
## 9          0.6229            38.1436       1.0000
## 10         0.5965            35.8498       1.0000
## 
## $Best.nc
##                      KL       CH Hartigan    CCC    Scott    Marriot   TrCovW
## Number_clusters 10.0000   2.0000  10.0000 10.000  10.0000          5    3.000
## Value_Index      5.1639 108.9938  26.0332  3.956 172.8916 6483447287 5181.937
##                  TraceW Friedman   Rubin Cindex     DB Silhouette   Duda
## Number_clusters  3.0000  10.0000  7.0000 3.0000 7.0000     2.0000 3.0000
## Value_Index     43.1278   6.1116 -0.2119 0.1657 1.1743     0.4088 1.1289
##                 PseudoT2 Beale Ratkowsky     Ball PtBiserial Frey McClain
## Number_clusters   3.0000 2.000    3.0000   3.0000     5.0000    1  2.0000
## Value_Index     -13.3598 1.204    0.3934 147.6928     0.5678   NA  0.3292
##                    Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 10.0000      0  2.0000      0 10.0000
## Value_Index      0.0831      0  2.5316      0  0.3771
## 
## $Best.partition
##   [1] 2 2 2 1 2 1 2 2 2 2 1 2 1 2 2 2 2 2 1 2 2 1 2 2 3 3 2 1 2 2 2 2 2 2 2 2 3
##  [38] 2 3 2 2 2 2 2 2 1 2 2 2 2 1 2 2 3 2 1 2 1 2 1 3 2 2 2 2 2 2 2 2 2 2 1 1 2
##  [75] 1 2 1 2 2 1 2 2 1 3 2 3 2 1 1 3 1 1 2 3 2 2 2 2 3 3 2 3 3 2 1 1 2 2 1 1 2
## [112] 2 2 3 2 2 2 2 3 1 2 3 2 3 2 1 2 1 3 1 3 2 1 1 2 1 2 1 2 2 2 3 2 2 2 2 1 1
## [149] 2 2 1 2 2 2 2 3 2 2 1 2 2 2 2 1 3 2 2 2 3 3 2 3 1 1 1 1 3 2 2 1 1 2 1 3 2
## [186] 2 1 2 2 1 2 2 2 2 2 1 3 2 2 2

Seven indices propose 3 clusters and seven propose 10. NbClust's majority rule reports 3 as the best number of clusters, so I use 3 clusters for the k-means solution.
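The vote count behind the majority rule can be reproduced from the Number_clusters row of $Best.nc printed above. The sketch below copies those values by hand; it shows that 3 and 10 are tied with seven votes each. How NbClust itself resolves such a tie is not shown here.

```r
# Number of clusters proposed by each index, copied from $Best.nc above
best_nc <- c(KL = 10, CH = 2, Hartigan = 10, CCC = 10, Scott = 10,
             Marriot = 5, TrCovW = 3, TraceW = 3, Friedman = 10, Rubin = 7,
             Cindex = 3, DB = 7, Silhouette = 2, Duda = 3, PseudoT2 = 3,
             Beale = 2, Ratkowsky = 3, Ball = 3, PtBiserial = 5, Frey = 1,
             McClain = 2, Dunn = 10, Hubert = 0, SDindex = 2, Dindex = 0,
             SDbw = 10)

# Keep only proposals inside the requested range (min.nc = 2, max.nc = 10);
# this drops Hubert and Dindex (graphical methods, shown as 0) and Frey (1)
votes <- table(best_nc[best_nc >= 2 & best_nc <= 10])
votes
```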

Clustering <- kmeans(data_clu_std,
                     centers = 3,
                     nstart = 25)
Clustering

## K-means clustering with 3 clusters of sizes 29, 128, 43
## 
## Cluster means:
##          pts        reb        ast     usg_pct     ts_pct
## 1 -1.0051548 -1.0087016 -0.6375133 -0.02157004 -1.6756620
## 2 -0.2963039 -0.1315719 -0.3035475 -0.32566257  0.2676287
## 3  1.5599160  1.0719430  1.3335340  0.98396141  0.3334355
## 
## Clustering vector:
##   [1] 2 2 2 3 2 3 2 2 2 2 3 2 3 2 2 2 2 2 3 2 2 3 2 2 1 1 2 3 2 2 2 2 2 2 2 2 1
##  [38] 2 1 2 2 2 2 2 2 3 2 2 2 2 2 2 2 1 2 3 2 3 2 3 1 2 2 2 2 2 2 2 2 2 2 3 3 2
##  [75] 3 2 3 2 2 3 2 2 3 1 2 1 2 2 3 1 3 3 2 1 2 2 2 2 1 1 2 1 1 2 3 3 2 2 3 3 2
## [112] 2 2 1 2 2 2 2 1 3 2 1 2 1 2 2 2 3 1 3 1 2 3 3 2 3 2 3 2 2 2 1 2 2 2 2 3 3
## [149] 2 2 3 2 2 2 2 1 2 2 3 2 2 2 2 3 1 2 2 2 1 1 2 1 2 3 3 3 1 2 2 2 3 2 3 1 2
## [186] 2 2 2 2 3 2 2 2 2 2 3 1 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 132.0950 205.0398 180.9174
##  (between_SS / total_SS =  47.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
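The between_SS / total_SS value of 47.9 % in the output can be checked by hand from the printed within-cluster sums of squares. This is a worked example using only numbers shown above, plus the fact that each standardised variable has a sum of squares of n - 1.

```r
# Within-cluster sums of squares as printed in the k-means output above
withinss <- c(132.0950, 205.0398, 180.9174)

# Total sum of squares of the standardised data: each of the 5 scaled
# variables has variance 1, so its sum of squares is n - 1 = 199
totss <- 5 * (200 - 1)

tot_withinss <- sum(withinss)          # total within-cluster sum of squares
betweenss    <- totss - tot_withinss   # between-cluster sum of squares

# Share of the total variation explained by the clusters, in percent
round(100 * betweenss / totss, 1)
```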

rownames(data_clu_std) <- data$player_name

library(factoextra)
library(ggrepel)
fviz_cluster(Clustering, 
             data = data_clu_std, 
             palette = "Set1",   # Color palette
             repel = TRUE,       # Avoid overlapping labels
             ggtheme = theme_bw(), 
             geom = "point",     # Show points instead of text labels
             ellipse = TRUE,     # Add cluster boundaries
             show.clust.cent = TRUE) +  # Show cluster centers
  geom_text_repel(aes(label = rownames(data_clu_std)), 
                  size = 3 ,
                  alpha = 0.8,
                  max.overlaps = 50) +  # Tidy labels
  theme(legend.position = "bottom")

## Warning: ggrepel: 75 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The plot shows the three clusters projected onto the first two principal components. The main goal of the clustering is to find what is similar within each group and to distinguish the differences between the groups.

Averages <- Clustering$centers
Averages

##          pts        reb        ast     usg_pct     ts_pct
## 1 -1.0051548 -1.0087016 -0.6375133 -0.02157004 -1.6756620
## 2 -0.2963039 -0.1315719 -0.3035475 -0.32566257  0.2676287
## 3  1.5599160  1.0719430  1.3335340  0.98396141  0.3334355

Figure <- as.data.frame(Averages)
Figure$player_name <- 1:nrow(Figure)  # cluster number (1 to 3); despite the column name, not a player

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

Figure <- pivot_longer(Figure, cols = c( "pts", "reb", "ast", "usg_pct", "ts_pct"))


Figure$Group <- factor(Figure$player_name, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$ImeF <- factor(Figure$name, 
              levels = c("pts", "reb", "ast", "usg_pct", "ts_pct"), 
              labels = c("pts", "reb", "ast", "usg_pct", "ts_pct"))

str(data)

## 'data.frame':    200 obs. of  10 variables:
##  $ player_name      : chr  "Quentin Grimes" "Quenton Jackson" "Pat Connaughton" "RJ Barrett" ...
##  $ team_abbreviation: chr  "NYK" "WAS" "MIL" "NYK" ...
##  $ country          : chr  "USA" "USA" "USA" "Canada" ...
##  $ gp               : int  71 9 61 73 55 1 47 23 67 69 ...
##  $ pts              : num  11.3 6.2 7.6 19.6 9.2 16 6.4 3.3 6.2 4.2 ...
##  $ reb              : num  3.2 0.9 4.6 5 6 9 1.9 1.6 3.7 3.8 ...
##  $ ast              : num  2.1 1.7 1.3 2.8 0.9 7 1.1 0.5 2.9 0.4 ...
##  $ usg_pct          : num  0.142 0.164 0.133 0.256 0.19 0.205 0.175 0.174 0.107 0.152 ...
##  $ ts_pct           : num  0.619 0.542 0.531 0.531 0.554 0.621 0.558 0.553 0.534 0.62 ...
##  $ Dissimilarity    : num  0.954 1.183 1.042 2.304 1.265 ...

library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = player_name), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  scale_color_brewer(palette="Set1") +
  ylim(-2.2, 2.4) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

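Cluster 3 (green) has the highest averages on all five cluster variables, so the players in this cluster perform better than those in the other two clusters. Cluster 1 (red) has the lowest averages on points, rebounds, assists, and especially true shooting percentage.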

str(Clustering)

## List of 9
##  $ cluster     : int [1:200] 2 2 2 3 2 3 2 2 2 2 ...
##  $ centers     : num [1:3, 1:5] -1.005 -0.296 1.56 -1.009 -0.132 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:5] "pts" "reb" "ast" "usg_pct" ...
##  $ totss       : num 995
##  $ withinss    : num [1:3] 132 205 181
##  $ tot.withinss: num 518
##  $ betweenss   : num 477
##  $ size        : int [1:3] 29 128 43
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

data$Group <- Clustering$cluster
fit <- aov(cbind(pts, reb, ast, usg_pct, ts_pct) ~ as.factor(Group),
           data = data)
summary(fit)

##  Response pts :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 6277.6 3138.79  265.64 < 2.2e-16 ***
## Residuals        197 2327.7   11.82                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response reb :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 436.96 218.482  67.801 < 2.2e-16 ***
## Residuals        197 634.81   3.222                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response ast :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 428.01 214.007  99.591 < 2.2e-16 ***
## Residuals        197 423.33   2.149                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response usg_pct :
##                   Df  Sum Sq  Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.18026 0.090132   37.83 1.248e-14 ***
## Residuals        197 0.46936 0.002383                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response ts_pct :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 1.1432 0.57162   90.66 < 2.2e-16 ***
## Residuals        197 1.2421 0.00631                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

aggregate(data$gp,
          by = list(data$Group),
          FUN = mean)

##   Group.1        x
## 1       1 15.72414
## 2       2 51.01562
## 3       3 61.41860

I chose gp to check criterion validity because it was not used for clustering. The mean number of games played rises from cluster 1 (15.7) to cluster 2 (51.0) to cluster 3 (61.4).

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(data$gp, as.factor(data$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   2  5.5526 0.004509 **
##       197                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I used Levene's test to check homogeneity of variance. Since p = 0.005 < 0.05, we reject the null hypothesis and conclude that the variance of gp is not equal across the three clusters.
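As a sketch of what the test does with center = median: it is a one-way ANOVA on the absolute deviations of each observation from its own group median. The data below are simulated (hypothetical group means and spreads), so the numbers will not match the output above.

```r
set.seed(1)
# Hypothetical games-played values for three groups with different spread
toy <- data.frame(
  gp    = c(rnorm(30, 15, 5), rnorm(30, 50, 20), rnorm(30, 60, 10)),
  Group = factor(rep(1:3, each = 30))
)

# Absolute deviation of each value from its own group median
toy$abs_dev <- abs(toy$gp - ave(toy$gp, toy$Group, FUN = median))

# One-way ANOVA on the deviations; its F test is the median-centred Levene test
lev <- anova(lm(abs_dev ~ Group, data = toy))
lev
```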

library(dplyr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

data %>%
  group_by(Group) %>%
  shapiro_test(gp)

## # A tibble: 3 × 4
##   Group variable statistic           p
##   <int> <chr>        <dbl>       <dbl>
## 1     1 gp           0.857 0.00104    
## 2     2 gp           0.927 0.00000349 
## 3     3 gp           0.770 0.000000864

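The Shapiro-Wilk test checks normality within each cluster. All three p-values are below 0.01 (for example, p < 0.001 for cluster 2), so we reject the null hypothesis that gp is normally distributed. A non-parametric Kruskal-Wallis test is therefore needed.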
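The same per-group check can be sketched in base R with shapiro.test() and tapply(), without rstatix. The data are simulated (hypothetical distributions chosen for illustration), so the p-values differ from those above.

```r
set.seed(1)
# Hypothetical games-played values: two skewed groups and one symmetric group
toy <- data.frame(
  gp    = c(rexp(30, 1 / 15), rnorm(30, 50, 10), 82 - rexp(30, 1 / 10)),
  Group = factor(rep(1:3, each = 30))
)

# Shapiro-Wilk test within each group; H0: the values are normally distributed
p_values <- tapply(toy$gp, toy$Group, function(x) shapiro.test(x)$p.value)
round(p_values, 4)

# Groups in which normality is rejected at the 5% level
names(p_values)[p_values < 0.05]
```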

fit <- aov(gp ~ as.factor(Group),
           data = data)

summary(fit)

##                   Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(Group)   2  39113   19557   44.22 <2e-16 ***
## Residuals        197  87124     442                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

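A one-way ANOVA showed a significant difference in gp across clusters, F(2, 197) = 44.22, p < 0.001, indicating that at least one cluster mean differs. Because the normality and homogeneity-of-variance assumptions are violated, this result is reported only for comparison.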

kruskal.test(gp ~ Group,
             data = data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  gp by Group
## Kruskal-Wallis chi-squared = 53.227, df = 2, p-value = 2.767e-12

The Kruskal-Wallis test was used because it is a non-parametric alternative to ANOVA, which does not assume normality or equal variances. Given that Levene’s test indicated significant variance differences, Kruskal-Wallis was appropriate for comparing gp across clusters. The test result (χ²(2) = 53.227, p < 0.001) confirms a significant difference in gp distributions among the groups.

kruskal_effsize(gp ~ Group,
                data = data)

## # A tibble: 1 × 5
##   .y.       n effsize method  magnitude
## * <chr> <int>   <dbl> <chr>   <ord>    
## 1 gp      200   0.260 eta2[H] large

The effect size (η²[H] = 0.26) is large, meaning that the differences in gp between the clusters are substantial. Since gp was not used to form the clusters, this supports the criterion validity of the three-cluster solution.
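The effect size can be checked by hand. To my understanding, eta2[H] is computed as (H - k + 1) / (n - k), where H is the Kruskal-Wallis statistic, k the number of groups, and n the number of observations. The worked example below uses only numbers from the outputs above.

```r
H <- 53.227   # Kruskal-Wallis chi-squared from the test above
k <- 3        # number of clusters
n <- 200      # number of players

# Eta-squared based on the H statistic
eta2_H <- (H - k + 1) / (n - k)
round(eta2_H, 3)
```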

CONCLUSION

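Basketball players were grouped into three clusters based on five standardised performance metrics. Cluster 3 (43 players) consists of high-impact players with above-average values on all five metrics, most clearly in points, assists, and rebounds. Cluster 2 (128 players) represents role players: they are slightly below average in points, rebounds, assists, and usage, but have above-average shooting efficiency. Cluster 1 (29 players) includes players with limited statistical impact: they have the lowest points, rebounds, and assists and a far below-average true shooting percentage, although their usage is close to average. The clusters also differ in games played (means of 15.7, 51.0, and 61.4 for clusters 1, 2, and 3), a variable that was not used for clustering, which supports the validity of the grouping. These findings highlight meaningful differences in player profiles based on performance metrics.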


Nadina21

2025-01-27
