About me

Introduction

Data science and sports analytics have opened new doors for using machine learning and data mining techniques in basketball performance analysis. This study aims to explore the advanced measures of basketball performance in the National Basketball Association (NBA) using Principal Component Analysis (PCA) and Clustering. PCA identifies the most significant variables, while Clustering groups similar observations using unsupervised classification algorithms. In this project, we will analyze NBA player data using PCA and Clustering methods in R to evaluate player performance and help coaches make better decisions. Our results will be presented using graphs and tables to visualize the clusters’ impact on the data.

Problem statement

In this case study, we will examine NBA data to explore and answer two key questions. Specifically, we will seek to determine:

  1. Who are the best NBA players of all time, based on a variety of basketball skills rather than points alone?

  2. When NBA players possess skills that exceed their designated position, how can we help NBA coaches make better decisions and evaluate whether they are making the right choices?

Objective of the study

Data description

The data set contains the performances of the most important players in NBA history. It was collected from the NBA statistics website; the site and its glossary can be found here: The best players in the NBA

NBA Advanced Stats / All-Time Leaders.

This data was created by NBA.com to comprehensively describe the types of shots players take and how they score. The dataset includes all players who played at least one minute of game time. The exact columns are shown below, along with a sample of a few rows of data.

Importing the required libraries

#Remove all objects from the workspace
rm(list=ls()) 


# Import the necessary libraries for cleaning and analysis:

library(tidyverse)
library(GGally) # Visualization based on ggplot2
library(factoextra) # Visualization of the PCA based on ggplot2
library("data.table")
library(dendextend)
library(ggfortify)
library(ggrepel)
library(gridExtra)
library(knitr)
library(kableExtra)
library(mclust)
library(NbClust)
library(plyr)

Import and prepare data

To make data collection easier, here is the link to the data used in this work, hosted on Kaggle: Top NBA Players. We also install the kableExtra library for a clean rendering of the data set tables.

# Installation of the library
#install.packages("kableExtra")
# Loading libraries
library(knitr)
library(kableExtra)
# Import data
historical_players.df = read.csv(file = "nba.csv", header=T, sep=";")

We now see the structure of this data set:

glimpse(historical_players.df)
## Rows: 1,214
## Columns: 26
## $ player_id   <int> 893, 76375, 76127, 201142, 2544, 78497, 947, 77847, 76804,…
## $ player_name <chr> "Michael Jordan", "Wilt Chamberlain", "Elgin Baylor", "Kev…
## $ gp          <int> 1072, 1045, 846, 703, 1061, 932, 914, 792, 791, 1040, 1476…
## $ min         <dbl> 38.25653, 45.79809, 40.02719, 37.37980, 38.90009, 39.23927…
## $ fgm         <dbl> 11.37313, 12.13493, 10.27541, 9.19346, 9.82375, 9.67382, 9…
## $ fga         <dbl> 22.88899, 22.48517, 23.84279, 18.85349, 19.60697, 20.42060…
## $ fg_pct      <dbl> 0.497, 0.540, 0.431, 0.488, 0.501, 0.474, 0.425, 0.436, 0.…
## $ fg3m        <dbl> 0.54198, NA, NA, 1.79232, 1.38266, NA, 1.15864, NA, 0.0973…
## $ fg3a        <dbl> 1.65858, NA, NA, 4.72404, 4.04807, NA, 3.70131, NA, 0.3274…
## $ fg3_pct     <dbl> 0.327, NA, NA, 0.379, 0.342, NA, 0.313, NA, 0.297, NA, 0.2…
## $ ftm         <dbl> 6.83489, 5.79617, 6.81206, 7.01991, 6.10179, 7.68240, 6.97…
## $ fta         <dbl> 8.18284, 11.35120, 8.73641, 7.96017, 8.24882, 9.44313, 8.9…
## $ ft_pct      <dbl> 0.835, 0.511, 0.780, 0.882, 0.740, 0.814, 0.780, 0.761, 0.…
## $ oreb        <dbl> 1.55597, NA, NA, 0.78663, 1.21489, 0.96774, 0.81510, NA, 1…
## $ dreb        <dbl> 4.66791, NA, NA, 6.36984, 6.04807, 2.77419, 2.89825, NA, 3…
## $ reb         <dbl> 6.22388, 22.89378, 13.54965, 7.15647, 7.26296, 5.76824, 3.…
## $ ast         <dbl> 5.25466, 4.44306, 4.31442, 3.78805, 7.03205, 6.69313, 6.15…
## $ stl         <dbl> 2.34515, NA, NA, 1.19488, 1.64844, 2.61290, 2.16958, NA, 1…
## $ blk         <dbl> 0.83302, NA, NA, 1.04979, 0.77003, 0.74194, 0.17943, NA, 0…
## $ tov         <dbl> 2.72761, NA, NA, 3.15932, 3.41093, NA, 3.56893, NA, 3.0141…
## $ pf          <dbl> 2.59608, 1.98565, 3.06856, 1.89189, 1.86334, 2.61266, 1.94…
## $ pts         <dbl> 30.12313, 30.06603, 27.36288, 27.19915, 27.13195, 27.03004…
## $ ast_tov     <dbl> 1.92647, NA, NA, 1.19901, 2.06162, NA, 1.72410, NA, 0.9246…
## $ stl_tov     <dbl> 0.85978, NA, NA, 0.37821, 0.48328, NA, 0.60791, NA, 0.3912…
## $ efg_pct     <dbl> 0.50872, 0.53969, 0.43097, 0.53516, 0.53629, 0.47373, 0.45…
## $ ts_pct      <dbl> 0.56859, 0.54706, 0.49415, 0.60832, 0.58382, 0.54994, 0.51…

We can see that player_id and gp are integers, player_name is a character, and the remaining variables are real (double-precision) values.

head(historical_players.df) %>%
  kable("latex", booktabs = T,
        caption = "The first observations of our dataset.
        \\label{tab:tableau1}")%>%
  kable_styling(latex_options = c("striped", "scale_down","HOLD_position"))
tail(historical_players.df) %>%
  kable("latex", booktabs = T, 
        caption = "The last observations of our dataset.")%>%
  kable_styling(latex_options = c("striped", "scale_down","HOLD_position"))

Let us now briefly describe the data set. It consists of 1,214 observations of 26 variables, which is more than we need for this analysis, so we will perform a variable selection later on. The 26 available variables are:

Code: Designation (three pairs per row)
player_id: Player's id | fg3a: 3 Point Field Goals Attempted | efg_pct: Effective Field Goal Percentage
player_name: Player's name | oreb: Offensive Rebounds | ts_pct: True Shooting Percentage
gp: Games Played | dreb: Defensive Rebounds | fta: Free Throws Attempted
min: Minutes Played | reb: Rebounds | ft_pct: Free Throw Percentage
fgm: Field Goals Made | ast: Assists | fg3_pct: 3 Point Field Goal Percentage
fga: Field Goals Attempted | stl: Steals | pts: Points
fg_pct: Field Goal Percentage | blk: Blocks | ast_tov: Assist to Turnover Ratio
fg3m: 3 Point Field Goals Made | tov: Turnovers | stl_tov: Steal to Turnover Ratio
ftm: Free Throws Made | pf: Personal Fouls
Variables Description
player_id The identification number of each player.
player_name The name of each player.
gp The number of games played.
min The number of minutes played by a player or a team.
fgm The number of field goals a player or team has made.
fga The number of field goals a player or team has attempted.
fg_pct The percentage of shot attempts a player makes on goal. Its formula is: \(\frac{\mbox{fgm}}{\mbox{fga}}\)
fg3m The number of 3-point goals a player or team has made.
fg3a The number of 3-point field goals a player or team has attempted.
fg3_pct The percentage of 3-point field goal attempts a player makes. Its formula is: \(\frac{\mbox{fg3m}}{\mbox{fg3a}}\)
ftm The number of free throws a player or team has made.
fta The number of free throws a player or team has attempted.
ft_pct The percentage of free throw attempts that a player or team has made.
oreb The number of rebounds a player or team collected while on offense.
dreb The number of rebounds a player or team collected while on defense.
reb The total number of rebounds collected by a player or team while on the court.
ast The number of assists (passes that lead directly to a scored basket) made by a player.
stl The number of times a defensive player or team takes the ball away from an offensive player, causing a turnover.
blk A block occurs when an offensive player attempts a shot, and the defensive player knocks the ball away, blocking their chance to score.
tov A turnover occurs when the attacking player or team loses the ball to the defense.
pf The number of personal fouls committed by a player or a team.
pts The number of points scored.
ast_tov The number of assists for a player or team compared to the number of turnovers they committed.
stl_tov The number of steals by a player or team compared to the number of turnovers they committed.
efg_pct Measures field goal percentage while accounting for the fact that a 3-point field goal is worth 1.5 times as much as a 2-point field goal. Its formula is: \(\frac{\mbox{fgm}+0.5 \times \mbox{fg3m}}{\mbox{fga}}\)
ts_pct A shooting percentage that takes into account the value of three-point field goals and free throws in addition to conventional two-point field goals. Its formula is: \(\frac{\mbox{pts}}{2\times (\mbox{fga}+0.44\times\mbox{fta})}\)
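As a quick sanity check of these two formulas (a sketch, not part of the original analysis), we can recompute efg_pct and ts_pct from the raw columns and compare them with the values stored in the data set. The helper names efg_check and ts_check are ours, and missing fg3m values (players from before the three-point era) are treated as zero here:

# Sketch: recompute efg_pct and ts_pct from the raw columns and compare them
# with the stored values; small rounding differences are expected.
with(historical_players.df, {
  efg_check <- (fgm + 0.5 * ifelse(is.na(fg3m), 0, fg3m)) / fga
  ts_check  <- pts / (2 * (fga + 0.44 * fta))
  head(round(cbind(efg_pct, efg_check, ts_pct, ts_check), 3))
})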

Each variable was converted to a “per game” measure (e.g., points per game is the total number of points scored divided by the number of games played) so that players are comparable. From the data structure output and the first observations in the table (\(\ref{tab:tableau1}\)), it is clear that there are missing values, which we will deal with later.

Data cleaning

Missing values

We performed a missing value analysis and treatment of the data set:

# Indicate observed and missing values
#is.na(historical_players.df)

Identify the position of columns with at least one missing value:

# check for missing values on each column (variable)
which(colSums(is.na(historical_players.df))>0)
##    fg3m    fg3a fg3_pct    oreb    dreb     stl     blk     tov ast_tov stl_tov 
##       8       9      10      14      15      18      19      20      23      24

Count the number of missing values per column:

colSums(is.na(historical_players.df))
##   player_id player_name          gp         min         fgm         fga 
##           0           0           0           0           0           0 
##      fg_pct        fg3m        fg3a     fg3_pct         ftm         fta 
##           0         212         212         212           0           0 
##      ft_pct        oreb        dreb         reb         ast         stl 
##           0         119         119           0           0         119 
##         blk         tov          pf         pts     ast_tov     stl_tov 
##         119         187           0           0         187         187 
##     efg_pct      ts_pct 
##           0           0

To view missing values in a table:

kable(historical_players.df %>%
  select(everything()) %>%  
  summarise_all(~ sum(is.na(.))))%>%
  kable_styling(latex_options = c("striped", "scale_down","HOLD_position"))
Missing values per variable: fg3m 212, fg3a 212, fg3_pct 212, oreb 119, dreb 119, stl 119, blk 119, tov 187, ast_tov 187, stl_tov 187; all other columns 0.

We also visualize the missing values graphically: a histogram of the proportion of missing values per observation (row) and a bar plot of the proportion of missing values per variable (column), to see how the missing values are distributed:

hist(rowMeans(is.na(historical_players.df)),col = "Brown")

barplot(colMeans(is.na(historical_players.df)), las=2,col = "Brown",
        main = "Proportion of missing values per variable")

According to the missing values table above and the two plots, several columns contain missing values: fg3m, fg3a, and fg3_pct (212 each), oreb, dreb, stl, and blk (119 each), and tov, ast_tov, and stl_tov (187 each). There are several ways to deal with missing values, such as deletion, replacement by the mean, or linear interpolation, but choosing among them requires understanding why the values are missing. Here the affected statistics (three-pointers, offensive/defensive rebound splits, steals, blocks, turnovers) were simply not recorded for players from the earliest NBA seasons, so they cannot be filled in reliably. In our case, we will simply remove the rows containing “NA” values.
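For reference, here is a minimal sketch of the mean-imputation alternative mentioned above (it is not used in the rest of the analysis, which drops the incomplete rows; nba_imputed is a hypothetical name):

# Alternative (not used below): impute missing numeric values with the column mean
nba_imputed <- historical_players.df
num_cols <- sapply(nba_imputed, is.numeric)
nba_imputed[num_cols] <- lapply(nba_imputed[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
colSums(is.na(nba_imputed))  # all zero after imputation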

# Skip players whose information is NA in any variable
historical_players.df =
  historical_players.df[rowSums(is.na(historical_players.df)) == 0,]
dim(historical_players.df)
## [1] 1002   26

We now have 1,002 observations and 26 variables after removing the rows with missing values.

Outliers

Box plots (box-and-whisker diagrams) represent statistical data visually: they show how the distribution of each variable varies and reveal its general tendencies. We can now check for outliers in each variable of the cleaned data set using boxplots:

# We exclude the first two variables: one is an identifier, the other is qualitative
boxplot(historical_players.df[, -2:-1], las=2, col="darkblue")
# The same boxplots on the standardized (scaled) variables:
boxplot(scale(historical_players.df[, -2:-1]), las=2, col="darkblue")

The box plots of the 24 quantitative variables in our dataset show a number of outliers. Such values can have a significant impact on the results of the analysis, so understanding their cause is critical to avoid drawing incorrect conclusions. These outliers should be checked to ensure that they are not errors (they may be genuine observations) and to understand their impact on the analysis results.

We will identify outliers for each variable:

boxplot.stats(historical_players.df$gp)$out
##  [1] 1476 1560 1394 1392 1389 1462 1611 1380 1359 1504 1391 1424
boxplot.stats(historical_players.df$min)$out
## numeric(0)
boxplot.stats(historical_players.df$fgm)$out
##  [1] 11.37313  9.19346  9.82375  9.26368 10.17067  9.16531  9.27654 10.15192
##  [9]  9.57748  9.40274  9.38691  9.13350  8.95881  8.93462
boxplot.stats(historical_players.df$fga)$out
##  [1] 22.88899 19.60697 21.77899 19.90771 19.46508 20.10149 19.54098 19.32441
##  [9] 21.31459 20.35642
boxplot.stats(historical_players.df$fg_pct)$out
##  [1] 0.582 0.585 0.599 0.572 0.572 0.677 0.594 0.571 0.574 0.608 0.594 0.580
## [13] 0.584
boxplot.stats(historical_players.df$fg3m)$out
##  [1] 3.33972 2.24715 2.86207 2.28692 2.00223 2.18905 2.07927 2.11486 2.12950
## [10] 2.01591 1.98739
boxplot.stats(historical_players.df$fg3a)$out
## [1] 7.62892 6.18049 5.57428 6.83405 5.71462 5.46269 5.46748 5.51520 5.58813
boxplot.stats(historical_players.df$fg3_pct)$out
## [1] 0.667 1.000 1.000
boxplot.stats(historical_players.df$ftm)$out
##  [1] 6.83489 7.01991 6.10179 6.97484 5.74083 6.63076 6.22437 5.61546 5.89344
## [10] 7.15393 5.41641 4.91715 5.91038 5.94611 6.87154 5.91705 5.53045 5.09613
## [19] 5.54004 6.11449 5.84438 5.21558 6.41911 5.28134 5.32506 5.15115 5.47461
## [28] 5.17311 5.00112 5.11317 5.00182 4.89598 5.32420 4.83507 5.36463 4.89623
## [37] 4.98739 4.84129 5.56832
boxplot.stats(historical_players.df$fta)$out
##  [1] 8.18284 7.96017 8.24882 8.93654 6.80531 8.93496 7.43759 6.92551 7.24898
## [10] 8.74450 6.60182 9.32229 7.69508 7.22904 8.05041 8.05499 7.10413 7.01699
## [19] 7.52977 8.30902 7.81326 6.49275 8.34462 6.62396 7.06998 6.38719 6.45695
## [28] 6.69708 6.43499 6.57534 6.38231 8.65304 6.74053 6.34543 6.99523 6.77025
boxplot.stats(historical_players.df$ft_pct)$out
##  [1] 0.527 0.430 0.535 0.531 0.500 0.414 0.522 0.529 0.521 0.554 0.538 0.553
## [13] 0.498 0.485 0.478 0.499 0.528 0.458 0.444 0.494 0.551 0.486 0.527 0.467
## [25] 0.459
boxplot.stats(historical_players.df$oreb)$out
## [1] 3.97018 5.06471 4.75192 3.85164
boxplot.stats(historical_players.df$dreb)$out
##  [1] 7.72764 7.58192 8.04571 7.36537 7.72227 7.12694 7.84653 7.90554 7.51165
## [10] 7.48521 7.79978 7.04332 7.13394 8.06897 7.33577 8.38194 7.83379 9.77481
## [19] 9.11635 7.17492 7.14866 7.47425 8.16239 7.09831 7.68712 8.41624 7.14967
## [28] 7.09353 7.00000 8.36992
boxplot.stats(historical_players.df$reb)$out
##  [1] 11.17949 10.85253 11.69245 11.10501 10.81520 10.63526 12.49348 12.19865
##  [9] 10.84124 11.47917 13.63446 12.67191 10.51923 10.83231 13.99289 13.12184
boxplot.stats(historical_players.df$ast)$out
##  [1]  7.03205  6.82404  7.92365 11.19316  7.64894  9.25536  9.22000  7.39269
##  [9]  9.89329  9.13061  8.18339  6.71610  8.06982  7.21557  8.31510  6.73546
## [17]  7.01259  8.49219  6.71766  7.30073 10.50931  8.69231  8.06373  6.71390
## [25]  6.95474  8.50840  7.97377  7.43462  7.56580
boxplot.stats(historical_players.df$stl)$out
##  [1] 2.34515 2.16958 1.99278 2.03223 1.90287 1.90092 1.89694 2.29257 1.98545
## [10] 1.95840 2.63129 2.71117 2.21543 2.33408 2.17088 1.92955 2.09809 1.95577
## [19] 1.93973 1.93970

…etc.

All of these values correspond to real players and are achievable in practice, so we will not remove these outliers.
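Rather than inspecting boxplot.stats() variable by variable, a short sketch can count the boxplot outliers of every quantitative column in one pass (outlier_counts is a hypothetical name):

# Sketch: count boxplot outliers for every quantitative variable in one pass
outlier_counts <- sapply(historical_players.df[, -(1:2)],
                         function(x) length(boxplot.stats(x)$out))
sort(outlier_counts, decreasing = TRUE)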

Variable selection

After cleaning the data, we will select a subset of all the variables to continue the analysis. Nevertheless, the data cleaning performed in the previous section can be useful for further analysis of the whole data set.

nba = historical_players.df[,3:ncol(historical_players.df)]
nba = as.data.frame(sapply(nba, as.numeric ))
names = historical_players.df[,2]

dim(nba)
## [1] 1002   24
# The variables we will keep
names(nba)
##  [1] "gp"      "min"     "fgm"     "fga"     "fg_pct"  "fg3m"    "fg3a"   
##  [8] "fg3_pct" "ftm"     "fta"     "ft_pct"  "oreb"    "dreb"    "reb"    
## [15] "ast"     "stl"     "blk"     "tov"     "pf"      "pts"     "ast_tov"
## [22] "stl_tov" "efg_pct" "ts_pct"

We eliminate the first two variables, player_id and player_name, from our cleaned dataset, since the former carries no information for the analysis and the latter is qualitative; we keep the 24 quantitative variables.

Data analysis, exploration and visualization

In this section, we will conduct an exploratory analysis of the dataset constructed in the previous sections, using descriptive statistics and visualizations to better understand it.

Descriptive statistics

We can compute the minimum, first quartile, median, mean, third quartile, and maximum of all numeric variables in our data set using summary():

summary(nba)
##        gp              min              fgm               fga        
##  Min.   : 400.0   Min.   : 8.947   Min.   : 0.6232   Min.   : 1.531  
##  1st Qu.: 532.5   1st Qu.:20.489   1st Qu.: 2.7673   1st Qu.: 5.963  
##  Median : 682.5   Median :25.101   Median : 3.7753   Median : 8.220  
##  Mean   : 718.5   Mean   :25.163   Mean   : 4.0994   Mean   : 8.819  
##  3rd Qu.: 861.0   3rd Qu.:29.802   3rd Qu.: 5.2043   3rd Qu.:11.172  
##  Max.   :1611.0   Max.   :41.120   Max.   :11.3731   Max.   :22.889  
##      fg_pct            fg3m               fg3a            fg3_pct      
##  Min.   :0.3730   Min.   :0.000000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.4370   1st Qu.:0.006258   1st Qu.:0.04473   1st Qu.:0.1520  
##  Median :0.4610   Median :0.137780   Median :0.47784   Median :0.2945  
##  Mean   :0.4645   Mean   :0.430171   Mean   :1.22112   Mean   :0.2492  
##  3rd Qu.:0.4900   3rd Qu.:0.793380   3rd Qu.:2.20637   3rd Qu.:0.3510  
##  Max.   :0.6770   Max.   :3.339720   Max.   :7.62892   Max.   :1.0000  
##       ftm              fta             ft_pct            oreb       
##  Min.   :0.2141   Min.   :0.2441   Min.   :0.4140   Min.   :0.1456  
##  1st Qu.:1.1963   1st Qu.:1.6486   1st Qu.:0.7060   1st Qu.:0.6621  
##  Median :1.7713   Median :2.4028   Median :0.7630   Median :1.1722  
##  Mean   :2.0709   Mean   :2.7368   Mean   :0.7492   Mean   :1.3260  
##  3rd Qu.:2.6476   3rd Qu.:3.5213   3rd Qu.:0.8060   3rd Qu.:1.8559  
##  Max.   :7.1539   Max.   :9.3223   Max.   :0.9050   Max.   :5.0647  
##       dreb             reb              ast               stl        
##  Min.   :0.7673   Min.   : 1.007   Min.   : 0.1975   Min.   :0.1225  
##  1st Qu.:2.0071   1st Qu.: 2.781   1st Qu.: 1.1235   1st Qu.:0.5405  
##  Median :2.8147   Median : 4.031   Median : 1.9202   Median :0.7761  
##  Mean   :3.1508   Mean   : 4.492   Mean   : 2.4371   Mean   :0.8537  
##  3rd Qu.:3.9814   3rd Qu.: 5.833   3rd Qu.: 3.3430   3rd Qu.:1.0794  
##  Max.   :9.7748   Max.   :13.993   Max.   :11.1932   Max.   :2.7112  
##       blk               tov               pf              pts        
##  Min.   :0.00996   Min.   :0.1734   Min.   :0.7944   Min.   : 1.729  
##  1st Qu.:0.18856   1st Qu.:1.0697   1st Qu.:1.9307   1st Qu.: 7.144  
##  Median :0.34270   Median :1.5038   Median :2.2904   Median : 9.769  
##  Mean   :0.52666   Mean   :1.5942   Mean   :2.3360   Mean   :10.700  
##  3rd Qu.:0.69759   3rd Qu.:2.0263   3rd Qu.:2.7452   3rd Qu.:13.634  
##  Max.   :3.50171   Max.   :3.9222   Max.   :3.8665   Max.   :30.123  
##     ast_tov          stl_tov          efg_pct           ts_pct      
##  Min.   :0.2445   Min.   :0.1320   Min.   :0.4069   Min.   :0.4294  
##  1st Qu.:0.9114   1st Qu.:0.4061   1st Qu.:0.4675   1st Qu.:0.5105  
##  Median :1.3036   Median :0.5208   Median :0.4866   Median :0.5295  
##  Mean   :1.4672   Mean   :0.5600   Mean   :0.4885   Mean   :0.5307  
##  3rd Qu.:1.9264   3rd Qu.:0.6748   3rd Qu.:0.5066   3rd Qu.:0.5504  
##  Max.   :4.6936   Max.   :1.9345   Max.   :0.6771   Max.   :0.6433

If we need more descriptive statistics, we use stat.desc() from the {pastecs} package:

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:data.table':
## 
##     first, last
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## The following object is masked from 'package:tidyr':
## 
##     extract
stat.desc(nba) %>%
  kable("latex", booktabs = T, 
        caption = "Descriptive statistics.")%>%
  kable_styling(latex_options = c("striped", "scale_down","HOLD_position"))

According to the descriptive statistics, the maximum number of games played is 1,611 and the highest scoring average is 30.12 points per game. The rebounding leaders average up to 5.06 offensive and 9.77 defensive rebounds per game, and up to 13.99 total rebounds. The maximum number of personal fouls committed per game is 3.87, and the maximum playing time is 41.12 minutes per game.

Methodology

In this section, we will demonstrate that the selection of variables is not random but based on a study of the correlation between them. For this reason, we used the technique of principal component analysis (PCA) to reduce the dimension and keep only the relevant and information-rich variables.

Principal component analysis (PCA)

Correlation between variables

Let us examine the correlations between these variables to determine which axes to use when constructing and evaluating our PCA. The correlation graph between the different statistical categories will show whether there are interesting interactions between them; to do this, we will use the ggcorrplot library to visualize the correlation between each pair of variables. The Pearson correlation coefficient, often abbreviated \(r\) and also known as the bivariate correlation or product-moment correlation coefficient, measures the linear relationship between two variables; it is commonly denoted \(\rho\). For a pair of random variables \((X,Y)\), the formula for \(\rho\) is:

\[ \rho _{X,Y}=\frac { \mbox{cov(X,Y)}}{\sigma_{X}\sigma_{Y}}\]

where \(\mbox{cov}(X,Y)\) is the covariance, \(\sigma_{X}\) is the standard deviation of \(X\), and \(\sigma_{Y}\) is the standard deviation of \(Y\).
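As a small illustration (a sketch, using the pts and fgm columns of our prepared data), the coefficient can be computed directly from this definition and compared with R's built-in cor():

# Sketch: Pearson correlation from its definition versus the built-in cor()
x <- nba$pts; y <- nba$fgm
rho_manual <- cov(x, y) / (sd(x) * sd(y))
c(manual = rho_manual, builtin = cor(x, y))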

We examine the matrix and correlation graphs to see if there are any interesting relationships between the variables.

library(ggcorrplot)

# Calculate the correlation matrix
cor_matrix <- cor(nba)
cor_matrix %>%
  kable("latex", booktabs = T, 
        caption = "Correlation of NBA Stats.")%>%
  kable_styling(latex_options = c("striped", "scale_down","HOLD_position"))

The correlation table of the NBA statistics shows that most of the variables are correlated with each other, except for stl_tov and gp, which have low correlations with the rest.

The corrplot package allows plotting a correlation matrix with a confidence interval. It is also very efficient to visualize the details of the correlation between the variables on the graphs we use:

library(ggstatsplot)
library(RColorBrewer)
library(corrplot)
library(GGally)
# Create the correlation graph
corrplot(cor_matrix, type = "lower", order = "hclust",
         col = brewer.pal(n = 8, name = "RdBu"))

corrplot(cor_matrix, diag = FALSE, order = "FPC",
         tl.pos = "td", tl.cex = 0.5, method = "color", type = "upper")

Note that our input has dimension \(p = 24\), which implies \(p^2 = 576\) entries in the correlation matrix (276 distinct pairs of variables), far too many to inspect one by one. This is why we need an analytical tool to reduce the dimension. Indeed, the graphs show that some variables are strongly correlated, especially the most closely related ones (like ftm and fta).

We use the corr_cross() function to calculate all the correlations and return the highest and most significant ones in a graph:

# devtools::install_github("laresbernardo/lares")
library(lares)
Sys.unsetenv("LARES_FONT") # Temporal
corr_cross(nba, # name of the data set
  max_pvalue = 0.05, # have only significant correlations (at the 5% level)
  top = 10 # display the first 10 pairs of variables (by correlation coefficient)
)

The cross-correlation plot above shows that the first 10 pairs of variables are positively correlated at the \(5\%\) significance level. These pairs are: (fg3m, fg3a); (fgm, pts); (fgm, fga); (ftm, fta); (fga, pts); (dreb, reb); (oreb, reb); (ftm, pts); (efg_pct, ts_pct); (min, pts). Finally, the exploratory analysis of our data shows that the variables are positively correlated with each other.

Objective of the PCA

The objective of PCA is to simplify a data set by grouping the information contained in many variables into a smaller number of variables. Each variable in a dataset corresponds to a dimension of space, so if the data has three variables, it forms a 3-dimensional space. PCA summarizes the information contained in the variables by sequentially identifying the directions (or “components”) that minimize the average distance between the observations and that direction. Each direction is a linear combination of the variables; for example, in a 3-variable data set, a direction would be expressed as \(a\,\mbox{var}_1 + b\,\mbox{var}_2 + c\,\mbox{var}_3\). Therefore, if one component summarizes the data set accurately by minimizing the mean square distance sufficiently, we can reduce each data point to a single value by plugging its values for the three variables into the linear combination, turning a 3-dimensional point into a 1-dimensional projection.

PCA is a technique used to reduce the dimensions of a data set by finding the principal component that minimizes the mean square distance and using a linear combination of the variables to reduce all data points to a single value. If a single component is insufficient, PCA continues to solve the same minimization problem among all components orthogonal to the previous one. This process continues until all the data are captured, which occurs when the number of components equals the number of variables. The objective of PCA is to summarize the data with a minimum of measures.
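To make this concrete, here is a short sketch (not part of the original pipeline) showing that the same principal directions can be obtained from the eigendecomposition of the correlation matrix: the eigenvalues are the variances of the components and the eigenvectors are the loadings, identical to what prcomp() returns up to sign.

# Sketch: PCA via eigendecomposition of the correlation matrix
eig <- eigen(cor(nba))
head(eig$values, 3)                                   # variances of the first components
head(abs(eig$vectors[, 1]))                           # |loadings| on the first component
head(abs(prcomp(nba, scale. = TRUE)$rotation[, 1]))   # matches the line above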

After selecting the numerical variables as raw data, they must be centered and normalized to have identical units for our PCA analysis; we want to use as few components as possible to explain most of the variance. Using too few components may not accurately account for the entire data set ( imagine your data points are far away from your first direction: using that one component to summarize your data set would lose a significant amount of information), but using too many components would result in overfitting the data. Here is the amount of information captured by each additional principal component:

# Scale/standardize the features so they are all on the same scale
# (mean 0, standard deviation 1):
pca = prcomp(nba, scale=T)
# prcomp() uses the SVD; an equivalent result can be obtained from the
# eigendecomposition of the correlation matrix with:
# pca = princomp(nba, cor=T)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.8682 2.5256 1.51142 1.23009 1.1645 0.88156 0.84972
## Proportion of Variance 0.3428 0.2658 0.09518 0.06305 0.0565 0.03238 0.03008
## Cumulative Proportion  0.3428 0.6085 0.70373 0.76677 0.8233 0.85566 0.88574
##                            PC8     PC9    PC10   PC11   PC12    PC13    PC14
## Standard deviation     0.72053 0.70570 0.65472 0.6500 0.6060 0.41614 0.30767
## Proportion of Variance 0.02163 0.02075 0.01786 0.0176 0.0153 0.00722 0.00394
## Cumulative Proportion  0.90737 0.92812 0.94598 0.9636 0.9789 0.98611 0.99005
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.30271 0.23671 0.19467 0.17058 0.11126 0.06634 0.05948
## Proportion of Variance 0.00382 0.00233 0.00158 0.00121 0.00052 0.00018 0.00015
## Cumulative Proportion  0.99387 0.99620 0.99778 0.99899 0.99951 0.99969 0.99984
##                           PC22    PC23      PC24
## Standard deviation     0.04715 0.03983 1.207e-06
## Proportion of Variance 0.00009 0.00007 0.000e+00
## Cumulative Proportion  0.99993 1.00000 1.000e+00

Eigenvalues and eigenvectors are two different concepts in PCA analysis. Eigenvalues are numbers that indicate the importance of each principal component, while eigenvectors are vectors that describe the direction of the principal components. Eigenvalues are often expressed as a percentage to indicate the proportion of total variance explained by each principal component. The contributions of the variables to the principal components are calculated from the eigenvectors and can be used to interpret the contribution of each variable to each principal component.
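As a complement to summary(pca), the eigenvalues together with the percentage and cumulative percentage of variance explained can be extracted in a single table with factoextra (a sketch):

# Sketch: eigenvalues and (cumulative) percentage of explained variance
get_eigenvalue(pca) %>% head(5)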

p1_ess <- fviz_eig(pca, addlabels = TRUE, choice='eigenvalue',
                   main = "Eigenvalues of the explained inertia")
p2_ess <- fviz_eig(pca, addlabels = TRUE, main="Proportion of variances")
grid.arrange(p1_ess, p2_ess, nrow=1,name = "arrange")

Three components explain \(70.37\%\) of the variance in the data, and a fourth direction adds only another \(6.3\%\). Since the fourth component does not add much to our understanding, we will use the first three components to transform our 24-dimensional data set into a 3-dimensional one. In addition, we extracted the eigenvalues of the principal components (left plot above) and visualized the percentage of variance explained (i.e., the information) by the first 10 dimensions (right plot). We can therefore project the data onto these dimensions.
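A minimal sketch of this reduction keeps only the first three component scores as the new representation (nba_pca3 is a hypothetical name):

# Sketch: 3-dimensional representation of the 24 original variables
nba_pca3 <- data.frame(player = names, pca$x[, 1:3])
dim(nba_pca3)   # 1002 rows: one name column plus the first three component scores
head(nba_pca3)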

Note that the sum of the squares of the entries of each eigenvector (the loadings) equals 1, because they are normalised vectors (unit norm).

sum(pca$rotation[,1]^2)
## [1] 1

This makes the eigenvectors easier to interpret than the eigenvalues: their squared entries lie between 0 and 1 and sum to one, so they can be read as shares. In the following section, we will plot these quantities; they are called the contributions of the variables to the components.

The dimensions of the multivariate data are reduced through PCA, which can be represented visually with little loss of information. The three graphs below will be studied more thoroughly as they provide complete information on particular variables. To do this, we use the function fviz_pca_var :

fviz_pca_var(pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid overlapping texts
             ,title= "Cos 2 of the variables NBA Stats - PCA"
             )

On the projection plane of the PCA of the variables above, we can observe the relationships between the statistical variables of the NBA players. The two dimensions represented on the graph explain \(34.3\%\) and \(26.6\%\) of the total variance of the data, respectively. It is clear that the variables min, pts, fgm, fga, tov, ast, ftm, and fta are negatively correlated with the first dimension, while the variables oreb, dreb, reb, and fg_pct are positively correlated with the second dimension. The distance of a variable from the origin measures how well that variable is represented on the principal axes: a high cos2 indicates a good representation of the variable on these axes, whereas a low cos2 indicates that the principal axes do not represent the variable well.
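The same quality-of-representation information can be drawn as a bar chart with factoextra (a sketch):

# Sketch: cos2 (quality of representation) of each variable on the first two components
fviz_cos2(pca, choice = "var", axes = 1:2)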

The contributions of the variables in a specific principal axis are expressed as a percentage. The larger the value of the contribution, the more the variable contributes to the principal component in question.

  • Variables correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.
  • Variables that are not correlated with any axis or that are correlated with the last few axes are low-contribution variables and can be removed to simplify the analysis in question.
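Concretely, for this standardized PCA a variable's contribution to a component is its squared loading expressed as a percentage; the short sketch below reproduces by hand the values that fviz_contrib() plots in the next chunk (contrib_pc1 is a hypothetical name):

# Sketch: contributions of the variables to PC1 computed from the loadings
contrib_pc1 <- 100 * pca$rotation[, 1]^2 / sum(pca$rotation[, 1]^2)
head(sort(contrib_pc1, decreasing = TRUE))
head(sort(get_pca_var(pca)$contrib[, 1], decreasing = TRUE))  # same values via factoextra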

To create a bar graph of the contribution of variables, the function fviz_contrib() can be used as follows:

# Contributions of variables to PC1
c1 <- fviz_contrib(pca, "var", axes=1)
# Contributions of variables to PC2
c2 <- fviz_contrib(pca, "var", axes=2)
# Contributions of variables to PC3
c3 <- fviz_contrib(pca, "var", axes=3)
# Contributions of variables to PC4
c4 <- fviz_contrib(pca, "var", axes=4)
# Combine the contributions of the variables to PC1, PC2, PC3, and PC4.
grid.arrange(c1, c2, c3, c4, top='Contribution of NBA stats variables to the principal components')

We can visualize the 4 principal components using barplot:

barplot(pca$rotation[,1], las=2, col="darkblue", main = "1st component (Dim-1)")
barplot(pca$rotation[,2], las=2, col="darkblue", main = "2nd component (Dim-2)")

barplot(pca$rotation[,3], las=2, col="darkblue", main = "3rd component (Dim-3)")
barplot(pca$rotation[,4], las=2, col="darkblue", main = "4th component (Dim-4)")

The results show that the variables pts, fgm, fta, min, ftm, and fga are the main contributors to the first factor. For the second factor, the main contributors are oreb, ast_tov, fg3a, fg3_pct, fg3m, and ft_pct. The third factor is dominated by efg_pct and ts_pct. Finally, the variables contributing most to the last component are fg_pct, fg3a, fg3m, and ast_tov. We can also see that fg3m, fg3a, and fg_pct contribute most to dimensions 2, 3, and 4. The red dashed line in the graphs above indicates the expected average contribution: if the contributions were uniform, the expected value would be \(\frac{1}{\mbox{length(variables)}} = \frac{1}{24} \approx 4.17\%\).

We can now rank the players according to their first-PC score to find the best historical players in terms of performance:

calculateScore = function(data) {
  return(sum((pca$rotation[, 1]*data)^2))
}
historical_players.df$player_name[sort.int(apply(nba, 1, calculateScore), decreasing = T, 
                                           index.return = T)$ix[1:10]]
##  [1] "Robert Parish"       "Kareem Abdul-Jabbar" "John Stockton"      
##  [4] "Karl Malone"         "Kevin Garnett"       "Kevin Willis"       
##  [7] "Dirk Nowitzki"       "Tim Duncan"          "Jason Kidd"         
## [10] "Reggie Miller"

Perhaps we can get more information by ranking the players using the second component:

calculateScore = function(data) {
  return(sum((pca$rotation[, 2]*data)^2))
}
nba$gp[sort.int(apply(nba, 1, calculateScore), decreasing = T, index.return = T)$ix[1:10]]
##  [1] 1611 1560 1504 1476 1462 1424 1392 1394 1391 1389

Let’s look at the contribution of each player to the components. For the \(i\)-th player and the first component, the contribution is \(z^2_{1,i}/\lambda_1/n\), a number between 0 and 1.

head(get_pca_ind(pca)$contrib[,1]) # This is in %, i.e. between 0 and 100
##         1         2         3         4         5         6 
## 1.0588322 0.8509603 0.9762763 0.7415180 0.5860088 1.0777332
head((pca$x[,1]^2)/(pca$sdev[1]^2))/dim(nba)[1] # This number is between 0 and 1
## [1] 0.010588322 0.008509603 0.009762763 0.007415180 0.005860088 0.010777332

We will now look at the contributions of all players to the first component and list the names of the top contributors:

#  It is similar to names[order(pca$x[,1])][1:10] but in percentage
names[order(get_pca_ind(pca)$contrib[,1],decreasing=T)][1:10]
##  [1] "Karl Malone"         "Michael Jordan"      "Charles Barkley"    
##  [4] "LeBron James"        "Shaquille O'Neal"    "Kareem Abdul-Jabbar"
##  [7] "Kevin Durant"        "Hakeem Olajuwon"     "Larry Bird"         
## [10] "David Robinson"

Finally, to see the top 20 players in terms of contributions on the 4 main components:

names_z1 = names[order(get_pca_ind(pca)$contrib[,1],decreasing=T)]
fviz_contrib(pca, choice = "ind", axes = 1, top=20)+scale_x_discrete(labels=names_z1)
names_z2 = names[order(get_pca_ind(pca)$contrib[,2],decreasing=T)]
fviz_contrib(pca, choice = "ind", axes = 2, top=20)+scale_x_discrete(labels=names_z2)

names_z3 = names[order(get_pca_ind(pca)$contrib[,3],decreasing=T)]
fviz_contrib(pca, choice = "ind", axes = 3, top=20)+scale_x_discrete(labels=names_z3)
names_z4 = names[order(get_pca_ind(pca)$contrib[,4],decreasing=T)]
fviz_contrib(pca, choice = "ind", axes = 4, top=20)+scale_x_discrete(labels=names_z4)

We find that all of the 20 selected players contributed strongly to the four main components of our study. In particular, the players who contributed the most were Karl Malone, Michael Jordan, Dwight Howard, DeAndre Jordan, Steve Novak, John Stockton and Maurice Cheeks. Thus, based on these results, we can say that the best NBA players of all time are those with a wide variety of basketball skills.

We will use the Biplot to project the observations and the variables in the same plane (using the first two components). For an elegant visualisation of the individuals and variables on the PCA plane, we will use the following function fviz_pca_biplot:

fviz_pca_biplot(pca, repel = TRUE,col.var = "contrib", title = "Individuals and variables - PCA")
## Warning: ggrepel: 918 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Recall that for the \(j\)-th principal component, \(Z_j = X a_j\), where \(a_j\) is the corresponding eigenvector (the loadings) and \(Z_j\) contains the scores (factors) of that component. We will plot the first two scores (i.e., PC1 and PC2), using colors for minutes played (min) and games played (gp):
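A quick sketch (not in the original code) verifies this relationship: multiplying the standardized data by the loadings reproduces the scores stored in pca$x.

# Sketch: the scores in pca$x equal the standardized data times the loadings (Z = X a)
Z_manual <- scale(nba) %*% pca$rotation
all.equal(unname(Z_manual), unname(pca$x))  # TRUE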

data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=names,color=nba$min)) + geom_point(size=0) +
  labs(title="Player ranking by minutes played on the PCA", x="Dim1 (34.3%)", 
       y="Dim2 (26.6%)") +
  theme_bw() + scale_color_gradient(low="lightblue", high="darkblue")+
  theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0,
                                              check_overlap = TRUE)+
  guides(color = guide_legend(title = "Minutes played"))

data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=names,color=nba$gp)) + geom_point(size=0) +
  labs(title="Ranking of players per game played on the PCA",x="Dim1 (34.3%)",
       y="Dim2 (26.6%)") +
  theme_bw() + scale_color_gradient(low="green", high="black")+
  theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0,
                                              check_overlap = TRUE)+
  guides(color = guide_legend(title = "Games played"))

We can see here how the players are distributed. The best NBA players have a very high PC1 score and appear on the right. Strong rebounders have a very high PC2 score and appear at the top. There are also position-based differences and similarities between players in terms of their overall statistics; for example, Magic Johnson and Michael Jordan are close to each other, showing that famous players can be grouped together. So we can rank any player based on these two PCA graphs and draw distinctions. The players at the top right are dominant inside players and rebounders, such as DeAndre Jordan, Dwight Howard, Shaquille O'Neal, Artis Gilmore, Dikembe Mutombo, Moses Malone, Alonzo Mourning, and Kareem Abdul-Jabbar. The players at the bottom right are perimeter scorers and playmakers, such as Stephen Curry, Chris Paul, Allen Iverson, Gilbert Arenas, James Harden, Kobe Bryant, Russell Westbrook, and Michael Jordan.

Clustering

In the second part of this project, we will use the two clustering methods, namely hierarchical clustering and K-means, on our NBA player performance database to answer the second question: how to help NBA coaches make more relevant decisions and evaluate whether player choices are the right ones. Both methods are unsupervised learning because we are looking for a new set of labels (clusters) for our data. It would be possible to group players into different groups based on their performance, such as players who played more minutes and games, were influential on defence, etc. This could be useful for statistical analysis and decision-making in-game strategy and team composition.

Hierarchical clustering

Hierarchical clustering is a method of grouping data based on the proximity of elements to each other. It creates a data clustering tree, called a dendrogram, where each node represents a group of data and each branch represents the proximity between these groups. The process we will follow for this method is as follows:

  • We will calculate the Euclidean distance matrix for the variables in our dataset.

  • We will attempt hierarchical clustering using the nearest neighbour and centroid clustering methods and then Ward’s minimum variance method.

The first clustering method we will try is hierarchical clustering. Dendrograms (tree diagrams) can help visualise the separation of NBA players. Clusters of players are determined using the distances between observations, trying to maximise the space between groups and minimise the distance within groups. The most commonly used distance formula is the Euclidean distance:

\[\mbox{distance}(x,y)=\sqrt{\sum_{i=1}^n(x_i-y_i)^2}\] We standardize the quantitative NBA statistical variables using the z-score formula:

\[x_{new}=\frac{x_i-\bar{x}}{\sigma_x}\]
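As a small check (a sketch), we can z-score the variables and compute the Euclidean distance between the first two players by hand, then compare it with dist(); note that the clustering code below follows the original analysis and computes distances on the unscaled variables.

# Sketch: manual Euclidean distance on z-scored data versus dist()
nba_scaled <- scale(nba)
d_manual <- sqrt(sum((nba_scaled[1, ] - nba_scaled[2, ])^2))
d_dist   <- as.matrix(dist(nba_scaled, method = "euclidean"))[1, 2]
c(manual = d_manual, dist = d_dist)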

# Compute the Euclidean distance matrix for the features :
nba_dist <- dist(nba, method = 'euclidean')
  
# Try clustering by single link, centroid, and ward (D2):
hcl_single <- hclust(d = nba_dist, method = 'single')
hcl_centroid <- hclust(d = nba_dist, method = 'centroid')
hcl_ward <-  hclust(d = nba_dist, method = 'ward.D2')
# Nearest neighbor method:
plot(hcl_single, hang = -1, main = 'Single linkage', xlab = '', sub = '')

# Centroid linkage:
plot(hcl_centroid, hang = -1, main = 'Centroid linkage', xlab = '',  sub = '')

# Ward's minimum variance method,
# with squared dissimilarities before clustering :
dend <- as.dendrogram(hcl_ward)
hcl_k <- 4
dend_col <- color_branches(dend, k = hcl_k)
plot(dend_col, main = paste0('Ward linkage (D2): K = ', hcl_k))

Ward’s dendrogram appears to be the best of the three because of the clear separation between clusters. The single-linkage and centroid-linkage methods do not separate the groups clearly and produce many sparsely populated clusters. Since Ward’s dendrogram is the best of the three, we will study its partitions into three and four clusters, which provide the best level of separation.

Let’s look at the distribution of players in each cluster for the solutions \(k = 3\) and \(k = 4\) :

# Add the cluster assignments for the solutions k = 3 and k = 4 for the model  
# Hierarchical Clustering method to the main data set:
historical_players.df$hcl_ward_labs_three <- cutree(hcl_ward, k = 3)
historical_players.df$hcl_ward_labs_four <- cutree(hcl_ward, k = 4)

# Organize in a data frame:
hcl_df_three <- data.frame(table(historical_players.df$hcl_ward_labs_three))
hcl_df_four <- data.frame(table(historical_players.df$hcl_ward_labs_four))
col_names <- c('Clusters', 'Count')
colnames(hcl_df_three) <- col_names
colnames(hcl_df_four) <- col_names

hcl_df_three[4,] <- NA
df_clusters <- data.table(hcl_df_four$Clusters, hcl_df_three$Count, hcl_df_four$Count) 

#Change column names
df_clusters <- df_clusters %>% 
    dplyr::rename(Clusters = V1, 'k = 3' = V2, 'k = 4' = V3)
df_clusters %>%
  kable("latex", booktabs = T, 
        caption = "The clusters of the hierarchical clustering method")%>%
  kable_styling(latex_options = c("striped","HOLD_position"),font_size = 20)

The distribution of players between clusters is uneven in both solutions, with the fewest players in the 3rd and 4th clusters for \(k = 3\) and \(k = 4\) clusters, respectively.

We will create a function to plot the PCA and label the points:

plot_pca <- function(object, frame = FALSE, x = 1, y = 2, 
                     data, colour, title, label) {
  # Plots the observations in the PCA space
  # object = PCA or K-Means object (kept for reference; the global `pca` is plotted)
  # x = which PC for the x-axis (1, 2, 3, etc.)
  # y = which PC for the y-axis (1, 2, 3, etc.)
  # data = underlying data
  p <- autoplot(pca, x = x, y = y, data = historical_players.df, colour = colour, 
                frame = frame) + 
        ggtitle(title) + 
        # center the title
        theme(plot.title = element_text(hjust = 0.5)) + 
        geom_label_repel(aes(label = label),
                        box.padding   = 0.35, 
                        point.padding = 0.5,
                        segment.color = 'grey50')
  return(p)
}
# Factor the labels into the data set:
historical_players.df$hcl_ward_three <- factor(cutree(hcl_ward, k = 3))
historical_players.df$hcl_ward_four <- factor(cutree(hcl_ward, k = 4))

# Include some player names in the graph:
hcl_labels <- ifelse(historical_players.df$min >= 36 | historical_players.df$min <= 2.5 |
                     (historical_players.df$min >= 28.8 & historical_players.df$min <= 29) | 
                     (historical_players.df$min >= 25 & historical_players.df$min <= 25.2),
                     as.character(historical_players.df$player_name), '' )

# cluster label columns to loop over
hcl_labs <- names(historical_players.df %>% select(tail(names(.), 2)))
hcl_ks <- c(3, 4)

# Plot the hclust labels overlaid on the PCA
for (i in seq_along(hcl_labs)) {
  p <- plot_pca(pca, frame = TRUE, 
                data = historical_players.df, colour = hcl_labs[i],
                title = paste0('PCA: ', hcl_ks[i], ' clusters (Hclust)'),
                label = hcl_labels
  )

  print(p)
}
## Warning: ggrepel: 46 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 46 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

When viewing the clusters in PC space, we can see a clear separation between the 3- and 4-cluster solutions, with some overlapping points. The hierarchical clustering graphs for \(k=3\) and \(k=4\) indicate which types of players to look at more closely to help the coach make the right decision. The labelled players Allen Iverson, Karl Malone, Charles Barkley, and Kareem Abdul-Jabbar, who average more than 36 minutes per game, often start games and score many points, which makes them ideal players for an NBA team (labels were also shown for players averaging at most 2.5 minutes, or between 28.8 and 29, or between 25 and 25.2 minutes per game).

K-Means Clustering

K-means is a clustering method that partitions the data around k centers of gravity: each observation is assigned to the group whose center of gravity is closest. This method works well for data with a clearly defined global structure and requires knowing the desired number of groups (k) in advance. In our case, we used K-means with a 4-cluster solution, chosen with the Silhouette index, to evaluate the separation between clusters. The graphs below show:

  1. The Silhouette index for the K-Means method.
  2. Cluster assignments in the PC space.
  3. The distribution of players between clusters.

The following three functions visualize, as a function of the number of clusters, the total within-cluster sum of squares (wss), the between-cluster sum of squares (bss), and the ratio of the between-cluster to the total sum of squares (betweenss/totss):

wss <- function(data, maxCluster = 10) {
  # Total within-cluster sum of squares for k = 1..maxCluster
  set.seed(50)
  SSw <- vector()
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))  # k = 1: total sum of squares
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$tot.withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of clusters", 
       ylab = "Total within sum of squares", pch=19)
}

bss <- function(data, maxCluster = 10) {
  # Between-cluster sum of squares for k = 1..maxCluster
  set.seed(50)
  SSb <- vector()
  SSb[1] <- 0  # k = 1: no separation between clusters
  for (i in 2:maxCluster) {
    SSb[i] <- sum(kmeans(data, centers = i)$betweenss)
  }
  plot(1:maxCluster, SSb, type = "o", xlab = "Number of clusters",
       ylab = "Between sum of squares", pch=19)
}

btratio <- function(data, maxCluster = 10) {
  # Ratio between-SS / total-SS for k = 1..maxCluster
  set.seed(50)
  ratio <- vector()
  ratio[1] <- 0
  for (i in 2:maxCluster) {
    km <- kmeans(data, centers = i)
    ratio[i] <- km$betweenss / km$totss
  }
  plot(1:maxCluster, ratio, type = "o", xlab = "Number of clusters",
       ylab = "Ratio Between_SS/Total_SS", pch=19)
}
# Determine the optimal k
wss(nba)
bss(nba)
btratio(nba)

The graphs indicate that \(k = 4\) is the optimal number of clusters. Beyond \(k=4\), increasing \(k\) neither reduces the within-cluster sum of squares substantially (strong internal cohesion) nor meaningfully increases the between-cluster sum of squares or the between-to-total ratio (maximum external separation).
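The same elbow diagnostic can be drawn in a single call with factoextra (a sketch, offered as an alternative to the hand-written functions above):

# Alternative sketch: elbow (within-SS) plot with factoextra
fviz_nbclust(nba, kmeans, method = "wss", k.max = 10)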

We will determine the optimal number of clusters using the Silhouette index via the NbClust function, and then plot the Silhouette index as a function of the number of clusters:

cluster_sizes_km_s <- NbClust(data = nba,
                          #  it will probably be more difficult to interpret clusters beyond 
                          #this number
                          max.nc = 6,
                          method = 'kmeans',
                          index = 'silhouette')

plot(names(cluster_sizes_km_s$All.index),
     cluster_sizes_km_s$All.index,
     main = 'Silhouette index: K-means',
     type = 'l',
     xlab = "Cluster size (Number)",
     ylab = "Index value")

The Silhouette index plot also shows that 4 is the optimal number of k. The characteristics of each cluster can also be visualized by combining clustering and PCA in the following graph.

# Compute the k-means for k = 4 clusters :
km_k <- 4
km_four <- kmeans(x = nba,
            centers = km_k,
            nstart = 100,
            algorithm = 'Hartigan-Wong')

# Add cluster assignments to the original dataset:
historical_players.df$km_labs_four <- factor(km_four$cluster)

# Swap labels 3 and 4 so that label 4 is the best player:
#nba$km_labs_four <- factor(ifelse(nba$km_labs_four == 4, 3, 
#                          ifelse(nba$km_labs_four == 3, 4, 
 #                         nba$km_labs_four)))

# Plotting K-Means clusters in PC space:

# Add player labels to the chart - players who have played more than 36 min
#per game or less than 3 min per game:
km_labels <- ifelse(historical_players.df$min >= 36 | historical_players.df$min <= 3, 
                         as.character(historical_players.df$player_name), '' )
# clustering visualization
plot_pca(km_four, data = historical_players.df, frame = TRUE, colour = 'km_labs_four',
          title = paste0('PCA: ', km_k, ' clusters (K-means)'),
          label = km_labels 
         #leg_title = 'Clusters'
         )
## Warning: ggrepel: 24 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

This profiling graph shows the following findings on players playing more than 36 minutes or less than 3 minutes per game:

  • Cluster 1: is dominated by Kareem Abdul-Jabbar and features mostly players who have played more than 36 minutes or fewer than 3 minutes per game.
  • Cluster 2: is dominated by Charles Barkley and features fewer players who have played more than 36 minutes or fewer than 3 minutes per game.
  • Cluster 3: is dominated by Karl Malone and has about the same number of players who played more than 36 minutes or fewer than 3 minutes per game.
  • Cluster 4: is dominated by Allen Iverson and has mostly players who have played more than 36 minutes or fewer than 3 minutes per game.
km_labs <- data.frame(table(historical_players.df$km_labs_four))
colnames(km_labs) <- col_names

# Display the table
km_labs %>%
  kable("latex", booktabs = T, 
        caption = "The number of players in each cluster according to the k-means method")%>%
  kable_styling(latex_options = c("striped","HOLD_position"),font_size = 20)
nba_km_avg <- historical_players.df %>%
  select(km_labs_four, min, pts, oreb, dreb,
         ast, blk, stl) %>%
  group_by(km_labs_four) %>% 
  summarise_all(list(mean))

nba_km_avg %>%
  kable("latex", booktabs = T, 
        caption = "Average stats of NBA players in the clusters of the k-mean method")%>%
  kable_styling(latex_options = c("striped", "scale_down","HOLD_position"))

The table above shows the means of the original variables (min, pts, oreb, dreb, ast, blk, stl) for each player grouping created. We reduced the data to fewer dimensions to obtain a clearer grouping, but we retain all of the original information, so the player groupings can be understood in terms of familiar basketball statistics.

Cross-tabulation of label evaluations to compare hierarchical and K-means methods:

xtab_hcl_km <- xtabs(~historical_players.df$hcl_ward_four + historical_players.df$km_labs_four)

xtab_hcl_km %>%
  kable("latex", booktabs = T, 
        caption = "Cross-tabulation between the hierarchical and K-means methods")%>%
  kable_styling(latex_options = c("striped","HOLD_position"),font_size = 20)

According to the table above, the K-Means method with four clusters seems to be the optimal solution, as it produced a clearer separation of players based on their overall skills.
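As a supplementary check (a sketch using the mclust package, which is already loaded), the agreement between the two partitions can be quantified with the adjusted Rand index, where 1 means identical partitions and 0 means agreement no better than chance:

# Sketch: adjusted Rand index between the hierarchical and K-means labelings
adjustedRandIndex(historical_players.df$hcl_ward_four,
                  historical_players.df$km_labs_four)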

Results, discussion and conclusion

In conclusion, we presented an unsupervised learning analysis of NBA player statistics using PCA for dimensionality reduction together with hierarchical and K-means clustering. We can summarize that the best NBA players of all time are those who combine a wide variety of basketball skills rather than scoring alone, and that the four-cluster K-means solution gives the clearest separation of player profiles, which can help coaches evaluate players beyond their designated positions.



  1. Ph.D. student in Time Series Forecasting, LMSA, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco.