Introduction
Description of the Dataset
Data Preparation
Principal Component Analysis
Clustering of PCA Results from old and modern seasons
Market Basket Analysis
Conclusions
References
Each and every day, enormous amounts of data are being collected. One may think that the more data we have, the better the models we can build. However, working with many variables can actually cause significant problems for machine learning techniques such as clustering. This issue even has its own name: the Curse of Dimensionality. Too many variables may introduce noise into the data, making it harder to work with, and the computation time of the algorithms increases as well. Therefore, it is important to choose the right tools for the job.
The goal of this analysis is to use unsupervised learning techniques, namely Principal Component Analysis (PCA), Clustering and Market Basket Analysis (MBA), to obtain valuable insights about NBA players. PCA reduces the number of dimensions of the data while retaining as much variance as possible. The task of clustering is to group data points that are similar to each other and different from the rest. MBA mines association rules of the form A => B: if a player has a set of attributes A, he is likely to also have attribute B.
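For a rule A => B, the two standard measures are support (how often A and B occur together in the data) and confidence (how often B occurs given A): \[supp(A \Rightarrow B) = \frac{|\{t : A \cup B \subseteq t\}|}{N}, \qquad conf(A \Rightarrow B) = \frac{supp(A \cup B)}{supp(A)}\]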
I want to see how NBA players have changed over the last forty years and what patterns can be discovered in the data! I hope these methods will let me learn more about basketball and see what story the data carries.
I used Python and the Beautiful Soup package to scrape the data from the place where data science meets basketball: Basketball-reference.com, by far the best site for basketball data. I decided to analyze player stats per game from 1980 to 2020, because the 3-point line was first introduced in the 1979-1980 NBA season. Moreover, the further back we go with basketball data, the more missing values there are.
My dataset contains around 21 000 observations and 29 variables:
Player - player’s name
Pos - position
Age - player’s age on February 1 of the season
Tm - team
G - number of played games
GS - number of games this player started
MP - minutes played per game
FG - field goals made per game (how many shots per game the player made)
FGA - field goal attempts per game (how many shots per game the player attempted)
FG. - field goal percentage (FG/FGA)
3P - 3-point field goals per game
3PA - 3-point field goal attempts per game
3P. - 3-point field goal percentage
2P - 2-point field goals per game
2PA - 2-point field goal attempts per game
2P. - 2-point field goal percentage
eFG. - effective field goal percentage; this statistic adjusts for the fact that a 3-point field goal is worth more than a 2-point one
FT - free throws made per game
FTA - free throw attempts per game
FT. - free throw percentage
ORB - offensive rebounds per game
DRB - defensive rebounds per game
TRB - total rebounds per game
AST - assists per game
STL - steals per game
BLK - blocks per game
TOV - turnovers per game (losing the ball during a possession)
PF - personal fouls per game
PTS - points per game
Let’s load the necessary packages:
library(dplyr)
library(ggplot2)
library(VIM)
library(factoextra)
library(psych)
library(corrplot)
library(maptools)
library(ggfortify)
library(cluster)
library(NbClust)
library(clustertend)
library(ggpubr)
Let’s have a look at the data using the VIM package:
data <- read.csv("nba_player_stats.csv") ##load the data
aggr_plot_0 <- aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## X3P. 0.20627926
## FT. 0.07872028
## GS 0.06926467
## X2P. 0.04553383
## FG. 0.04296337
## eFG. 0.04296337
## Age 0.03837327
## G 0.03837327
## MP 0.03837327
## FG 0.03837327
## FGA 0.03837327
## X3P 0.03837327
## X3PA 0.03837327
## X2P 0.03837327
## X2PA 0.03837327
## FT 0.03837327
## FTA 0.03837327
## ORB 0.03837327
## DRB 0.03837327
## TRB 0.03837327
## AST 0.03837327
## STL 0.03837327
## BLK 0.03837327
## TOV 0.03837327
## PF 0.03837327
## PTS 0.03837327
## Player 0.00000000
## Pos 0.00000000
## Tm 0.00000000
The scrape picked up some empty rows, so let’s remove them. The only rows missing a value in the Age column are the fully empty ones.
df <- data %>%
filter(!is.na(Age))
aggr_plot_1 <- aggr(df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## X3P. 0.174606205
## FT. 0.041957041
## GS 0.032124105
## X2P. 0.007446301
## FG. 0.004773270
## eFG. 0.004773270
## Player 0.000000000
## Pos 0.000000000
## Age 0.000000000
## Tm 0.000000000
## G 0.000000000
## MP 0.000000000
## FG 0.000000000
## FGA 0.000000000
## X3P 0.000000000
## X3PA 0.000000000
## X2P 0.000000000
## X2PA 0.000000000
## FT 0.000000000
## FTA 0.000000000
## ORB 0.000000000
## DRB 0.000000000
## TRB 0.000000000
## AST 0.000000000
## STL 0.000000000
## BLK 0.000000000
## TOV 0.000000000
## PF 0.000000000
## PTS 0.000000000
summary(df)
## Player Pos Age Tm
## Length:20950 Length:20950 Min. :18.00 Length:20950
## Class :character Class :character 1st Qu.:24.00 Class :character
## Mode :character Mode :character Median :26.00 Mode :character
## Mean :26.76
## 3rd Qu.:29.00
## Max. :44.00
##
## G GS MP FG
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.:25.00 1st Qu.: 0.00 1st Qu.:11.40 1st Qu.: 1.300
## Median :54.00 Median : 8.00 Median :19.10 Median : 2.500
## Mean :48.85 Mean :23.16 Mean :19.85 Mean : 3.062
## 3rd Qu.:74.00 3rd Qu.:44.00 3rd Qu.:28.20 3rd Qu.: 4.400
## Max. :85.00 Max. :83.00 Max. :44.50 Max. :13.400
## NA's :673
## FGA FG. X3P X3PA
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 3.100 1st Qu.:0.4000 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.600 Median :0.4440 Median :0.1000 Median : 0.500
## Mean : 6.767 Mean :0.4385 Mean :0.4236 Mean : 1.236
## 3rd Qu.: 9.500 3rd Qu.:0.4870 3rd Qu.:0.7000 3rd Qu.: 2.000
## Max. :27.800 Max. :1.0000 Max. :5.1000 Max. :13.200
## NA's :100
## X3P. X2P X2PA X2P.
## Min. :0.000 Min. : 0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:0.129 1st Qu.: 1.000 1st Qu.: 2.40 1st Qu.:0.425
## Median :0.300 Median : 2.050 Median : 4.40 Median :0.470
## Mean :0.256 Mean : 2.638 Mean : 5.53 Mean :0.461
## 3rd Qu.:0.365 3rd Qu.: 3.700 3rd Qu.: 7.80 3rd Qu.:0.507
## Max. :1.000 Max. :13.200 Max. :27.00 Max. :1.000
## NA's :3658 NA's :156
## eFG. FT FTA FT.
## Min. :0.0000 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:0.4350 1st Qu.: 0.500 1st Qu.: 0.800 1st Qu.:0.6610
## Median :0.4780 Median : 1.100 Median : 1.500 Median :0.7500
## Mean :0.4677 Mean : 1.523 Mean : 2.042 Mean :0.7236
## 3rd Qu.:0.5150 3rd Qu.: 2.100 3rd Qu.: 2.800 3rd Qu.:0.8150
## Max. :1.5000 Max. :10.300 Max. :13.100 Max. :1.0000
## NA's :100 NA's :879
## ORB DRB TRB AST
## Min. :0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.400 1st Qu.: 1.200 1st Qu.: 1.600 1st Qu.: 0.600
## Median :0.800 Median : 2.000 Median : 2.800 Median : 1.300
## Mean :1.008 Mean : 2.462 Mean : 3.468 Mean : 1.846
## 3rd Qu.:1.400 3rd Qu.: 3.300 3rd Qu.: 4.700 3rd Qu.: 2.500
## Max. :7.000 Max. :13.000 Max. :18.700 Max. :14.500
##
## STL BLK TOV PF
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.3000 1st Qu.:0.1000 1st Qu.:0.600 1st Qu.:1.300
## Median :0.6000 Median :0.2000 Median :1.100 Median :1.900
## Mean :0.6561 Mean :0.4035 Mean :1.236 Mean :1.908
## 3rd Qu.:0.9000 3rd Qu.:0.5000 3rd Qu.:1.700 3rd Qu.:2.500
## Max. :3.7000 Max. :6.0000 Max. :5.700 Max. :6.000
##
## PTS
## Min. : 0.000
## 1st Qu.: 3.400
## Median : 6.500
## Mean : 8.068
## 3rd Qu.:11.500
## Max. :37.100
##
We can observe that there are a lot of missing values in the X3P. column. This is because a player who did not attempt a single 3-point shot during the season has no 3-point field goal percentage. There are 3658 missing values in total, and removing the affected observations would delete 17.5% of the dataset. Furthermore, these values are not missing at random, so substitution is definitely a better choice than removing the variable.
Replacing the NAs with the mean value doesn’t make much sense: players who never attempted a 3-pointer in their career would suddenly be making 1 out of 4 three-point attempts per game. Instead, I will substitute them with zeros. This drags down the mean of the variable, but it seems to be the best solution here.
A little more than 4% of the free throw percentage values are missing, and it’s the same story as with the 3-point percentage. The last variable with more than 1% missing values is GS, which stands for Games Started. Here it’s also better to substitute the NAs with zeros. Even though some first-team players may end up recorded as starting from the bench, it doesn’t affect the data much.
The remaining variables with NAs are 2-point field goal percentage (X2P.), field goal percentage (FG.) and effective field goal percentage (eFG.). The missing values in FG. and eFG. occur in exactly the same rows and constitute less than 0.5% of all observations, so we may simply remove them. We may also remove the rows with NAs in the X2P. column, because only 56 of them remain afterwards.
df <- df %>%
mutate(X3P. = ifelse(is.na(X3P.), 0, X3P.))
df <- df %>% mutate(FT. = ifelse(is.na(FT.), 0, FT.),
GS = ifelse(is.na(GS), 0, GS))
df <- df %>% filter(complete.cases(.))
summary(df)
## Player Pos Age Tm
## Length:20794 Length:20794 Min. :18.00 Length:20794
## Class :character Class :character 1st Qu.:24.00 Class :character
## Mode :character Mode :character Median :26.00 Mode :character
## Mean :26.77
## 3rd Qu.:29.00
## Max. :44.00
## G GS MP FG
## Min. : 1.00 Min. : 0.00 Min. : 0.70 Min. : 0.000
## 1st Qu.:25.00 1st Qu.: 0.00 1st Qu.:11.60 1st Qu.: 1.300
## Median :54.00 Median : 7.00 Median :19.20 Median : 2.500
## Mean :49.21 Mean :22.58 Mean :19.98 Mean : 3.084
## 3rd Qu.:75.00 3rd Qu.:42.00 3rd Qu.:28.20 3rd Qu.: 4.400
## Max. :85.00 Max. :83.00 Max. :44.50 Max. :13.400
## FGA FG. X3P X3PA
## Min. : 0.200 Min. :0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 3.200 1st Qu.:0.400 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.700 Median :0.444 Median :0.1000 Median : 0.500
## Mean : 6.815 Mean :0.439 Mean :0.4262 Mean : 1.243
## 3rd Qu.: 9.500 3rd Qu.:0.487 3rd Qu.:0.7000 3rd Qu.: 2.000
## Max. :27.800 Max. :1.000 Max. :5.1000 Max. :13.200
## X3P. X2P X2PA X2P.
## Min. :0.000 Min. : 0.000 Min. : 0.100 Min. :0.000
## 1st Qu.:0.000 1st Qu.: 1.100 1st Qu.: 2.400 1st Qu.:0.425
## Median :0.250 Median : 2.100 Median : 4.400 Median :0.470
## Mean :0.212 Mean : 2.657 Mean : 5.571 Mean :0.461
## 3rd Qu.:0.353 3rd Qu.: 3.800 3rd Qu.: 7.800 3rd Qu.:0.507
## Max. :1.000 Max. :13.200 Max. :27.000 Max. :1.000
## eFG. FT FTA FT.
## Min. :0.000 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:0.435 1st Qu.: 0.600 1st Qu.: 0.800 1st Qu.:0.6440
## Median :0.478 Median : 1.100 Median : 1.500 Median :0.7450
## Mean :0.468 Mean : 1.533 Mean : 2.055 Mean :0.6975
## 3rd Qu.:0.515 3rd Qu.: 2.100 3rd Qu.: 2.800 3rd Qu.:0.8120
## Max. :1.000 Max. :10.300 Max. :13.100 Max. :1.0000
## ORB DRB TRB AST
## Min. :0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.400 1st Qu.: 1.200 1st Qu.: 1.700 1st Qu.: 0.600
## Median :0.800 Median : 2.100 Median : 2.800 Median : 1.300
## Mean :1.015 Mean : 2.478 Mean : 3.491 Mean : 1.858
## 3rd Qu.:1.400 3rd Qu.: 3.300 3rd Qu.: 4.700 3rd Qu.: 2.500
## Max. :7.000 Max. :13.000 Max. :18.700 Max. :14.500
## STL BLK TOV PF
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.3000 1st Qu.:0.1000 1st Qu.:0.600 1st Qu.:1.300
## Median :0.6000 Median :0.2000 Median :1.100 Median :1.900
## Mean :0.6605 Mean :0.4064 Mean :1.244 Mean :1.919
## 3rd Qu.:0.9000 3rd Qu.:0.5000 3rd Qu.:1.700 3rd Qu.:2.500
## Max. :3.7000 Max. :6.0000 Max. :5.700 Max. :6.000
## PTS
## Min. : 0.000
## 1st Qu.: 3.500
## Median : 6.600
## Mean : 8.126
## 3rd Qu.:11.600
## Max. :37.100
aggr_plot_2 <- aggr(df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Player 0
## Pos 0
## Age 0
## Tm 0
## G 0
## GS 0
## MP 0
## FG 0
## FGA 0
## FG. 0
## X3P 0
## X3PA 0
## X3P. 0
## X2P 0
## X2PA 0
## X2P. 0
## eFG. 0
## FT 0
## FTA 0
## FT. 0
## ORB 0
## DRB 0
## TRB 0
## AST 0
## STL 0
## BLK 0
## TOV 0
## PF 0
## PTS 0
Our data is almost ready, but we still have to remove the columns containing character values, because the PCA and clustering algorithms work only with numeric data. It’s worth creating a new data frame, as we will need these variables later.
ready_df <- df %>% select(c(-Tm, -Pos, -Player))
Let’s first reduce the number of dimensions and see which variables are useful and which are problematic. We want to center and scale the data: the variables are measured on different scales, so without standardization some of them would impact the results drastically more than others. The following formula is used to scale the data: \[\frac{x_i - mean(x)}{sd(x)}\]
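As a quick sanity check (a sketch using the PTS column of ready_df), this is exactly what scale() does, and what prcomp() will do internally when asked to center and scale:
#standardize one column by hand and compare with scale()
z_manual <- (ready_df$PTS - mean(ready_df$PTS)) / sd(ready_df$PTS)
all.equal(z_manual, as.numeric(scale(ready_df$PTS))) ## TRUE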
ready_df.pcal <- prcomp(ready_df, center = TRUE, scale = TRUE)
ready_df.pcal
## Standard deviations (1, .., p=26):
## [1] 3.561799206 1.885727606 1.561046383 1.147309396 1.009314289 0.935816411
## [7] 0.910441143 0.759759604 0.732612388 0.642501627 0.618182080 0.585415085
## [13] 0.536973188 0.513600264 0.368946454 0.343751603 0.332258308 0.247095855
## [19] 0.130672191 0.118218973 0.104348406 0.095259857 0.015096195 0.013896727
## [25] 0.009448687 0.007090579
##
## Rotation (n x k) = (26 x 26):
## PC1 PC2 PC3 PC4 PC5 PC6
## Age -0.01097663 -0.035546454 0.01893441 -0.28159556 0.822755216 -0.46472679
## G -0.17715283 0.003064152 0.09566874 -0.12373119 0.171271196 0.55971901
## GS -0.22078411 -0.016912450 -0.03230595 -0.11972426 0.029376117 0.15640818
## MP -0.26540505 -0.069528474 -0.01998409 -0.09375760 0.048051023 0.05311946
## FG -0.26864635 -0.058908883 -0.01973744 0.08585258 -0.058425460 -0.12004979
## FGA -0.26190878 -0.123923612 -0.06215291 0.05775409 -0.066511576 -0.11497976
## FG. -0.11182958 0.240254258 0.48445283 0.16687118 0.025415817 -0.06228079
## X3P -0.09522200 -0.395780386 0.16836627 -0.33994321 -0.208060037 -0.13338128
## X3PA -0.09656538 -0.403831860 0.14147653 -0.31708060 -0.210959792 -0.13705921
## X3P. -0.05592784 -0.328628102 0.18544724 -0.17945013 -0.071843182 0.09125819
## X2P -0.26091211 0.053095193 -0.07090421 0.19247092 -0.001730109 -0.09022775
## X2PA -0.25748236 0.020374537 -0.12646944 0.19127869 0.008609941 -0.07564859
## X2P. -0.10059577 0.179548318 0.50934341 0.14331253 -0.015192810 -0.08649943
## eFG. -0.10987305 0.101898971 0.56139700 0.02519108 -0.031795840 -0.07752961
## FT -0.24435126 -0.035091290 -0.09696219 0.17892805 -0.024074051 -0.14339216
## FTA -0.24596114 0.014672209 -0.10983122 0.16389781 -0.056944269 -0.15949958
## FT. -0.10408341 -0.154760194 0.12193760 0.01939703 0.325732902 0.45059268
## ORB -0.17193764 0.340205024 -0.09647138 -0.18796311 -0.052780543 0.02904160
## DRB -0.22133084 0.189387185 -0.05911917 -0.31530015 -0.054999445 -0.04615678
## TRB -0.21536228 0.249960218 -0.07483518 -0.28713789 -0.056891871 -0.02263865
## AST -0.17439935 -0.243541141 -0.02521452 0.24164710 0.184531147 0.07664074
## STL -0.20184963 -0.146796511 -0.03067479 0.11304823 0.092389653 0.15906012
## BLK -0.13249623 0.294497942 -0.06344197 -0.30927260 -0.120391134 0.01266459
## TOV -0.24280020 -0.066962738 -0.09926577 0.17133968 0.028301401 -0.01840197
## PF -0.20528357 0.152555514 -0.03350578 -0.15124416 0.077198494 0.19157970
## PTS -0.26866069 -0.093012157 -0.02031625 0.07146426 -0.070694932 -0.13707429
## PC7 PC8 PC9 PC10 PC11
## Age 0.0203982618 -0.08468745 0.05590760 -0.029997309 -6.060989e-02
## G 0.0975116295 -0.25254532 0.47482504 -0.016809470 -1.541947e-01
## GS 0.2451674357 -0.12245490 0.38618510 -0.162826836 3.616645e-01
## MP 0.1047014692 0.03284434 -0.01131910 0.074564859 -1.405055e-02
## FG -0.0835850482 -0.10580637 0.10351242 0.032818722 -8.761433e-02
## FGA -0.0806953331 -0.06581741 0.09651680 0.053290707 -1.119590e-01
## FG. 0.0393721228 -0.01657724 -0.01122879 -0.004880908 -1.075426e-03
## X3P 0.0002843858 0.20914338 0.12670227 0.045465681 -5.896981e-02
## X3PA 0.0122301480 0.25869615 0.12594992 0.043960123 -4.728158e-02
## X3P. -0.1157952988 -0.69914293 -0.49559640 -0.055498474 1.942082e-02
## X2P -0.0902443806 -0.17481165 0.07447223 0.021991758 -7.672006e-02
## X2PA -0.0960613889 -0.17721063 0.05876967 0.042663224 -1.077308e-01
## X2P. 0.0574419814 0.12880475 0.04862627 -0.007154679 4.664632e-02
## eFG. 0.0226842106 -0.03159506 -0.05208011 -0.007338568 -6.513683e-05
## FT -0.2411542622 0.05543881 0.04512654 -0.105349623 4.660993e-02
## FTA -0.1887955973 0.03992203 0.02540222 -0.094873307 3.336778e-02
## FT. -0.6509761494 0.33640646 -0.14627190 -0.048719126 1.632419e-01
## ORB -0.0383700620 -0.03666841 -0.10320467 0.311598473 1.834831e-01
## DRB -0.0101470282 0.05264768 -0.13079176 0.099601632 3.212110e-01
## TRB -0.0200284283 0.02473147 -0.12757786 0.176566109 2.899322e-01
## AST 0.3903355236 0.15019415 -0.21506369 -0.186274198 2.374682e-01
## STL 0.3824940832 0.14056644 -0.29626509 0.034776039 4.131361e-02
## BLK -0.0188450210 0.07835256 -0.13756015 -0.813966598 -2.332334e-01
## TOV 0.1505650366 0.12558366 -0.18955809 -0.053276732 -5.430948e-02
## PF 0.1171717716 0.17120126 -0.20763440 0.302229813 -6.523564e-01
## PTS -0.1193740441 -0.04532223 0.10138611 0.004765958 -6.107688e-02
## PC12 PC13 PC14 PC15 PC16
## Age -0.024931136 0.064301714 0.007228253 -0.024846628 -0.046951029
## G -0.275761833 0.238291999 -0.376149173 0.041727787 0.004799647
## GS 0.085611876 -0.339456398 0.612671593 -0.038200095 -0.047110343
## MP 0.102147772 -0.060603392 0.034280548 0.017817773 0.011015715
## FG 0.232552871 -0.003248433 -0.077529701 -0.012428025 0.021974345
## FGA 0.242226953 -0.003928528 -0.086908150 -0.004397996 -0.021452374
## FG. -0.011499757 -0.029375538 0.027242264 -0.174392017 0.284124972
## X3P -0.029892305 0.024021771 -0.036255203 -0.169869273 0.074991114
## X3PA -0.009308867 0.047932048 -0.063577202 -0.015225408 -0.155187113
## X3P. -0.113850267 -0.058001557 0.094001852 0.065575977 -0.148277402
## X2P 0.259065029 -0.010753979 -0.072699468 0.036538265 0.002587060
## X2PA 0.277559224 -0.023554022 -0.072685301 0.001082137 0.037299374
## X2P. 0.044142887 0.032662503 -0.012341250 0.423153901 -0.653686853
## eFG. -0.044696752 -0.021796851 0.022300596 -0.246304453 0.373900527
## FT -0.470905885 0.140625242 0.187352142 0.052063083 0.023974659
## FTA -0.499006001 0.155607112 0.166893813 0.030280533 0.009084964
## FT. 0.189692733 -0.072386614 0.063081697 -0.048091950 -0.011911142
## ORB -0.036553205 0.130024943 -0.035667666 -0.629302768 -0.401015948
## DRB -0.014548159 -0.049825814 -0.189652213 0.472366494 0.302289319
## TRB -0.022861842 0.009007682 -0.146515026 0.121744429 0.078320000
## AST -0.088408139 -0.274497540 -0.325283890 -0.119248479 0.002360862
## STL 0.192149302 0.696199555 0.268848586 0.045017407 0.072288828
## BLK 0.132233561 0.077835531 -0.046872098 -0.088173550 -0.058650242
## TOV -0.198332885 -0.297566237 -0.230005443 -0.108597989 -0.154739727
## PF -0.128921256 -0.293947996 0.288181550 0.122071060 0.030497775
## PTS 0.060400328 0.032977716 -0.017980950 -0.014111146 0.029856035
## PC17 PC18 PC19 PC20 PC21
## Age 0.0464189209 0.0294312240 -0.005814938 -0.003857841 -0.008733541
## G 0.0383475299 0.0308946539 -0.011998283 -0.005451421 0.006329323
## GS 0.1110859566 0.1149177149 -0.024459619 -0.030082074 -0.005216374
## MP -0.1956541256 -0.9066309524 0.097451418 0.068325585 -0.012977694
## FG -0.0211967527 0.1355891380 0.343278187 0.105581124 -0.114840024
## FGA -0.0082335044 0.0189060870 -0.445269979 -0.198312878 0.076359807
## FG. 0.0289059494 -0.0065971892 -0.414216705 0.503027052 -0.231960950
## X3P 0.0140493176 0.0898692064 0.151377816 0.396745526 -0.050232178
## X3PA -0.0007583355 0.0528341119 -0.267204817 -0.289611736 -0.036580590
## X3P. -0.0073347680 0.0171681683 -0.040341274 0.053951368 -0.025042163
## X2P -0.0274387053 0.1199221453 0.328726721 0.006456309 -0.108592871
## X2PA -0.0090977829 0.0004969614 -0.397791138 -0.112618378 0.101739512
## X2P. -0.0536522974 0.0055148583 0.055860548 0.016406178 0.032565428
## eFG. 0.0553017720 -0.0344446476 0.256348977 -0.557002386 0.225319624
## FT -0.1131430838 -0.0116809898 -0.052834174 0.189040283 0.643792032
## FTA -0.0855061997 -0.0454078599 -0.017755165 -0.246367623 -0.652840044
## FT. 0.0445695639 0.0317623922 0.003947488 -0.008996662 -0.052066891
## ORB -0.1038332659 0.0533713946 -0.002094248 -0.001813434 0.032563323
## DRB 0.0591555042 0.0759363719 -0.023880503 0.003764678 0.002893477
## TRB 0.0064723587 0.0722439911 -0.018393223 0.003329964 0.012941469
## AST -0.5074368242 0.1931754960 -0.011503073 -0.020457035 -0.002628955
## STL 0.1591894034 0.0821588537 0.005041825 0.008059840 0.005580777
## BLK -0.0411812020 -0.0038116606 -0.012229569 0.006061897 0.013117198
## TOV 0.7670662271 -0.0926003106 0.035521753 0.027788082 0.019788081
## PF -0.1623503993 0.1903780748 0.003591677 -0.016514985 0.007865294
## PTS -0.0409627885 0.1072655280 0.255644005 0.159136822 0.055881686
## PC22 PC23 PC24 PC25 PC26
## Age -0.0009743366 -8.792264e-05 -1.556063e-04 -3.921777e-05 -2.640656e-05
## G 0.0013692409 3.426063e-04 -5.357593e-04 -3.788918e-04 2.282033e-04
## GS 0.0052415902 4.868648e-05 -2.124374e-04 -6.161283e-05 -1.023642e-04
## MP -0.0465196004 -1.267897e-04 -3.073885e-04 -7.143599e-05 8.864286e-05
## FG -0.1408043369 -4.018397e-02 4.193445e-01 -6.648942e-01 1.236655e-04
## FGA 0.1430414288 -3.643081e-03 4.273442e-03 2.022663e-03 -7.242944e-01
## FG. -0.2618256307 8.755595e-04 -1.856335e-03 -4.470153e-04 -2.649511e-05
## X3P 0.5224744073 3.023341e-02 -2.456062e-01 -3.691908e-02 8.945460e-04
## X3PA -0.5505184911 2.441823e-03 -1.496398e-02 -2.461625e-03 2.513845e-01
## X3P. -0.0233954860 8.764703e-05 -8.156167e-04 -1.574820e-04 -1.080143e-04
## X2P -0.2844527040 8.386062e-02 -7.188104e-01 1.212598e-01 -6.724184e-03
## X2PA 0.3737213068 3.449313e-03 -2.059911e-03 -4.777931e-03 6.419243e-01
## X2P. 0.1690500512 -2.235465e-04 2.787143e-03 2.604653e-04 7.574355e-04
## eFG. 0.1265786016 -6.849750e-04 -1.245347e-03 -5.814467e-05 -7.175942e-04
## FT -0.1203030059 1.570995e-02 -1.090061e-01 -1.658676e-01 -2.312652e-03
## FTA 0.1581592885 1.707601e-04 -1.132206e-03 -2.740174e-03 4.717143e-04
## FT. 0.0043629042 -1.759244e-04 2.280596e-05 -1.040878e-04 4.385737e-05
## ORB -0.0041812612 2.648108e-01 3.151845e-02 4.318650e-03 -1.209294e-03
## DRB 0.0095309797 5.535313e-01 6.642349e-02 8.201958e-03 -2.347897e-03
## TRB 0.0057998965 -7.804821e-01 -9.313889e-02 -1.178684e-02 3.498629e-03
## AST 0.0221146644 -2.321055e-05 4.895013e-04 1.220175e-04 3.010996e-05
## STL 0.0063937685 1.359198e-04 1.773733e-04 2.800215e-04 1.556614e-05
## BLK 0.0024539063 -1.494622e-04 4.596989e-05 2.071876e-04 -4.107934e-05
## TOV -0.0117931271 -2.040208e-04 -2.054870e-04 2.419361e-04 8.732850e-05
## PF -0.0027528254 2.062095e-04 -1.626481e-04 -9.216130e-05 -9.330674e-05
## PTS -0.0828069281 -6.697081e-02 4.700156e-01 7.169872e-01 8.040200e-03
The output may seem a little scary at first, but it is just the rotation matrix, whose columns are the Principal Components. The number of components can never exceed the smaller of the number of rows and the number of columns of the dataset; ours has 20794 rows and 26 columns, so we get 26 components and the rotation matrix is 26x26.
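A one-line check on the object created above confirms the shape:
dim(ready_df.pcal$rotation) #one loading vector per variable, one column per component
## [1] 26 26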
Let’s look at the eigen values of Principal Components.
eig.val <- get_eigenvalue(ready_df.pcal)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.268641e+01 4.879390e+01 48.79390
## Dim.2 3.555969e+00 1.367680e+01 62.47070
## Dim.3 2.436866e+00 9.372561e+00 71.84326
## Dim.4 1.316319e+00 5.062765e+00 76.90603
## Dim.5 1.018715e+00 3.918136e+00 80.82416
## Dim.6 8.757524e-01 3.368278e+00 84.19244
## Dim.7 8.289031e-01 3.188089e+00 87.38053
## Dim.8 5.772347e-01 2.220133e+00 89.60066
## Dim.9 5.367209e-01 2.064311e+00 91.66497
## Dim.10 4.128083e-01 1.587724e+00 93.25270
## Dim.11 3.821491e-01 1.469804e+00 94.72250
## Dim.12 3.427108e-01 1.318119e+00 96.04062
## Dim.13 2.883402e-01 1.109001e+00 97.14962
## Dim.14 2.637852e-01 1.014559e+00 98.16418
## Dim.15 1.361215e-01 5.235442e-01 98.68772
## Dim.16 1.181652e-01 4.544814e-01 99.14221
## Dim.17 1.103956e-01 4.245984e-01 99.56680
## Dim.18 6.105636e-02 2.348322e-01 99.80164
## Dim.19 1.707522e-02 6.567393e-02 99.86731
## Dim.20 1.397573e-02 5.375279e-02 99.92106
## Dim.21 1.088859e-02 4.187919e-02 99.96294
## Dim.22 9.074440e-03 3.490169e-02 99.99784
## Dim.23 2.278951e-04 8.765197e-04 99.99872
## Dim.24 1.931190e-04 7.427655e-04 99.99946
## Dim.25 8.927769e-05 3.433757e-04 99.99981
## Dim.26 5.027631e-05 1.933704e-04 100.00000
Eigenvalues correspond to the amount of variance explained by the Principal Components. The first component explains the largest share of the dataset’s variance, and every subsequent one explains correspondingly less. We may see that the first 3 PCs explain about 72% of the variance and the first 5 about 81%.
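Before the scree plot, a minimal numeric check: these eigenvalues are simply the squared standard deviations printed by prcomp, and with standardized data they sum to the number of variables.
eig_manual <- ready_df.pcal$sdev^2
sum(eig_manual) #26, the number of variables
head(cumsum(eig_manual) / sum(eig_manual), 3) #0.488 0.625 0.718, as in the table above
The scree plot makes this even more visible.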
fviz_eig(ready_df.pcal, addlabels = TRUE, ylim = c(0, 50))
There is no common rule for how many PCs should be kept; it is for us to decide. We may look at the scree plot and see where the components start to look similar and insignificant; in this case it would make sense to retain 3, 5 or 7 of them. Another approach is to retain only the PCs with eigenvalues greater than 1: according to the Kaiser rule, an eigenvalue above 1 means the PC explains more variance than a single original (standardized) variable, which in our case would mean keeping 5. We may also run a parallel analysis, which compares the eigenvalues obtained from the PCA with those generated from Monte Carlo simulated data of the same size.
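The parallel-analysis call itself is not shown above; a typical invocation uses the already loaded psych package (the iteration count here is my assumption, not necessarily the original setting):
#compare observed eigenvalues with eigenvalues of simulated data of the same size
fa.parallel(ready_df, fa = "pc", n.iter = 100)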
As we can see, it suggests retaining only 4 of them, but the fifth component’s adjusted eigenvalue is 0.98, which is really close to 1. Let’s keep 5 and see what that brings.
Let’s see how our variables look:
var <- get_pca_var(ready_df.pcal)
#variable correlation plot
fviz_pca_var(ready_df.pcal, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
The graph above shows how well our variables are represented by the first 2 PCs. The longer the arrows (the closer they reach to the circumference of the correlation circle), the better the quality of representation. The colors help with interpretation as well: the warmer the color, the better the representation. And the closer two arrows are to each other, the more strongly the corresponding variables are correlated.
We can observe that points (PTS), field goals per game (FG), field goal attempts (FGA) and 2-point field goals per game (X2P) have the best quality of representation, while age, effective field goal percentage (eFG.) and 2-point field goal percentage are not well represented by the first 2 PCs. Another thing worth noticing is that players shooting from long distance (X3P, X3PA, X3P.) are unlikely to record many rebounds (DRB, ORB) or blocks (BLK), and vice versa. We will see later whether this trend has changed over the years.
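The same information is available numerically through the var object created above; a short check (fviz_cos2 is an optional ranked-bar view from factoextra):
#squared coordinates (cos2) measure quality of representation on each PC
head(round(var$cos2[, 1:2], 3))
#top 10 best represented variables on the first two PCs
fviz_cos2(ready_df.pcal, choice = "var", axes = 1:2, top = 10)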
We may also group the variables:
# Color variables by groups
res.km <- kmeans(var$coord, centers = 4, nstart = 25)
grp <- as.factor(res.km$cluster)
fviz_pca_var(ready_df.pcal, col.var = grp,
palette = c("#E495A5", "#ABB065", "#39BEB1", "#ACA4E2"),
legend.title = "Cluster", repel = TRUE)
Let’s also look at the second and third PCs.
fviz_pca_var(ready_df.pcal, col.var = "cos2",
axes = (c(2,3)),
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
We may see that the second and third dimensions represent well the effectiveness of scoring, in other words how efficient and consistent players are. The negative correlation between shooting threes and defense is even more pronounced here.
Before proceeding further with the analysis, let’s check which variables may be problematic by examining their uniqueness and complexity, with the help of the psych package. We will obtain rotated PCs, which are easier to interpret, and we will also be able to inspect the two features just mentioned.
xxx.pca1<-principal(ready_df, nfactors=5, rotate="varimax")
print(loadings(xxx.pca1), digits=3, cutoff=0.5, sort=TRUE)
##
## Loadings:
## RC1 RC4 RC3 RC2 RC5
## GS 0.616
## MP 0.781
## FG 0.870
## FGA 0.878
## X2P 0.875
## X2PA 0.898
## FT 0.864
## FTA 0.842
## AST 0.768
## STL 0.738
## TOV 0.874
## PTS 0.877
## ORB 0.820
## DRB 0.826
## TRB 0.864
## BLK 0.797
## PF 0.614
## FG. 0.948
## X2P. 0.934
## eFG. 0.942
## X3P 0.945
## X3PA 0.931
## X3P. 0.704
## Age 0.890
## G
## FT.
##
## RC1 RC4 RC3 RC2 RC5
## SS loadings 9.401 4.540 2.982 2.976 1.114
## Proportion Var 0.362 0.175 0.115 0.114 0.043
## Cumulative Var 0.362 0.536 0.651 0.765 0.808
plot(xxx.pca1$complexity)
plot(xxx.pca1$uniquenesses)
We may see that a few variables have high complexity and uniqueness values. A high uniqueness value means that the variable’s variance is not shared with the other variables; it is simply unique. This is an undesirable feature in PCA, because it makes reducing the number of dimensions harder.
High complexity is not desirable either, for it means that a variable loads on many factors, which in turn makes the factor loadings harder to interpret; we will look at those shortly.
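In terms of the loadings of variable j on factor k, uniqueness is one minus the communality, and psych reports Hofmann’s complexity index: \[u_j = 1 - \sum_{k} \lambda_{jk}^2, \qquad C_j = \frac{\left(\sum_{k} \lambda_{jk}^2\right)^2}{\sum_{k} \lambda_{jk}^4}\] A variable loading on exactly one factor has complexity 1; one loading evenly on two factors has complexity 2, and so on.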
plot(xxx.pca1$complexity, xxx.pca1$uniqueness, xlim=c(0, 4))
pointLabel(xxx.pca1$complexity, xxx.pca1$uniqueness, labels=names(xxx.pca1$uniqueness), cex=0.8)
abline(h=c(0.38, 0.75), lty=3, col=2)
abline(v=c(1.8), lty=3, col=2)
It seems that we may remove free throw percentage (FT.), minutes played (MP), games started (GS) and games played (G). I would like to keep personal fouls and 3-point field goal percentage.
final_df <- ready_df %>%
dplyr::select(c(-G, -GS, -MP, -FT.))
final_df.pcal <- prcomp(final_df, center = TRUE, scale = TRUE, rank. = 5)
#Scree plot
fviz_eig(final_df.pcal, addlabels = TRUE, ylim = c(0, 50))
Now our first 5 PCs explain as much as 85% of the variance, but this was to be expected, since the data frame now contains fewer variables.
Let’s have a look now at the contribution of variables to single PCs:
# Contribution of variables to PC1
fviz_contrib(final_df.pcal, choice = "var", axes = 1, top = 15)
# Contribution of variables to PC2
fviz_contrib(final_df.pcal, choice = "var", axes = 2, top = 15)
# Contribution of variables to PC3
fviz_contrib(final_df.pcal, choice = "var", axes = 3, top = 15)
# Contribution of variables to PC4
fviz_contrib(final_df.pcal, choice = "var", axes = 4, top = 15)
# Contribution of variables to PC5
fviz_contrib(final_df.pcal, choice = "var", axes = 5, top = 15)
We may see that a lot of variables contribute to the first PC, with the scoring-related ones leading. 3-point shooting contributes the most to the second PC, and scoring contributes to PC3. The variables in PC4 are also related to the 3-point shot, but to defense (DRB, BLK) as well. Age accounts for almost all of PC5, which is why its eigenvalue was close to 1 when we were choosing the right number of PCs.
Now let’s rotate the PCs, which will help us name the components when looking at the factor loadings.
xxx.pca2<-principal(final_df, nfactors=5, rotate="varimax")
print(loadings(xxx.pca2), digits=3, cutoff=0.5, sort=TRUE)
##
## Loadings:
## RC1 RC4 RC3 RC2 RC5
## FG 0.863
## FGA 0.871
## X2P 0.869
## X2PA 0.892
## FT 0.853
## FTA 0.833
## AST 0.783
## STL 0.747
## TOV 0.877
## PTS 0.869
## ORB 0.830
## DRB 0.832
## TRB 0.872
## BLK 0.804
## PF 0.612
## FG. 0.951
## X2P. 0.938
## eFG. 0.946
## X3P 0.949
## X3PA 0.935
## X3P. 0.711
## Age 0.984
##
## RC1 RC4 RC3 RC2 RC5
## SS loadings 7.995 4.181 2.890 2.781 1.018
## Proportion Var 0.363 0.190 0.131 0.126 0.046
## Cumulative Var 0.363 0.553 0.685 0.811 0.857
I would call these factors the following:
PC1: Ability to score points, in other words how many points a player can score for the team, but also how well he passes the ball and steals it from the opposing team.
PC2: 3-point shooting, i.e. how well the player shoots threes.
PC3: Effectiveness and efficiency: this factor explains how efficient a player is when shooting. One player may score 15 points on 20 field goal attempts (low efficiency), while another scores 15 points on only 5 attempts, and that makes a huge difference for the team.
PC4: Defense: the interpretation is simple here. This factor identifies players who grab a lot of rebounds and block shots. Interestingly, they also tend to commit more fouls, which is quite logical: defending in the post (close to the rim) is more physical than defending on the perimeter (far from the rim), so players under the basket foul more.
PC5: Age, which needs no interpretation.
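To see where individual players land on these dimensions, we can bind the component scores back to the player names (a small illustrative sketch; these are the unrotated prcomp scores):
#scores of the first few players on the five retained components
scores <- as.data.frame(final_df.pcal$x)
head(cbind(df["Player"], round(scores, 2)))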
Let’s see how the players look when plotted in the first 2 dimensions.
autoplot(
final_df.pcal,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Ability to score points") + ylab("3-point shooting")
autoplot(
final_df.pcal,
x = 2,
y = 3,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("3-point shooting") + ylab("Effectiveness and efficiency")
autoplot(
final_df.pcal,
x = 1,
y = 3,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Ability to score points") + ylab("Effectiveness and efficiency")
autoplot(
final_df.pcal,
x = 3,
y = 4,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Effectiveness and efficiency") + ylab("Defense")
autoplot(
final_df.pcal,
x = 4,
y = 5,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Defense") + ylab("Age")
autoplot(
final_df.pcal,
x = 2,
y = 4,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("3-point shooting") + ylab("Defense")
autoplot(
final_df.pcal,
x = 1,
y = 4,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Ability to score points") + ylab("Defense")
We may see that there are a lot of ways to look at NBA players. The first graph shows a clear distinction between all positions in terms of shooting threes. Players of every position may score well or poorly, but the bigger guys (centers, power forwards) tend to shoot fewer threes, while point guards and shooting guards are good at them. Small forwards fall somewhere in the middle.
PCs 2 and 3 show that efficient scorers (high field goal percentage) are slightly more common among centers and forwards than among guards. Some players group together into clearly visible clusters. It’s a valuable insight that the most efficient players are also those who tend to shoot fewer threes; shooting from long distance is generally harder than shooting from close range. It’s fascinating that one clearly visible cluster of players stands out from the rest while scoring mainly from close range. There is also another cluster at the bottom, representing players who are not efficient at all; their role is probably more related to defense or other tasks.
PCs 1 and 3 show that, generally, the more points players score, the lower their efficiency and effectiveness.
Clustering the PCA results from all seasons doesn’t take very long, but extracting any information from the clustering demands a lot of time and computing power. Let’s use clustering differently and see whether we can observe differences in player stats between historic and modern players.
I will perform PCA on 500 players from modern times (the 2020-2021 season) and 500 from historic times (the 1980-1981 and 1981-1982 seasons).
df_1980 <- df %>% dplyr::slice_head(n = 500) #First 500 observations from the dataset
df_2020 <- df %>% dplyr::slice_tail(n = 500) #Last 500 observations
#Remove unnecessary variables (rdf = ready data frame)
rdf_1980 <- df_1980 %>% dplyr::select(c(-Tm, -Pos, -Player, -G, -GS, -MP, -FT.))
rdf_2020 <- df_2020 %>% dplyr::select(c(-Tm, -Pos, -Player, -G, -GS, -MP, -FT.))
#PCA
pca_1980 <- prcomp(rdf_1980, center = TRUE, scale = TRUE, rank. = 5)
pca_2020 <- prcomp(rdf_2020, center = TRUE, scale = TRUE, rank. = 5)
Let’s check whether the data is clusterable using the Hopkins statistic; the smaller the value we obtain, the better.
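Formally, for n sampled points the statistic compares nearest-neighbour distances u_i, measured from uniformly generated points to the data, with nearest-neighbour distances w_i measured within the data. As implemented in the clustertend package it is \[H = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} u_i + \sum_{i=1}^{n} w_i}\] so values near 0.5 suggest random data, while values near 0 suggest real cluster structure.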
# Hopkins statistic for 1980
hopkins(rdf_1980, n=nrow(pca_1980$x)-1)
## $H
## [1] 0.1352564
#Hopkins statistic for 2020
hopkins(rdf_2020, n=nrow(pca_2020$x)-1)
## $H
## [1] 0.1477269
Both values are below 0.15, which indicates that our data is clusterable, though not ideal for clustering.
Let’s check the optimal number of clusters with Silhouette Statistic and Gap Statistic.
fviz_nbclust(pca_1980$x, FUNcluster=kmeans, k.max = 8)
fviz_nbclust(pca_1980$x, FUNcluster=kmeans, method="gap_stat", k.max = 8) + theme_classic()
The optimal number of clusters may be 2 (Silhouette statistic) or 5 (Gap statistic).
fviz_nbclust(pca_2020$x, FUNcluster=kmeans, k.max = 8)
fviz_nbclust(pca_2020$x, FUNcluster=kmeans, method="gap_stat", k.max = 8) + theme_classic()
The optimal number of clusters may be 2 (Silhouette statistic) or 3 (Gap statistic), though 5 doesn’t look bad either.
We may also use the handy NbClust package, which computes many more indices for us.
number <- NbClust(pca_1980$x, distance="euclidean", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 3 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
According to 23 different indices, the optimal number of clusters with the Euclidean distance is 3.
Let’s see whether the Manhattan and Minkowski distances give a different answer.
number <- NbClust(pca_1980$x, distance="manhattan", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
number <- NbClust(pca_1980$x, distance="minkowski", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 3 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
I will go with 3 clusters then.
number <- NbClust(pca_2020$x, distance="euclidean", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 3 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 5 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
number <- NbClust(pca_2020$x, distance="manhattan", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 3 proposed 7 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
number <- NbClust(pca_2020$x, distance="minkowski", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 3 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 5 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
According to all the indices across the 3 different distances, the optimal number is 2.
Let’s plot the results.
autoplot(kmeans(pca_1980$x, 3), data = pca_1980, frame = TRUE)
autoplot(kmeans(pca_2020$x, 2), data = pca_2020, frame = TRUE)
Looking at the plots, in the old NBA we could divide the players into 3 groups:
For the modern era we may describe our clusters in the following way:
The general conclusion is that nowadays more players attempt the 3-point shot and more of them shoot it well. In the old days, right after the introduction of the 3-point line, few players were good 3-point shooters, and even those who tried weren’t the most efficient scorers. Today the players who attempt and make 3-point shots are the most lethal scorers, which is why more and more players shoot from long distance: the way the game is played has changed.
As we can see on both plots, the clusters seem to be strongly influenced by outliers. Let’s try the PAM clustering method and see whether we obtain different results. Partitioning Around Medoids (PAM) is more robust to outliers, but takes longer to compute the clusters. Our single-season data is not that big, so we can afford to run PAM.
fviz_nbclust(pca_1980$x, FUNcluster=cluster::pam, k.max = 7)
fviz_nbclust(pca_1980$x, FUNcluster=cluster::pam, method="gap_stat", k.max = 7)
Let’s try 7 clusters.
fviz_nbclust(pca_2020$x, FUNcluster=cluster::pam, k.max = 7)
fviz_nbclust(pca_2020$x, FUNcluster=cluster::pam, method="gap_stat", k.max = 7)
Let’s try 5 clusters.
autoplot(pam(pca_1980$x, 7), data = df_1980, frame = TRUE) + geom_text(label = df_1980$Player, check_overlap = TRUE)
autoplot(pam(pca_2020$x, 5), data = pca_2020, frame = TRUE) + geom_text(label = df_2020$Player, check_overlap = TRUE)
The players in the right-hand clusters dominated the league in the 80s. The big center Kareem Abdul-Jabbar and forward Julius Erving scored lots of points and made spectacular plays. Kareem was a six-time NBA champion, which is extremely rare for an NBA player, and he holds first place on the NBA all-time scoring list (38387 points). It’s no surprise that he sits on the edge of the scoring cluster.
In the bottom-right cluster we can also see excellent scorers, like Larry Bird, but they are power and small forwards rather than centers. In the blue middle cluster we may see some point guards, for example the absolute superstar Pete Maravich, who was one of the best point guards in NBA history. His role was not solely to score points, but also to distribute the ball and make brilliant plays.
In the left clusters we may observe players who had roles on the court other than scoring: they usually started from the bench and supported the superstars, as did the players in the middle bottom cluster.
What’s worth noting is that the best 3-point shooters in the early 80s averaged fewer than one 3-point attempt per game.
The right blue cluster represents all the superstars of the modern NBA. The closer to the top of the cluster, the better the 3-point shooters. For example, Trae Young, Stephen Curry, James Harden and Damian Lillard are among the best 3-point shooting guards of all time, and they are exceptional playmakers as well, dishing a lot of passes to their teammates. At the bottom of the same cluster we may find players such as Andre Drummond (doesn’t shoot 3s at all, an old-school player who dominates the paint), Anthony Davis (an incredible defender and 2-point scorer) and Nikola Jokic (an incredibly intelligent big man who can score from literally anywhere on the court and whose passes are insane). They are all centers and power forwards (bigger players).
Players such as Tacko Fall or JaVale McGee sit in the middle-left clusters. They can score with high efficiency under the basket, but their main role is defense. The top-left green cluster represents players who are not good at scoring but also have few games played; in other words, they were contracted for a short time and didn’t work out in the NBA (Malik Newman, Stanton Kidd).
Market Basket Analysis is mostly used to analyze customer behavior in order to increase sales. Here, however, I will use it to find rules about basketball players and wrap up my analysis.
#Necessary packages
library(arules)
library(arulesViz)
library(arulesCBA)
In order to perform the Market Basket Analysis, we have to recode the data.
#recode the variables
library(tidyverse)
df2 <- df
df2 <- df2 %>% mutate(Age = ifelse(Age < 27, "Young", ifelse(Age > 26 & Age <34, "Middle Aged", "Veteran")),
FGA = case_when(FGA < 4 ~ "Few FGA", FGA >= 4 & FGA < 14 ~ "Moderate FGA", FGA >=14 & FGA <18 ~ "Lots of FGA", FGA >= 18 ~ "Shooting Machine"),
G = case_when(G < 15 ~ "Few Games", G >= 15 & G < 50 ~ "Moderate Amount of Games", G >= 50 ~ "Playing Machine" ),
MP = case_when(MP < 10 ~ "Very few MP", MP >=10 & MP < 25 ~ "Moderate MP", MP >= 25 ~ "Lots of MP" ),
FG. = case_when(FG. < 0.25 ~ "Low FG", FG. >= 0.25 & FG. < 0.4 ~ "Moderate FG", FG. >=0.4 & FG. <0.5 ~ "Good FG", FG. >= 0.5 & FG. <0.6 ~ "Excellent FG", FG. >= 0.6 ~ "Lethal FG" ),
X3P = case_when(X3P < 0.5 ~ "Doesn't Shoot 3s", X3P >=0.5 & X3P < 3 ~ "Sometimes 3", X3P >= 3 & X3P < 5 ~ "Regular 3-point shooter", X3P >= 5 ~ "3-point Beast"),
X3P. = case_when(X3P. < 0.2 ~ "Tragic 3-point Efficiency", X3P. >= 0.2 & X3P. <0.4 ~ "Fair 3-point Efficiency", X3P. >=0.4 ~ "Lethal 3-point Efficiency" ),
X2P = case_when(X2P < 1 ~ "Doesn't shoot 2s", X2P >=1 & X2P < 4 ~ "Sometimes 2", X2P >= 4 & X2P < 7 ~ "regular 2-point shooter", X2P >= 7 ~ "2-point Beast"),
X2P. = case_when(X2P. < 0.2 ~ "Tragic 2-point Efficiency", X2P. >= 0.2 & X2P. <0.5 ~ "Fair 2-point Efficiency", X2P. >=0.5 & X2P. <0.7 ~ "Good 2-point Efficiency", X2P. >=0.7 ~ "Lethal 2-point Efficiency" ),
FT. = case_when(FT. <0.5 ~ "Tragic FT Efficiency", FT. >=0.5 & FT. < 0.7 ~ "Poor FT Efficiency", FT. >=0.7 & FT. < 0.8 ~ "Good FT", FT. >= 0.8 ~ "Excellent FT Efficiency" ),
TRB = case_when(TRB < 1 ~ "Doesn't Rebound", TRB >=1 & TRB < 4 ~ "Sometimes Rebound", TRB >= 4 & TRB <8 ~ "Good Rebound", TRB >= 8 & TRB < 12 ~ "Very Good Rebound", TRB >= 12 ~ "Rebounding Beast"),
ORB = case_when(ORB < 1 ~ "Bad ORB", ORB >= 1 & ORB < 2.5 ~ "Good ORB", ORB >= 2.5 & ORB < 3.5 ~ "Great ORB", ORB >= 3.5 ~ "ORB Beast"),
DRB = case_when(DRB < 3 ~ "Bad DRB", DRB >= 3 & DRB < 6 ~ "Good DRB", DRB >= 6 & DRB < 8 ~ "Great DRB", DRB >= 8 ~ "DRB Beast"),
AST = case_when(AST < 0.5 ~ "Doesn't Pass", AST >= 0.5 & AST < 1.5 ~ "Sometimes Pass", AST >= 1.5 & AST < 2.5 ~ "Decent Passer", AST >= 2.5 & AST < 3.5 ~ "Good Passer", AST >= 3.5 & AST < 5 ~ "Great Passer", AST >= 5 ~ "Passing Prodigy"),
STL = case_when(STL < 0.3 ~ "Doesn't Steal", STL >= 0.3 & STL < 1 ~ "Sometimes Steal", STL >= 1 & STL < 1.8 ~ "Great Steal", STL >= 1.8 ~ "Incredible Steal"),
BLK = case_when(BLK < 0.2 ~ "Doesn't Block", BLK >= 0.2 & BLK < 0.4 ~ "Sometimes Block", BLK >= 0.4 & BLK < 0.8 ~ "Good Block", BLK >= 0.8 ~ "Great Block" ),
TOV = case_when(TOV < 0.3 ~ "Few TOV", TOV >= 0.3 & TOV < 0.6 ~ "Sometimes TOV", TOV >= 0.6 & TOV < 1.5 ~ "Frequent TOV", TOV >= 1.5 ~ "Lots of TOV"),
PF = case_when(PF < 0.3 ~ "Few PF", PF >= 0.3 & PF < 1 ~ "Sometimes PF", PF >= 1 & PF < 3 ~ "Frequent PF", PF >= 3 ~ "Lots of PF"),
PTS = case_when(PTS < 5 ~ "Few PTS", PTS >=5 & PTS < 10 ~ "Moderate PTS", PTS >= 10 & PTS < 13 ~ "Good PTS", PTS >= 13 & PTS < 17 ~ "Great PTS", PTS >= 17 ~ "PTS Beast")
)
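A quick look at one recoded column serves as a sanity check on the bucketing:
#distribution of the discretized scoring variable
table(df2$PTS)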
Let’s create 2 datasets - old times and modern times.
df2 <- df2 %>% select(c(-Tm, -FT, -eFG., -X3PA, -X2PA, -FG, -GS, -FTA, -Player, -PF, -ORB, -DRB))
df_old <- df2 %>% dplyr::slice_head(n = 2500) #First 2500 observations from the dataset
df_modern <- df2 %>% dplyr::slice_tail(n = 2500) #Last 2500 observations
write.csv(df_old, file="nba_basket_old.csv")
write.csv(df_modern, file="nba_basket_modern.csv")
Now let’s read the data back in as transactions, so we can mine the association rules. I chose 2500 players from modern times and the same number from historic times.
basket_old <- read.transactions("nba_basket_old.csv", format="basket", sep=",", skip=1) # reading the file as transactions
basket_modern <- read.transactions("nba_basket_modern.csv", format="basket", sep=",", skip=1) # reading the file as transactions
We may start by comparing the item frequencies between the baskets, in other words which attributes appear most often.
itemFrequencyPlot(basket_old, topN=37, type="relative", main="Item Frequency for historic times")
itemFrequencyPlot(basket_modern, topN=37, type="relative", main="Item Frequency for modern times")
Although these graphs are simple, they convey important information about how the NBA has changed over 40 years. The first noticeable difference, already highlighted by PCA and clustering, is the number of players who did not shoot 3s in the historic era. What’s interesting is that nowadays fair 3-point efficiency is the most frequent feature among players! Being a good 3-point shooter is no longer something special; it’s something most players can and should do if they want to be good offensive players.
In both periods young players appear more frequently than middle-aged players and veterans. There are relatively few veterans in the league, as playing for a long time is very taxing on athletes’ bodies.
We can also see that there are many more players with excellent free throw efficiency nowadays than in the past. Free throws have become a very important part of the game as well; now that the big guys are learning to shoot from distance, free throws come even easier to them.
For me personally, a very nice discovery is that fewer players commit a lot of turnovers nowadays than in the past, and the difference is clearly visible. There may be many reasons: the way the game is played has changed, players have simply become better, or the rules are now more player friendly.
Let’s proceed with comparing the top scorers of historic times with those of modern times. I am more interested in confidence than in support, because there are not many players who are exceptional scorers. The support metric is computed over the total number of observations, so the support threshold must be low: I will go with a minimum support of 0.025, meaning a rule’s items, “PTS Beast” together with something else, must appear in at least 2.5% of all observations. I am setting the confidence threshold to 0.7.
rules<-apriori(data=basket_old, parameter=list(supp=0.025, conf=0.7), appearance=list(default="lhs", rhs="PTS Beast"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.025 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 62
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[2575 item(s), 2500 transaction(s)] done [0.04s].
## sorting and recoding items ... [63 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.51s].
## writing ... [1395 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules.clean<-rules[!is.redundant(rules)] # drop redundant rules (a more general rule with at least the same confidence exists)
rules.clean<-rules.clean[is.significant(rules.clean, basket_old)] # keep rules that pass a significance test (Fisher's exact test by default)
rules.clean<-rules.clean[is.maximal(rules.clean)] # keep only maximal rules (no superset rule in the set)
rules.bysupp<-sort(rules.clean, by="support", decreasing=TRUE) # sort by support
inspectDT(rules.bysupp) # interactive table of the cleaned rules
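Sorting by support favors the most common patterns; sorting the same cleaned rule set by lift instead would surface the left-hand sides most strongly associated with “PTS Beast” (a sketch using standard arules functions):
rules.bylift <- sort(rules.clean, by="lift", decreasing=TRUE)
inspect(head(rules.bylift, 5)) # top 5 rules by lift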
rules2<-apriori(data=basket_modern, parameter=list(supp=0.025, conf=0.7), appearance=list(default="lhs", rhs="PTS Beast"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.025 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 62
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[2580 item(s), 2500 transaction(s)] done [0.03s].
## sorting and recoding items ... [64 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.37s].
## writing ... [240 rule(s)] done [0.02s].
## creating S4 object ... done [0.00s].
rules2.clean<-rules2[!is.redundant(rules2)]
rules2.clean<-rules2.clean[is.significant(rules2.clean, basket_modern)]
rules2.clean<-rules2.clean[is.maximal(rules2.clean)]
rules2.bysupp<-sort(rules2.clean, by="support", decreasing=TRUE)
inspectDT(rules2.bysupp)
Obviously, players who score a lot of points must have a high FGA (shots attempted) as well. They play a lot of minutes, but they also have to be efficient. And since they have the ball a lot, they naturally commit more turnovers.
In the historic times table, we may see that the rule with “Good Block” has quite high support (0.032) and a confidence of 1. We do not see this in modern times, however. This may indicate that big, athletic players were the great scorers back then. In the modern table we can see more rules relating to the 3-point shot.
We may also look at the general rules with the highest support for both periods.
rules<-apriori(data=basket_old, parameter=list(supp=0.1, conf=0.7))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 250
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2575 item(s), 2500 transaction(s)] done [0.02s].
## sorting and recoding items ... [51 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.06s].
## writing ... [13891 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
rules.clean<-rules[!is.redundant(rules)]
rules.clean<-rules.clean[is.significant(rules.clean, basket_old)]
rules.clean<-rules.clean[is.maximal(rules.clean)]
rules.bysupp<-sort(rules.clean, by="support", decreasing=TRUE)
inspectDT(rules.bysupp)
rules2<-apriori(data=basket_modern, parameter=list(supp=0.1, conf=0.7))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 250
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2580 item(s), 2500 transaction(s)] done [0.02s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.04s].
## writing ... [9351 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
rules2.clean<-rules2[!is.redundant(rules2)]
rules2.clean<-rules2.clean[is.significant(rules2.clean, basket_modern)]
rules2.clean<-rules2.clean[is.maximal(rules2.clean)]
rules2.bysupp<-sort(rules2.clean, by="support", decreasing=TRUE)
inspectDT(rules2.bysupp)
Inspecting the rules with high confidence and high support does not seem very meaningful in this case. The rules we get are rather obvious, such as that players who sometimes shoot 3s and 2s have a moderate FGA (shots attempted).
One may play with an interactive tool such as ruleExplorer and adjust the parameters. I am fully confident that remarkable insights about the game of basketball can be found this way, but the difficulty lies in the low support. Confidence here is not a problem, but given the low support threshold, many rules are found and it’s hard to dig through them.
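For reference, the explorer can be launched directly on a cleaned rule set or on the raw transactions (assuming arulesViz is loaded; this opens an interactive Shiny app):
ruleExplorer(rules2.clean) # browse the cleaned modern-era rules interactively
# ruleExplorer(basket_modern) # or mine and explore straight from the transactions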
Conclusions
Running a PCA on the NBA dataset allowed us to see which components explain the most variance. The main one happens to be the offensive ability to score points, which was to be expected. Looking at the pairs of different components, however, one may find many fascinating insights about the NBA that are not visible at all when looking at big tables of data.
We may see from the visualization of the PCA results that NBA players nowadays are much more diverse. In historic times the variables grouped together on the components, whereas nowadays they are more spread out.
Clustering the PCA results from different seasons also allowed us to see how the league has changed over the years. We can clearly see that nowadays more players shoot 3s and more 3-point attempts are made on average. In the 80s the 3-point shot was something very new; players had to take their time to adapt and master the art of long-distance shooting.
Superstars in the NBA may be considered outliers, which makes the PAM method more effective than k-means in this case. We can see from the clusters that nowadays the players who score the most points are also the best 3-point shooters. In the past it was the opposite: it was the era of big centers who dominated the paint, while point guards supported them and shot from distance only occasionally. In today’s basketball, fewer and fewer players are unable to shoot the 3-pointer; it’s a crucial skill that any player should learn.
Market Basket Analysis has provided us with some knowledge about the players in the NBA, and it definitely expanded my understanding of basketball. However, finding remarkable insights turned out to be a very time-consuming and difficult process. I am sure there are plenty more rules which I was unable to find. I tried to recode the data to the best of my abilities in order to perform MBA, but I am aware that my solution is not perfect.
I will definitely come back to this data at some point and see what I can find and improve.
Comments on PAM results
As partitioning around medoids is robust to outliers, it resulted in more clusters being suggested, and the results are quite satisfying in my opinion. Let’s start with the old times.